Covid-19 is caused by SARS-CoV-2, an RNA virus with a genome of roughly 30 kilobases. Its genome was first sequenced in China in early January 2020, and since then the virus has been isolated and sequenced from patients of all age groups all over the world. Nextstrain currently has over 3500 genome submissions of this virus, categorized into 10 known clades, or variants, based on their genomic differences from each other. This viz showcases the prevalence of the different clades in India, arranged in the context of the infection timeline, including the distinct cluster A3i identified recently (Banu et al., 2020).
To summarize the large amount of data, I used a packed bubble chart, which makes efficient use of space. The size of each closely packed circle represents a numeric value (the number of viral sequences), while its color identifies the clade. The data is split by the gender and age group of the individual from whom the viral sample was isolated.
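For anyone who wants to prepare similar data before charting it, here is a minimal sketch of the tabulation behind such a chart: one count per gender, age group, and clade, where each count would size one bubble. The records, the age-group labels, and the `bubble_sizes` helper are all hypothetical and for illustration only; they are not the actual dataset or the workflow used for this viz.

```python
from collections import Counter

# Hypothetical records, one per sequenced genome: (clade, gender, age_group).
# These values are made up for illustration; they are not the real data.
records = [
    ("A2a", "Female", "21-40"),
    ("A2a", "Male", "21-40"),
    ("A3i", "Male", "41-60"),
    ("A2a", "Male", "21-40"),
    ("A3i", "Female", "61+"),
]


def bubble_sizes(rows):
    """Count genomes per (gender, age_group, clade).

    Each count would set the size of one bubble; the clade would set its color.
    """
    return Counter((gender, age, clade) for clade, gender, age in rows)


sizes = bubble_sizes(records)
for (gender, age, clade), n in sorted(sizes.items()):
    print(f"{gender:<6} {age:<6} {clade}: {n}")
```

The resulting table can then be loaded into whichever charting tool you prefer, with the count mapped to circle size and the clade mapped to color.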
Based on analysis by the Centre for Cellular and Molecular Biology’s Bioinformatics Centre in Hyderabad, the graphic represents over 1500 viral genomes that were successfully isolated and sequenced from Covid-19 patients in over 20 Indian states from March to June. A2a and A3i continue to be the dominant clades in India, although A2a, the globally predominant variant, is increasingly being observed in the Indian isolates. This viz and the details of the SARS-CoV-2 genome variants are hosted at the Genome Evolution Analysis Resource for COVID-19, which is also the source of the data.
Many thanks to Tableau user Michael Petrey for his suggestion to use a colorblind-friendly palette. The earlier viz I made used red and green for the two most prevalent clades, which can be hard to tell apart for viewers with red-green color blindness. The Data Visualization Society Slack channel is a great resource for getting feedback.
This viz helped me highlight the utility of viral genome sequencing in a recent article on “Testing and surveillance strategies in the context of COVID-19 in India” published in October 2020.
Do post your suggestions and comments on this thread!
Over 1500 SARS-COV-2 genomes now sequenced in India!
Clades A2a, with a prevalence similar to that seen globally, and A3i, the unique variant identified in https://t.co/zoTiBf0nVF continue to be the dominant clades in India. (1/2) pic.twitter.com/IIdhxcChyf
— CCMB (@ccmb_csir) July 23, 2020