Note: Now updated daily
Description
Nextstrain.org displays
data about mutations in the SARS-CoV-2 RNA and protein sequences that have
occurred in different samples of the virus during the current 2019-2021 outbreak.
Nextstrain has a powerful user interface for viewing the evolutionary tree
that it infers from the patterns of mutations in sequences worldwide, but it
does not offer a detailed plot of mutations along the genome
that can be correlated with other molecular information.
We have therefore processed Nextstrain's data into this track, which displays the mutations
called by Nextstrain for each sample that it has obtained from
GISAID.
Click a vertical column in the display at any position in the
SARS-CoV-2 genome to see more details about the mutation(s) that occur
at that position, including the protein change (if applicable; protein
changes use the gene names in the Nextstrain Genes track), the number of
samples with the mutation, and a list giving the nucleotide (allele) at that
position in each GISAID sample.
Nextstrain identifies certain clades within the phylogenetic tree
according to
a set of defining mutations.
The Nextstrain Clades
track provides more information about these clades
and serves as a useful color key for the clade colors in the phylogenetic tree display.
This track is composed of several subtracks so that different subsets of mutations may be viewed:
- Recurrent Bi-allelic: This is the only subtrack displayed by default.
It is limited to mutations that have been observed in at least two samples, and
excludes positions at which more than one alternate allele has been observed in
more than one sample.
- All: All mutations found in all samples.
- <Clade> Mutations: All mutations found in samples belonging to
<Clade>, which is one of Nextstrain's clades (19A, 19B, 20A, etc.).
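The "Recurrent Bi-allelic" criterion can be made concrete with a minimal Python sketch. It implements one reading of the description above: a position is kept when exactly one alternate allele is seen in at least two samples. The function name, its arguments, and the `min_samples` parameter are hypothetical and are not part of the track's actual build pipeline.

```python
from collections import Counter


def is_recurrent_biallelic(ref, alleles, min_samples=2):
    """Return True if a position passes one reading of the filter.

    ref     -- reference base at the position
    alleles -- one base per sample at that position
    Hypothetical helper for illustration only.
    """
    # Count how many samples carry each non-reference allele.
    alt_counts = Counter(a for a in alleles if a != ref)
    # Alternate alleles observed in at least `min_samples` samples.
    recurrent_alts = [a for a, n in alt_counts.items() if n >= min_samples]
    # Keep the position only if exactly one alternate allele is recurrent.
    return len(recurrent_alts) == 1
```

Under this reading, a position where T appears in three samples and G in a single sample is kept, because only one alternate allele recurs; a position where both T and G appear in two or more samples is excluded.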
Display Conventions
In "dense" mode, a vertical line is drawn at each position where there is a mutation.
In "pack" mode, the display shows a plot of all samples' mutations, with samples
ordered using Nextstrain's phylogenetic tree in order to highlight patterns of linkage.
Each sample is placed in a horizontal row of pixels; when the number of
samples exceeds the number of vertical pixels for the track, multiple
samples fall in the same pixel row and pixels are averaged across samples.
Each mutation is a vertical bar at its position in the SARS-CoV-2 genome
with white (invisible) representing the reference allele
and black representing the non-reference allele(s).
Tick marks are drawn at the top and bottom of each mutation's vertical bar
to make the bar more visible when most alleles are reference alleles.
Insertions and deletions are not shown as these are removed from the data
by Nextstrain.
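As a rough illustration of how many samples can share one pixel row in "pack" mode, the following Python sketch averages per-sample values at a single genome position into a fixed number of rows. It assumes 0 stands for the reference allele (white) and 1 for a non-reference allele (black); the function name and the even split of samples across rows are assumptions, and the browser's actual rendering code may differ.

```python
def average_into_rows(sample_values, n_rows):
    """Collapse per-sample 0/1 values (in tree order) into row gray levels.

    Returns one value in [0, 1] per pixel row: 0.0 is all-reference (white),
    1.0 is all non-reference (black). Hypothetical helper for illustration.
    """
    n = len(sample_values)
    if n == 0 or n_rows <= 0:
        return []
    # Never use more rows than there are samples.
    n_rows = min(n_rows, n)
    rows = []
    for r in range(n_rows):
        # Split samples as evenly as possible across rows.
        start = r * n // n_rows
        end = (r + 1) * n // n_rows
        chunk = sample_values[start:end]
        rows.append(sum(chunk) / len(chunk))
    return rows
```

For example, six samples with values `[0, 0, 1, 1, 1, 0]` drawn in three rows give `[0.0, 1.0, 0.5]`: the last row mixes one mutant and one reference sample, so it would appear gray.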
The phylogenetic tree for the samples built by Nextstrain is depicted
in the left column of the display.
Hovering the mouse over the tree shows the GISAID identifiers of the samples.
When the vertical height of the track is set sufficiently high
(10 pixels per sample with the default font),
sample names are drawn to the right of the tree; however, with thousands of
samples in the Nextstrain tree, and a maximum track height of 2500 pixels,
the full Nextstrain tree is too large for sample names to be displayed.
In the track controls, the user can choose to display subtracks containing
the phylogenetic trees and mutations for individual clades.
Some clades have few enough samples that they can be made tall enough to
display sample names.
Branches of the phylogenetic tree are colored by clade using the same
color scheme as
nextstrain.org.
Methods
Nextstrain downloads SARS-CoV-2 genomes from
GISAID
as they are submitted by labs worldwide, and downsamples to a subset of several thousand
sequences in order to provide an interactive display.
The selected subset of GISAID sequences is processed by an
automated pipeline,
producing an annotated phylogenetic tree data structure underlying the Nextstrain display;
UCSC downloads the results and extracts annotations for display.
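To give a sense of what extracting annotations from an annotated tree can involve, here is a minimal Python sketch that walks a nested tree and accumulates nucleotide mutations from the root down to each sample. The node layout (`name`, `children`, and `branch_attrs.mutations.nuc` holding strings such as "C241T") is an assumption modeled on Nextstrain's JSON export, the root is assumed to carry the reference sequence, and the function is hypothetical rather than UCSC's actual extraction code.

```python
import re

# A mutation string such as "C241T": old base, 1-based position, new base.
MUTATION = re.compile(r"^([ACGTN-])(\d+)([ACGTN-])$")


def collect_sample_mutations(node, inherited=None, out=None):
    """Map each leaf (sample) name to its mutations relative to the root."""
    inherited = dict(inherited or {})  # copy so siblings do not interfere
    out = {} if out is None else out
    nuc = node.get("branch_attrs", {}).get("mutations", {}).get("nuc", [])
    for m in nuc:
        match = MUTATION.match(m)
        if not match:
            continue
        old, pos, new = match.group(1), int(match.group(2)), match.group(3)
        # Keep the base first seen at this position as the reference base.
        ref = inherited[pos][0] if pos in inherited else old
        if new == ref:
            inherited.pop(pos, None)  # reversion back to the reference
        else:
            inherited[pos] = (ref, new)
    children = node.get("children", [])
    if not children:
        out[node["name"]] = [f"{r}{p}{a}" for p, (r, a) in sorted(inherited.items())]
    for child in children:
        collect_sample_mutations(child, inherited, out)
    return out


# Toy tree with made-up sample names.
toy_tree = {
    "name": "root",
    "branch_attrs": {"mutations": {"nuc": ["C241T"]}},
    "children": [
        {"name": "sample1", "branch_attrs": {"mutations": {"nuc": ["A23403G"]}}},
        {
            "name": "internal",
            "branch_attrs": {"mutations": {"nuc": ["T241C"]}},
            "children": [
                {"name": "sample2", "branch_attrs": {"mutations": {"nuc": ["G11083T"]}}}
            ],
        },
    ],
}
per_sample = collect_sample_mutations(toy_tree)
```

In the toy tree, sample2 sits below a branch that reverts position 241, so it ends up with only G11083T, while sample1 carries both C241T and A23403G.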
Data Access
SARS-CoV-2 mutations displayed by Nextstrain are derived from a subset of
GISAID sequences, and the GISAID
Terms and Conditions
prohibit the redistribution of GISAID-derived data. They also require that the submitters of all
sequences be acknowledged when the mutations are used.
Nextstrain.org offers
phylogenetic trees, author credits and other files:
scroll to the bottom of the page and click "DOWNLOAD DATA", and a dialog with
download options appears.
All GISAID SARS-CoV-2 genome sequences and metadata are available for download from
GISAID EpiCoV™ by registered users.
We provide a program, faToVcf, that extracts VCF from a multi-sequence FASTA alignment such as the
"msa_date"
download file from GISAID. faToVcf is available for Linux and macOS on the download server:
https://hgdownload.soe.ucsc.edu/admin/exe.
It requires at least 4 GB of memory to process the complete msa_date file.
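The idea behind converting an alignment to variant calls can be illustrated with a short Python sketch: each aligned sequence is compared with a chosen reference, column by column, and columns where some sample differs are reported with VCF-style genotype codes. This is not faToVcf's implementation and does not write a real VCF file; the function names, the record layout, and the handling of ambiguous bases are assumptions made for illustration.

```python
def parse_fasta(text):
    """Parse FASTA text into {name: sequence}, upper-casing the bases."""
    records = {}
    name = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            name = line[1:]
            records[name] = []
        elif name is not None:
            records[name].append(line.upper())
    return {n: "".join(parts) for n, parts in records.items()}


def alignment_to_variants(seqs, ref_name):
    """Return (sample_names, variants) for equal-length aligned sequences.

    Each variant is (ref_position, ref_base, alt_bases, genotypes), where
    genotypes hold "0" (reference), "1", "2", ... (alternate index), or
    "." (gap or ambiguous base), one entry per sample.
    """
    ref = seqs[ref_name]
    samples = [n for n in seqs if n != ref_name]
    variants = []
    ref_pos = 0
    for i, ref_base in enumerate(ref):
        if ref_base == "-":
            continue  # alignment column with no reference base
        ref_pos += 1
        if ref_base not in "ACGT":
            continue
        alts = []
        for n in samples:
            b = seqs[n][i]
            if b in "ACGT" and b != ref_base and b not in alts:
                alts.append(b)
        if not alts:
            continue
        genotypes = []
        for n in samples:
            b = seqs[n][i]
            if b == ref_base:
                genotypes.append("0")
            elif b in alts:
                genotypes.append(str(alts.index(b) + 1))
            else:
                genotypes.append(".")
        variants.append((ref_pos, ref_base, alts, genotypes))
    return samples, variants


# Toy alignment with made-up names; real GISAID files are far larger.
toy = ">ref\nACGTAC\n>s1\nACGTAC\n>s2\nATGTAC\n>s3\nATGNAG\n"
samples, variants = alignment_to_variants(parse_fasta(toy), "ref")
```

In the toy alignment, positions 2 and 6 are reported, while position 4 is skipped because the only difference there is an ambiguous N.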
Here are some steps to get started using faToVcf:
Credits
This work is made possible by the open sharing of genetic data by research
groups from all over the world. We gratefully acknowledge their contributions.
Special thanks to
nextstrain.org for
sharing its analysis of genomes collected by
GISAID.
Data usage policy
The data presented here is intended to rapidly disseminate analysis of
important pathogens. Unpublished data is included with permission of the data
generators, and does not impact their right to publish. Please contact the
respective authors
if you intend to carry out further research using their data.
Author contact info is available via
nextstrain.org:
scroll to the bottom of the page, click "DOWNLOAD DATA" and click
"ALL METADATA (TSV)" in the resulting dialog.
References
Hadfield J, Megill C, Bell SM, Huddleston J, Potter B, Callender C, Sagulenko P, Bedford T, Neher RA.
Nextstrain: real-time tracking of pathogen evolution.
Bioinformatics. 2018 Dec 1;34(23):4121-4123.
PMID: 29790939; PMC: PMC6247931
Sagulenko P, Puller V, Neher RA.
TreeTime: Maximum-likelihood phylodynamic analysis.
Virus Evol. 2018 Jan;4(1):vex042.
PMID: 29340210; PMC: PMC5758920
Nguyen LT, Schmidt HA, von Haeseler A, Minh BQ.
IQ-TREE: a fast and effective stochastic algorithm for estimating maximum-likelihood phylogenies.
Mol Biol Evol. 2015 Jan;32(1):268-74.
PMID: 25371430; PMC: PMC4271533