Description
This track displays a phylogenetic tree inferred from SARS-CoV-2 genome sequences collected by
GISAID, together with the mutations found in those sequences. By default, only very common
mutations (alternate allele found in at least 1% of samples) are displayed, but other subtracks
can be made visible to show rarer mutations.
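As a rough illustration of the 1% threshold, the sketch below computes an alternate allele frequency from per-sample genotype calls and checks it against a cutoff. The function names, the haploid genotype encoding ("0" for reference, "." for missing) and the choice to exclude missing calls from the denominator are assumptions made for this example; they are not taken from the track's actual build process.

```python
from typing import Iterable


def alt_allele_frequency(genotypes: Iterable[str]) -> float:
    """Fraction of called samples that carry a non-reference allele.

    Assumed encoding: "0" is the reference allele, "1", "2", ... are
    alternate alleles, and "." is a missing call. Missing calls are left
    out of the denominator (an assumption of this sketch).
    """
    called = [g for g in genotypes if g != "."]
    if not called:
        return 0.0
    alt_count = sum(1 for g in called if g != "0")
    return alt_count / len(called)


def is_common(genotypes: Iterable[str], min_freq: float = 0.01) -> bool:
    """True if the alternate allele frequency reaches the cutoff."""
    return alt_allele_frequency(genotypes) >= min_freq
```

With 100 samples, a mutation carried by two of them passes the default cutoff; with 200 samples, a mutation carried by only one does not.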
The phylogenetic tree is inferred by Rob Lanfear's sarscov2phylo pipeline.
For display in the narrow space to the left of the main genome browser image, nodes in the tree
are collapsed unless a mutation is associated with a node; i.e. the only branching points displayed
are those at which mutations occurred.
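The collapsing rule can be sketched on a toy tree structure. This is a minimal illustration, not the strain_phylogenetics implementation: the `Node` class and `collapse` function are hypothetical, and the sketch assumes each node stores the mutations on the branch leading to it.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Node:
    name: str
    mutations: List[str] = field(default_factory=list)  # mutations on the branch above this node
    children: List["Node"] = field(default_factory=list)


def collapse(node: Node) -> Node:
    """Return a copy of the tree without mutation-free internal nodes.

    An internal node whose branch carries no mutation is removed and its
    children are attached to the nearest ancestor that is kept. Leaves
    (samples) are always kept, and the root passed in is never removed.
    """
    new_children: List[Node] = []
    for child in node.children:
        collapsed = collapse(child)
        if collapsed.children and not collapsed.mutations:
            # Internal node with no associated mutation: splice its children upward.
            new_children.extend(collapsed.children)
        else:
            new_children.append(collapsed)
    return Node(node.name, list(node.mutations), new_children)
```

After collapsing, every remaining internal node below the root has at least one associated mutation, which matches the display rule described above.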
Two options are available for coloring the tree: by Pangolin lineage (Rambaut et al.) or by
GISAID clade.
Both coloring schemes are adapted from Figure 1 of Alm et al., which presents a unified view of a
simplified phylogenetic tree, Pangolin lineages, Nextstrain clades and GISAID clades.
color | lineage(s) | Nextstrain clade | GISAID clade |
(swatch) | A | 19B | S |
(swatch) | B.n (n > 1) | 19A | L |
(swatch) | n/a (color not used when coloring by lineage; overlaps on tree with B.4 - B.7) | n/a (overlaps on tree with 19A) | O |
(swatch) | n/a (color not used when coloring by lineage; overlaps on tree with B.2) | n/a (overlaps on tree with 19A) | V |
(swatch) | B.1.5, B.1.6, B.1.8, other B.1.n that overlap GISAID clade G | 20A (partial) | G |
(swatch) | B.1.9, B.1.13, B.1.22, B.1.36, B.1.37 | 20A (partial) | GH (partial) |
(swatch) | B.1.3, B.1.12, B.1.26, other B.1.n that overlap GISAID clade GH | 20C | GH (partial) |
(swatch) | B.1.1 | 20B | GR |
Display Conventions
In "dense" mode, a vertical line is drawn at each position where there is a mutation.
In "pack" mode, the display shows a plot of all samples' mutations, with samples
ordered using the phylogenetic tree in order to highlight patterns of linkage.
Each sample is placed in a horizontal row of pixels; when the number of
samples exceeds the number of vertical pixels for the track, multiple
samples fall in the same pixel row and pixels are averaged across samples.
Each mutation is a vertical bar at its position in the SARS-CoV-2 genome
with white (invisible) representing the reference allele;
the non-reference allele is shown in red if it changes the protein sequence of a gene,
green if it falls within a gene but does not change the protein,
and black if it does not fall within a gene.
Tick marks are drawn at the top and bottom of each mutation's vertical bar
to make the bar more visible when most alleles are reference alleles.
Only single-nucleotide mutations are displayed, not insertions or deletions.
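The color rules above amount to a small decision table. The following sketch restates them as a function; the function name and its boolean parameters are hypothetical and only mirror the wording of this section, not the browser's drawing code.

```python
def mutation_color(is_reference: bool, in_gene: bool, changes_protein: bool) -> str:
    """Map an allele to its display color following the conventions above."""
    if is_reference:
        return "white"  # reference allele: drawn as white, i.e. invisible
    if not in_gene:
        return "black"  # non-reference allele outside any gene
    # Non-reference allele inside a gene: red if the protein changes, green otherwise.
    return "red" if changes_protein else "green"
```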
The phylogenetic tree showing inferred relationships between the samples is depicted
in the left column of the display.
Mousing over the tree shows the GISAID identifiers of the samples.
At the default track height, about 100 samples are averaged into each row of pixels.
The track height can be adjusted in the track controls, which can be reached by
clicking on the gray button to the left of the tree or by right-clicking on the image.
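To make the row averaging concrete, the sketch below groups samples (already in tree order) into consecutive chunks, one chunk per pixel row, and averages a 0/1 alternate-allele flag within each chunk. The function name, the chunking scheme and the use of a plain mean are assumptions for illustration; the browser's actual rendering may differ.

```python
import math
from typing import List


def average_into_rows(alt_flags: List[int], n_rows: int) -> List[float]:
    """Average per-sample flags (1 = alternate allele, 0 = reference) into pixel rows.

    Samples are assumed to be in tree order. Each row receives
    ceil(n_samples / n_rows) consecutive samples, so the last row may
    hold fewer samples than the others.
    """
    if n_rows <= 0:
        raise ValueError("n_rows must be positive")
    if not alt_flags:
        return []
    per_row = math.ceil(len(alt_flags) / n_rows)
    rows = []
    for start in range(0, len(alt_flags), per_row):
        chunk = alt_flags[start:start + per_row]
        rows.append(sum(chunk) / len(chunk))
    return rows
```

For example, `average_into_rows([1, 1, 0, 0, 0, 0, 1, 0], 4)` places two samples in each row and returns `[1.0, 0.0, 0.0, 0.5]`; a row value between 0 and 1 corresponds to a partially colored pixel.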
Methods
Rob Lanfear regularly runs the sarscov2phylo pipeline on all complete, high-coverage sequences
available from GISAID.
The pipeline aligns all sequences to the same reference genome used by the Genome Browser
(RefSeq NC_045512.2, GenBank MN908947.3,
GISAID sample hCoV-19/Wuhan/Hu-1/2019|EPI_ISL_402125|2019-12-31)
using MAFFT (Katoh et al.).
It masks sites identified as problematic by the ProblematicSites_SARS-CoV2 repository
(De Maio et al., Turakhia et al.), as well as sites that are N's or gaps in >50% of samples.
FastTree (Price et al.) is used to infer the phylogenetic tree;
sequences on very long branches are removed using TreeShrink (Mai et al.).
The tree is re-rooted to hCoV-19/Wuhan/WH04/2020|EPI_ISL_406801|2020-01-05.
For full details, see the sarscov2phylo documentation.
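The ">50% N's or gaps" masking criterion mentioned above can be sketched as a column scan over an alignment. This is an illustration only and not the sarscov2phylo code: the function name is hypothetical, the alignment is assumed to be a list of equal-length strings, and positions are reported as 0-based column indices.

```python
from typing import List


def sites_to_mask(alignment: List[str], max_missing_fraction: float = 0.5) -> List[int]:
    """Return 0-based alignment columns where N's or gaps exceed the given fraction."""
    if not alignment:
        return []
    length = len(alignment[0])
    if any(len(seq) != length for seq in alignment):
        raise ValueError("all aligned sequences must have the same length")
    n_samples = len(alignment)
    masked = []
    for col in range(length):
        # Count samples with an ambiguous base (N) or a gap (-) at this column.
        missing = sum(1 for seq in alignment if seq[col] in "Nn-")
        if missing / n_samples > max_missing_fraction:
            masked.append(col)
    return masked
```

Note that the comparison is strict: a column that is missing in exactly half of the samples is kept, matching the ">50%" wording.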
Nodes that do not have an associated mutation are collapsed using strain_phylogenetics
(Turakhia et al.).
Data Access
You can download the VCF files underlying this track (gisaid.*.vcf.gz) from our
Download Server. The data can be explored interactively with the Table Browser
or the Data Integrator.
Note: while the VCF files contain mutations found in sequences collected by GISAID,
they are not sufficient to reconstruct the original sequences available from GISAID,
because ambiguous IUPAC bases are treated as missing information in the VCF and
insertion and deletion mutations are omitted. Additionally, the subtracks that are
filtered to include only mutations found in a minimum percentage of samples give
very incomplete representations of samples. Researchers wishing to work with SARS-CoV-2
genomic sequences should register with GISAID and download the full sequences.
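For readers who want to inspect a downloaded VCF file programmatically, here is a minimal parsing sketch using only the Python standard library. The names `Snv`, `parse_vcf_lines` and `read_vcf` are hypothetical, and the sketch assumes the standard VCF column layout (sample columns starting at the tenth field, genotype as the first colon-separated subfield); the exact contents of the track's files have not been checked against it.

```python
import gzip
from typing import Iterable, Iterator, List, NamedTuple


class Snv(NamedTuple):
    pos: int              # 1-based position on the reference genome
    ref: str              # reference allele
    alts: List[str]       # alternate alleles
    genotypes: List[str]  # one genotype string per sample; "." means missing


def parse_vcf_lines(lines: Iterable[str]) -> Iterator[Snv]:
    """Yield one Snv per data line, skipping header lines and blank lines."""
    for line in lines:
        if line.startswith("#") or not line.strip():
            continue
        fields = line.rstrip("\n").split("\t")
        # Standard VCF: CHROM, POS, ID, REF, ALT, QUAL, FILTER, INFO, FORMAT, samples...
        genotypes = [sample.split(":")[0] for sample in fields[9:]]
        yield Snv(int(fields[1]), fields[3], fields[4].split(","), genotypes)


def read_vcf(path: str) -> List[Snv]:
    """Read a gzip-compressed VCF file from a local path chosen by the caller."""
    with gzip.open(path, "rt") as handle:
        return list(parse_vcf_lines(handle))
```

Because missing genotypes stay as "." and insertions and deletions are absent from the files, records parsed this way describe mutations only and cannot be used to rebuild full sample sequences, as noted above.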
Credits
This work is made possible by the open sharing of genetic data by research
groups from all over the world. We gratefully acknowledge their contributions.
Sequences are collected by GISAID and may be downloaded by registered users.
Special thanks to Rob Lanfear for developing, running and sharing the
sarscov2phylo pipeline and results.
Data usage policy
The data presented here is intended to rapidly disseminate analysis of
important pathogens. Unpublished data is included with permission of the data
generators, and does not impact their right to publish. Please contact the
respective authors if you intend to carry out further research using their data.
Author contact info is available via
https://github.com/roblanf/sarscov2phylo/tree/master/acknowledgements.
References
Rambaut A, Holmes EC, O'Toole Á, Hill V, McCrone JT, Ruis C, du Plessis L, Pybus OG.
A dynamic nomenclature proposal for SARS-CoV-2 lineages to assist genomic epidemiology.
Nat Microbiol. 2020 Nov;5(11):1403-1407.
PMID: 32669681
Alm E, Broberg EK, Connor T, Hodcroft EB, Komissarov AB, Maurer-Stroh S, Melidou A, Neher RA,
O'Toole Á, Pereyaslov D et al.
Geographical and temporal distribution of SARS-CoV-2 clades in the WHO European Region, January to
June 2020.
Euro Surveill. 2020 Aug;25(32).
PMID: 32794443; PMC: PMC7427299
Katoh K, Standley DM.
MAFFT multiple sequence alignment software version 7: improvements in performance and usability.
Mol Biol Evol. 2013 Apr;30(4):772-80.
PMID: 23329690; PMC: PMC3603318
De Maio N, Walker C, Borges R, Weilguny L, Slodkowicz G, Goldman N.
Masking strategies for SARS-CoV-2 alignments.
virological.org. 2020 May 13.
De Maio N, Gozashti L, Turakhia Y, Walker C, Lanfear R, Corbett-Detig R, Goldman N.
Updated analysis with data from 12th June 2020.
virological.org. 2020 July 14.
Turakhia Y, Thornlow B, Gozashti L, Hinrichs AS, Fernandes JD, Haussler D, Corbett-Detig R.
Stability of SARS-CoV-2 Phylogenies.
bioRxiv. 2020 June 9.
Price MN, Dehal PS, Arkin AP.
FastTree 2--approximately maximum-likelihood trees for large alignments.
PLoS One. 2010 Mar 10;5(3):e9490.
PMID: 20224823; PMC: PMC2835736
Mai U, Mirarab S.
TreeShrink: fast and accurate detection of outlier long branches in collections of phylogenetic
trees.
BMC Genomics. 2018 May 8;19(Suppl 5):272.
PMID: 29745847; PMC: PMC5998883