Description
These tracks show evolutionary protein-coding potential as determined by
PhyloCSF [1] to help identify conserved, functional, protein-coding regions of genomes.
PhyloCSF examines evolutionary signatures characteristic of alignments of conserved coding
regions, such as the high frequencies of synonymous codon substitutions and conservative
amino acid substitutions, and the low frequencies of other missense and nonsense substitutions
(CSF = Codon Substitution Frequencies). PhyloCSF provides more information than conservation of
the amino acid sequence, because it distinguishes the different codons that code for the same
amino acid. One of PhyloCSF's main current applications is to help distinguish protein-coding
and non-coding RNAs represented among novel transcript models obtained from high-throughput
transcriptome sequencing. More information on PhyloCSF can be found on the
PhyloCSF wiki.
The Smoothed PhyloCSF track shows the PhyloCSF score for each codon in
each of 6 frames, smoothed using an HMM. Regions in which most codons have score greater than
0 are likely to be protein-coding in that frame. No score is shown when the relative branch
length is less than 0.1 (see PhyloCSF Power).
The PhyloCSF Power track shows the branch length score at each codon,
i.e., the ratio of the branch length of the species present in the local alignment to the
total branch length of all species in the full genome alignment. It is an indication of the
statistical power available to PhyloCSF. Codons with branch length score less than 0.1 have
been excluded altogether (from all tracks) because PhyloCSF does not have sufficient power to
get a meaningful score at these codons. Codons with branch length score greater than 0.1 but
much less than 1 should be considered less certain.
Caveats
- Around 10% of annotated protein-coding regions in human get scores less than 0.
This can happen for various reasons. For example, the region could be coding in the reference
species but not in other species, or the alignment does not represent a true orthology
relationship between the species.
- Protein-coding regions will often have positive score on the reverse strand
in the frame in which the third codon positions match up (the "antisense" frame), though the
score will usually be higher on the correct strand.
Methods
Tracks were constructed as described in
Mudge et al. 2019 and Jungreis et al. 2020. In brief, PhyloCSF was run with the "fixed" strategy
on every codon in every frame
on each strand in the wuhCor1/SARS-CoV-2 assembly using an alignment of 44 Sarbecovirus genomes,
using the PhyloCSF parameters for 29mammals with the tree replaced with a tree of the 44
Sarbecovirus genomes.
The scores were smoothed using a Hidden Markov Model (HMM) with 4 states,
one representing coding regions and three representing non-coding regions. The emission of each
codon is its PhyloCSF score. The ratio of the emissions probabilities for the coding and
non-coding models are computed from the PhyloCSF score, since it represents the log-likelihood
ratio of the alignment under the coding and non-coding models. The three non-coding states have
the same emissions probabilities but different transition probabilities (they can only transition
to coding) to better capture the multimodal distribution of gaps between same-frame coding exons.
These transition probabilities represent the best approximation of this gap distribution as a
mixture model of three exponential distributions, computed using Expectation Maximization. The
HMM defines a probability that each codon is coding, based on the PhyloCSF scores of that codon
and nearby codons on the same strand in the same frame, without taking into account start codons,
stop codons, or potential splice sites. PhyloCSF+1 shows the log-odds that codons in frame 1
(sometimes called frame 0) on the '+' strand are in the coding state according to the HMM, and
similarly for strand '-' and frames 2 and 3.
Data Access
The raw bigWig data can be explored interactively with the
Table Browser, combined with other datasets in the
Data Integrator tool, or downloaded directly from
the download server.
Please refer to our
mailing list archives
for questions, or our
Data Access FAQ
for more information.
Credits and Citations
Questions about the algorithm itself should be directed to
Irwin Jungreis.
If you use the PhyloCSF browser tracks, please cite Mudge et al. 2019 and
Jungreis et al. 2020.
References
Lin MF, Jungreis I, Kellis M.
PhyloCSF: a comparative genomics method to distinguish protein coding and non-coding regions.
Bioinformatics. 2011 Jul 1;27(13):i275-82.
PMID: 21685081; PMC: PMC3117341
Mudge JM, Jungreis I, Hunt T, Gonzalez JM, Wright JC, Kay M, Davidson C, Fitzgerald S, Seal R,
Tweedie S et al.
Discovery of high-confidence human protein-coding genes and exons by whole-genome PhyloCSF helps
elucidate 118 GWAS loci.
Genome Res. 2019 Dec;29(12):2073-2087.
PMID: 31537640; PMC: PMC6886504
Jungreis I, Sealfon R, Kellis M.
Sarbecovirus comparative genomics elucidates gene content of SARS-CoV-2 and functional impact of
COVID-19 pandemic mutations.
bioRxiv. 2020 Jun 3;.
PMID: 32577641; PMC: PMC7302193
|