PhyloCSF Tracks

Assembly: SARS-CoV-2 Jan. 2020 (NC_045512.2)

Description

These tracks show evolutionary protein-coding potential as determined by PhyloCSF [1] to help identify conserved, functional, protein-coding regions of genomes. PhyloCSF examines evolutionary signatures characteristic of alignments of conserved coding regions, such as the high frequencies of synonymous codon substitutions and conservative amino acid substitutions, and the low frequencies of other missense and nonsense substitutions (CSF = Codon Substitution Frequencies). PhyloCSF provides more information than conservation of the amino acid sequence, because it distinguishes the different codons that code for the same amino acid. One of PhyloCSF's main current applications is to help distinguish protein-coding and non-coding RNAs represented among novel transcript models obtained from high-throughput transcriptome sequencing. More information on PhyloCSF can be found on the PhyloCSF wiki.

The Smoothed PhyloCSF track shows the PhyloCSF score for each codon in each of 6 frames, smoothed using an HMM. Regions in which most codons have a score greater than 0 are likely to be protein-coding in that frame. No score is shown for codons whose branch length score is less than 0.1 (see PhyloCSF Power).
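As a rough illustration of the "most codons have a score greater than 0" rule of thumb, the sketch below computes the fraction of positive-scoring codons in a region for one frame. The function name and the use of `None` for codons with no score are assumptions made for this example, not part of the track's data format.

```python
def fraction_positive(scores):
    """Fraction of scored codons in a region whose smoothed score is > 0.

    `scores` holds one value per codon in a single frame; None marks a
    codon for which no score is shown (hypothetical representation).
    Returns None if the region has no scored codons at all.
    """
    scored = [s for s in scores if s is not None]
    if not scored:
        return None
    return sum(1 for s in scored if s > 0) / len(scored)


# Example: three scored codons, two of them positive.
region = [4.2, 1.7, -3.0, None]
print(fraction_positive(region))
```

A region where this fraction is well above one half would fit the description of "likely to be protein-coding in that frame"; any specific cutoff is a judgment call rather than something the track defines.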

The PhyloCSF Power track shows the branch length score at each codon, i.e., the ratio of the branch length of the species present in the local alignment to the total branch length of all species in the full genome alignment. It is an indication of the statistical power available to PhyloCSF. Codons with branch length score less than 0.1 have been excluded altogether (from all tracks) because PhyloCSF does not have sufficient power to get a meaningful score at these codons. Codons with branch length score greater than 0.1 but much less than 1 should be considered less certain.
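The ratio and the 0.1 cutoff described above can be sketched as follows. This is a minimal illustration that takes the two branch lengths as plain numbers; computing them from an actual phylogenetic tree is not shown, and the function and constant names are hypothetical.

```python
MIN_BRANCH_LENGTH_SCORE = 0.1  # cutoff stated in the track description


def branch_length_score(present_branch_length, total_branch_length):
    """Ratio of the branch length of the species present in the local
    alignment to the total branch length of all species in the full
    alignment."""
    if total_branch_length <= 0:
        raise ValueError("total branch length must be positive")
    return present_branch_length / total_branch_length


def is_scored(score):
    """Codons with a branch length score below the cutoff are excluded."""
    return score >= MIN_BRANCH_LENGTH_SCORE


# Example with made-up branch lengths (substitutions per site).
score = branch_length_score(2.5, 10.0)
print(score, is_scored(score))
```

A codon that passes the cutoff but has a score much less than 1 would still be kept here; as the text notes, its PhyloCSF score should be treated as less certain.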

Caveats

Around 10% of annotated protein-coding regions in human get scores less than 0. This can happen for various reasons. For example, the region could be coding in the reference species but not in other species, or the alignment does not represent a true orthology relationship between the species.
Protein-coding regions will often have a positive score on the reverse strand in the frame in which the third codon positions match up (the "antisense" frame), though the score will usually be higher on the correct strand.

Methods

Tracks were constructed as described in Mudge et al. 2019 and Jungreis et al. 2020. In brief, PhyloCSF was run with the "fixed" strategy on every codon in every frame on each strand of the wuhCor1/SARS-CoV-2 assembly, using an alignment of 44 Sarbecovirus genomes. The PhyloCSF parameters were those of the "29mammals" parameter set, with its tree replaced by a tree of the 44 Sarbecovirus genomes.
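To make "every codon in every frame on each strand" concrete, the sketch below enumerates the coordinates of all complete codons in the six frames of a linear genome. It does not run PhyloCSF. The frame numbering on the '-' strand (frame 1 starting at the last base of the genome) is an assumption chosen for illustration and may not match the convention used to build the tracks.

```python
def codons_in_frame(genome_length, strand, frame):
    """Yield (start, end) for each complete codon, as 1-based inclusive
    forward-strand coordinates.

    frame is 1, 2, or 3. On '+', frame f begins at position f. On '-',
    frame f is assumed to begin at position genome_length - f + 1 and is
    read toward position 1 (an illustrative convention).
    """
    if strand not in ("+", "-"):
        raise ValueError("strand must be '+' or '-'")
    if frame not in (1, 2, 3):
        raise ValueError("frame must be 1, 2, or 3")
    offset = frame - 1
    n_codons = (genome_length - offset) // 3
    for i in range(n_codons):
        if strand == "+":
            start = 1 + offset + 3 * i
        else:
            start = genome_length - offset - 3 * i - 2
        yield (start, start + 2)


# NC_045512.2 is 29,903 nt long (from the NCBI record, not from this page).
GENOME_LENGTH = 29903
total = sum(
    1
    for strand in "+-"
    for frame in (1, 2, 3)
    for _ in codons_in_frame(GENOME_LENGTH, strand, frame)
)
print(total)  # number of codons that would each receive a PhyloCSF score
```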

The scores were smoothed using a Hidden Markov Model (HMM) with 4 states, one representing coding regions and three representing non-coding regions. The emission of each codon is its PhyloCSF score. The ratio of the emission probabilities under the coding and non-coding models is computed from the PhyloCSF score, since the score represents the log-likelihood ratio of the alignment under the coding and non-coding models. The three non-coding states have the same emission probabilities but different transition probabilities (they can only transition to coding) to better capture the multimodal distribution of gaps between same-frame coding exons. These transition probabilities represent the best approximation of this gap distribution as a mixture model of three exponential distributions, computed using Expectation Maximization. The HMM defines a probability that each codon is coding, based on the PhyloCSF scores of that codon and nearby codons on the same strand in the same frame, without taking into account start codons, stop codons, or potential splice sites. PhyloCSF+1 shows the log-odds that codons in frame 1 (sometimes called frame 0) on the '+' strand are in the coding state according to the HMM; the other five tracks show the same quantity for frames 2 and 3 and for the '-' strand.
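The smoothing step can be sketched with a small forward-backward computation over a 4-state HMM for a single frame. This is an illustrative sketch, not the code used to build the tracks: all transition probabilities, mixture weights, and the prior below are made-up placeholders (the real ones were fit by Expectation Maximization), the score is assumed to be in decibans (10 times the base-10 log-likelihood ratio), and the output is reported as base-10 log-odds, which may differ from the units of the actual tracks.

```python
import math


def coding_log_odds(scores, p_stay_coding=0.99, weights=(0.5, 0.3, 0.2),
                    p_stay_noncoding=(0.9, 0.99, 0.999), prior_coding=0.5):
    """Posterior log10-odds that each codon is in the coding state.

    State 0 is coding; states 1-3 are non-coding states that share
    emission probabilities but differ in how long they tend to persist.
    All default parameter values are hypothetical.
    """
    if not scores:
        return []
    # Transitions: coding stays or enters one non-coding state; each
    # non-coding state either stays or returns to coding.
    trans = [[0.0] * 4 for _ in range(4)]
    trans[0][0] = p_stay_coding
    for k in range(3):
        trans[0][k + 1] = (1.0 - p_stay_coding) * weights[k]
        trans[k + 1][k + 1] = p_stay_noncoding[k]
        trans[k + 1][0] = 1.0 - p_stay_noncoding[k]
    init = [prior_coding] + [(1.0 - prior_coding) * w for w in weights]

    def emissions(score):
        # Only the coding/non-coding ratio matters, so the non-coding
        # emission is set to 1 and the coding emission to the likelihood
        # ratio (assumed decibans; exponent clamped to avoid overflow).
        exponent = max(-30.0, min(30.0, score / 10.0))
        return [10.0 ** exponent, 1.0, 1.0, 1.0]

    def normalized(vec):
        total = sum(vec)
        return [v / total for v in vec]

    emit = [emissions(s) for s in scores]
    n = len(scores)

    # Forward pass, rescaled at every step for numerical stability.
    forward = [normalized([init[j] * emit[0][j] for j in range(4)])]
    for t in range(1, n):
        prev = forward[-1]
        forward.append(normalized([
            emit[t][j] * sum(prev[i] * trans[i][j] for i in range(4))
            for j in range(4)
        ]))

    # Backward pass, also rescaled.
    backward = [[1.0] * 4 for _ in range(n)]
    for t in range(n - 2, -1, -1):
        nxt = backward[t + 1]
        backward[t] = normalized([
            sum(trans[i][j] * emit[t + 1][j] * nxt[j] for j in range(4))
            for i in range(4)
        ])

    log_odds = []
    for t in range(n):
        post = normalized([forward[t][i] * backward[t][i] for i in range(4)])
        p = min(max(post[0], 1e-12), 1.0 - 1e-12)
        log_odds.append(math.log10(p / (1.0 - p)))
    return log_odds


# A lone weakly positive codon surrounded by negative codons is smoothed
# toward non-coding, while a run of positive codons stays coding.
isolated = coding_log_odds([-20.0] * 5 + [5.0] + [-20.0] * 5)
run = coding_log_odds([20.0] * 10)
print(isolated[5] < 0, all(x > 0 for x in run))
```

Because the posterior at each codon depends on its neighbors in the same frame, isolated positive scores are discounted and short dips inside a long coding run are tolerated, which is the practical effect of the smoothing described above.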

Data Access

The raw bigWig data can be explored interactively with the Table Browser, combined with other datasets in the Data Integrator tool, or downloaded directly from the download server. Please refer to our mailing list archives for questions, or our Data Access FAQ for more information.

Credits and Citations

Questions about the algorithm itself should be directed to Irwin Jungreis. If you use the PhyloCSF browser tracks, please cite Mudge et al. 2019 and Jungreis et al. 2020.

References

Lin MF, Jungreis I, Kellis M. PhyloCSF: a comparative genomics method to distinguish protein coding and non-coding regions. Bioinformatics. 2011 Jul 1;27(13):i275-82. PMID: 21685081; PMC: PMC3117341

Mudge JM, Jungreis I, Hunt T, Gonzalez JM, Wright JC, Kay M, Davidson C, Fitzgerald S, Seal R, Tweedie S et al. Discovery of high-confidence human protein-coding genes and exons by whole-genome PhyloCSF helps elucidate 118 GWAS loci. Genome Res. 2019 Dec;29(12):2073-2087. PMID: 31537640; PMC: PMC6886504

Jungreis I, Sealfon R, Kellis M. Sarbecovirus comparative genomics elucidates gene content of SARS-CoV-2 and functional impact of COVID-19 pandemic mutations. bioRxiv. 2020 Jun 3. PMID: 32577641; PMC: PMC7302193