PhyloCSF Power Track Settings
 
Relative branch length of local alignment, a measure of PhyloCSF statistical power

Track collection: PhyloCSF

Assembly: SARS-CoV-2 Jan. 2020 (NC_045512.2)
Data last updated at UCSC: 2020-03-13 13:38:14

PhyloCSF Track Hub

Description


These tracks show evolutionary protein-coding potential as determined by PhyloCSF [1], to help identify conserved, functional, protein-coding regions of genomes. PhyloCSF examines evolutionary signatures characteristic of alignments of conserved coding regions, such as high frequencies of synonymous codon substitutions and conservative amino acid substitutions, and low frequencies of other missense and nonsense substitutions (CSF = Codon Substitution Frequencies). PhyloCSF provides more information than conservation of the amino acid sequence alone, because it distinguishes among the different codons that code for the same amino acid. One of PhyloCSF's main current applications is to help distinguish protein-coding from non-coding RNAs among novel transcript models obtained from high-throughput transcriptome sequencing. More information on PhyloCSF can be found on the PhyloCSF wiki.


The Smoothed PhyloCSF track shows the PhyloCSF score for each codon in each of 6 frames, smoothed using an HMM. Regions in which most codons have a score greater than 0 are likely to be protein-coding in that frame. No score is shown where the relative branch length is less than 0.1 (see PhyloCSF Power).
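
As a rough illustration of the "most codons have a score greater than 0" rule of thumb, the sketch below computes the fraction of scored codons in a region that are positive. The function names, the use of None for codons with no score, and the 0.5 cutoff standing in for "most" are all assumptions made for this example; they are not part of the track hub.

```python
def fraction_positive(scores):
    """Fraction of scored codons with score > 0.

    `scores` holds one smoothed score per codon in a single frame;
    None marks codons where no score is shown (hypothetical convention).
    Returns None if no codon in the region has a score.
    """
    scored = [s for s in scores if s is not None]
    if not scored:
        return None
    return sum(1 for s in scored if s > 0) / len(scored)


def looks_coding(scores, min_fraction=0.5):
    """True if more than `min_fraction` of scored codons are positive.

    The 0.5 default is one reading of "most", chosen for illustration.
    """
    frac = fraction_positive(scores)
    return frac is not None and frac > min_fraction
```

A region with scores [3.1, 2.0, None, -0.4, 5.2] would count as likely coding in that frame under this reading, since three of its four scored codons are positive.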


The PhyloCSF Power track shows the branch length score at each codon, i.e., the ratio of the branch length of the tree of species present in the local alignment to the total branch length of the tree of all species in the full genome alignment. It indicates the statistical power available to PhyloCSF. Codons with a branch length score less than 0.1 have been excluded altogether (from all tracks) because PhyloCSF does not have sufficient power to produce a meaningful score at these codons. Scores at codons with a branch length score greater than 0.1 but much less than 1 should be considered less certain.
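
The ratio described above can be sketched in a few lines. This is a minimal illustration, not the code used to build the track: the tree representation (nested `(name, branch_length, children)` tuples), the function names, and the convention that a branch counts only when present species lie on both sides of it are assumptions made for the example.

```python
def leaf_names(node):
    """Return the set of leaf names under a (name, length, children) node."""
    name, _length, children = node
    if not children:
        return {name}
    names = set()
    for child in children:
        names |= leaf_names(child)
    return names


def spanning_length(node, present, n_present):
    """Total length of the branches connecting the `present` leaves.

    Returns (length, number of present leaves at or below this node).
    """
    name, length, children = node
    total = 0.0
    if not children:
        below = 1 if name in present else 0
    else:
        below = 0
        for child in children:
            child_total, child_below = spanning_length(child, present, n_present)
            total += child_total
            below += child_below
    # Count the branch above this node only if present leaves lie on both
    # sides of it (assumed convention for pruning the tree).
    if 0 < below < n_present:
        total += length
    return total, below


def relative_branch_length(tree, present):
    """Branch length of the tree of present species over that of the full tree."""
    all_leaves = leaf_names(tree)
    present = set(present) & all_leaves
    full, _ = spanning_length(tree, all_leaves, len(all_leaves))
    local, _ = spanning_length(tree, present, len(present))
    return local / full
```

For a toy tree `("root", 0.0, [("A", 0.1, []), ("anc", 0.2, [("B", 0.3, []), ("C", 0.4, [])])])`, a local alignment containing only A and B gives a ratio of about 0.6. Under the rule stated above, codons whose ratio falls below 0.1 would receive no score.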


Caveats

  • Around 10% of annotated protein-coding regions in human receive a score less than 0. This can happen for various reasons. For example, the region could be coding in the reference species but not in other species, or the alignment might not represent a true orthology relationship between the species.
  • Protein-coding regions will often have a positive score on the reverse strand in the frame in which the third codon positions match up (the "antisense" frame), though the score will usually be higher on the correct strand.
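
To make the "antisense" frame in the second caveat concrete, the sketch below works purely in forward-strand coordinates (0-based, an assumption of this example) and deliberately avoids any frame-numbering convention. A reverse-strand codon is read from its highest coordinate down to its lowest, so its third position is its lowest coordinate. The function names are hypothetical.

```python
def sense_third_position(start):
    """Third codon position of a '+' strand codon occupying start..start+2."""
    return start + 2


def antisense_codon(start):
    """Forward-strand coordinates of the '-' strand codon whose third
    position coincides with the third position of the '+' strand codon
    beginning at `start`."""
    third = sense_third_position(start)
    return (third, third + 1, third + 2)


def reverse_third_position(codon):
    """Third position of a '-' strand codon given in forward coordinates.

    The codon is read from high to low coordinate, so the third position
    is the smallest coordinate.
    """
    return min(codon)
```

For the '+' strand codon at positions 0 to 2, the antisense codon occupies positions 2 to 4, so the two codons overlap only at their shared third position.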


Methods


Tracks were constructed as described in Mudge et al. 2019 [2] and Jungreis et al. 2020 [3]. In brief, PhyloCSF was run with the "fixed" strategy on every codon in every frame on each strand in the wuhCor1/SARS-CoV-2 assembly using an alignment of 44 Sarbecovirus genomes, using the PhyloCSF parameters for 29mammals with the tree replaced with a tree of the 44 Sarbecovirus genomes.


The scores were smoothed using a Hidden Markov Model (HMM) with 4 states, one representing coding regions and three representing non-coding regions. The emission of each codon is its PhyloCSF score. The ratio of the emission probabilities under the coding and non-coding states is computed from the PhyloCSF score, since the score represents the log-likelihood ratio of the alignment under the coding and non-coding models. The three non-coding states have the same emission probabilities but different transition probabilities (each can only transition to coding) to better capture the multimodal distribution of gaps between same-frame coding exons. These transition probabilities represent the best approximation of this gap distribution as a mixture model of three exponential distributions, computed using Expectation Maximization. The HMM defines a probability that each codon is coding, based on the PhyloCSF scores of that codon and nearby codons on the same strand in the same frame, without taking into account start codons, stop codons, or potential splice sites. The PhyloCSF+0 track shows the log-odds that codons in frame 0 on the '+' strand are in the coding state according to the HMM, and similarly for the tracks covering the '-' strand and frames 1 and 2.
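
The smoothing step can be illustrated with a small forward-backward computation over a 4-state HMM. This is a sketch under stated assumptions, not the pipeline used to build the tracks: the transition values, the initial distribution, the treatment of scores as decibans (10 times the base-10 log of the likelihood ratio), and the base-10 log-odds are all choices made for the example, and the function names are hypothetical. Because only the ratio of coding to non-coding emission probabilities matters for the posterior, the non-coding emission is set to 1 and the coding emission to the likelihood ratio.

```python
import math


def make_transitions(leave_coding, mix, leave_noncoding):
    """Build a 4x4 transition matrix over states [C, N1, N2, N3].

    leave_coding:    probability that a coding codon is followed by non-coding
    mix:             weights (summing to 1) for entering N1, N2, N3
    leave_noncoding: per-state probability of returning to coding
    Non-coding states can only stay put or move to coding.
    """
    T = [[0.0] * 4 for _ in range(4)]
    T[0][0] = 1.0 - leave_coding
    for k in range(3):
        T[0][k + 1] = leave_coding * mix[k]
        T[k + 1][0] = leave_noncoding[k]
        T[k + 1][k + 1] = 1.0 - leave_noncoding[k]
    return T


def coding_posteriors(scores, T, initial):
    """Posterior probability of the coding state at each codon."""
    n = len(scores)
    if n == 0:
        return []
    # Emission likelihoods up to a per-codon constant (decibans assumed).
    emit = [[10.0 ** (s / 10.0), 1.0, 1.0, 1.0] for s in scores]

    # Forward pass, rescaled at every step to avoid underflow.
    fwd = []
    cur = [initial[j] * emit[0][j] for j in range(4)]
    for t in range(n):
        if t > 0:
            prev = fwd[-1]
            cur = [emit[t][j] * sum(prev[i] * T[i][j] for i in range(4))
                   for j in range(4)]
        norm = sum(cur)
        fwd.append([c / norm for c in cur])

    # Backward pass, also rescaled.
    bwd = [[1.0] * 4 for _ in range(n)]
    for t in range(n - 2, -1, -1):
        nxt = [sum(T[i][j] * emit[t + 1][j] * bwd[t + 1][j] for j in range(4))
               for i in range(4)]
        norm = sum(nxt)
        bwd[t] = [x / norm for x in nxt]

    posteriors = []
    for t in range(n):
        weights = [fwd[t][j] * bwd[t][j] for j in range(4)]
        posteriors.append(weights[0] / sum(weights))
    return posteriors


def log_odds(p, eps=1e-12):
    """Base-10 log-odds of a probability, clamped away from 0 and 1."""
    p = min(max(p, eps), 1.0 - eps)
    return math.log10(p / (1.0 - p))
```

With illustrative parameters such as `make_transitions(0.05, [0.5, 0.3, 0.2], [0.2, 0.02, 0.002])` and a uniform initial distribution, a run of positive scores flanked by negative scores yields positive log-odds inside the run and negative log-odds outside it. The three different return probabilities play the role of the three exponential components of the gap-length mixture described above.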


Credits


Questions about the algorithm itself should be directed to Irwin Jungreis.


Citing the PhyloCSF Tracks


If you use the PhyloCSF browser tracks, please cite Mudge et al. 2019 [2] and Jungreis et al. 2020 [3].


References


[1] Lin MF, Jungreis I, and Kellis M (2011). PhyloCSF: a comparative genomics method to distinguish protein-coding and non-coding regions. Bioinformatics 27:i275-i282 (ISMB/ECCB 2011).

[2] Mudge JM, Jungreis I, Hunt T, Gonzalez JM, Wright J, Kay M, Davidson C, Fitzgerald S, Seal R, Tweedie S, He L, Waterhouse RM, Li Y, Bruford E, Choudhary J, Frankish A, Kellis M (2019). Discovery of high-confidence human protein-coding genes and exons by whole-genome PhyloCSF helps elucidate 118 GWAS loci. Genome Research gr-246462. doi: 10.1101/gr.246462.118.

[3] Jungreis I, Sealfon R, and Kellis M (2020). Sarbecovirus comparative genomics elucidates gene content of SARS-CoV-2 and functional impact of COVID-19 pandemic mutations. bioRxiv 2020.