Gene Model Information Table Fields
 
  • category - One of coding, noncoding, antisense, or nearCoding. A coding transcript is one for which there is relatively good evidence that it produces a protein. The nearCoding transcripts overlap coding transcripts by at least 20 bases on the same strand but do not themselves appear to produce protein products. In many cases this is because they are splice variants with introns after the stop codon and therefore undergo nonsense-mediated decay. Antisense transcripts overlap coding transcripts by at least 20 bases on the opposite strand. The remaining transcripts, which are neither coding nor overlapping coding transcripts, are categorized as noncoding.
  • exon count - The number of exons in the gene. Single-exon genes are generally somewhat less reliable than multi-exon genes, though there are many well-known genuine single-exon genes, such as the histones and the Sox family.
  • ORF size - The size, in bases, of the open reading frame in the mRNA. Divide by three to get the size of the protein in amino acids.
  • txCdsPredict score - The score from the txCdsPredict program. This program weighs a variety of evidence, including the presence of a Kozak consensus sequence at the start codon, the length of the ORF, the presence of upstream ORFs, homology in other species, and nonsense-mediated decay. In general, a transcript scoring over 1000 is almost certain to code for a protein, and one scoring over 800 has about a 90% chance of coding for a protein.
  • has start codon - Indicates if the initial codon is an ATG.
  • has end codon - Indicates if the last codon is TAA, TAG, or TGA.
  • selenocysteine - Indicates if this is one of the special proteins where TGA encodes the amino acid selenocysteine rather than a stop codon.
  • nonsense-mediated-decay - Indicates whether the final intron is more than 55 bases after the stop codon. If so, the mRNA will generally be degraded before it can produce a detectable amount of protein. Therefore, when this condition is true, we remove the predicted coding region from the transcript.
  • CDS single in 3' UTR - Indicates that the coding region (CDS) resides in a single exon, and that this exon lies entirely in the 3' UTR of another transcript that codes for a different protein not overlapping the ORF in the same frame. This is a strong indicator that the CDS is a coincidental open reading frame rather than true evidence that the transcript codes for protein. We remove the CDS from non-RefSeq transcripts that meet this condition, which often results from a retained intron or from missing the initial parts of a transcript.
  • CDS single in intron - This is another strong indicator that the ORF is not real. Here the coding region (CDS) lies entirely within an intron of another transcript that has strong evidence of coding for a protein. We remove the CDS from non-RefSeq transcripts that meet this condition, which generally results from a retained intron.
  • frame shift in genome - This only occurs for RefSeq transcripts. Here a frame shift is detected in the coding region when aligning the transcript against the genome. Since RefSeq examines these cases carefully, the frame shift is strong evidence that the genome sequence is in error, or that the anonymous DNA donor carried a frame-shift mutation in this gene. In general there will be multiple independent cDNA clones supporting the RefSeq over the genome. In the gene display on the browser, one or two bases will be removed from the gene to keep the frame intact.
  • stop codon in genome - This also only occurs for RefSeq transcripts, and as with the frame shifts, there are generally multiple lines of evidence suggesting sequencing error or mutation in the reference genome. In the gene display on the browser, three bases will be removed from the gene to avoid the stop.
  • retained intron - The transcript contains what is an intron in an overlapping transcript on the same strand. In many cases this indicates that the transcript was not completely processed. Unless specific steps are taken to isolate cytoplasmic rather than nuclear RNA, a certain fraction of the RNA isolated for sequencing will be incompletely processed. Transcripts with retained introns should be viewed with suspicion, especially if they are long. However, there are cases where fully mature mRNA transcripts are made with and without a particular intron, so transcripts with retained introns are not eliminated from this gene set.
  • end bleed into intron - Very often when an intron is retained, it is so long that the next exon is not reached and sequenced. In this case the retained intron can't be detected directly. However, high values of "end bleed" are strongly suggestive of a retained intron. End bleed measures how far the end of a transcript extends into an intron defined by another overlapping transcript. Note, however, that alternative promoters and alternative polyadenylation sites can create end bleeds in fully mature transcripts.
  • RNA accession - The RefSeq, GenBank/EMBL/DDBJ, Rfam, or tRNA accession on which this transcript is most closely based. Note that, where possible, the splice sites are taken from a consensus of RNA alignments rather than from a single RNA. For non-RefSeq genes the bases are taken from the genome rather than the RNA. However, the RNA does define which introns and exons are used to build the transcript.
  • RNA size - The size of the RNA on which this transcript is most closely based, including the poly-A tail (if any).
  • Alignment % ID - Percentage identity within the alignment of the RNA to the genome.
  • % Coverage - The percentage of the RNA covered by the alignment to genome. This excludes the poly-A tail, if the RNA has one.
  • # of Alignments - The number of times this RNA aligns to the genome at very high stringency. More care must be taken in interpreting genes based on transcripts with multiple alignments. We do substantial filtering to avoid pseudogenes, but extremely recent, extremely complete pseudogenes may still pass these filters and cause multiple alignments.
  • # of AT/AC introns - The number of introns in this transcript with AT/AC rather than the usual GT/AG ends. There are roughly 300 genes with legitimate AT/AC introns.
  • # of strange splices - The number of introns that have ends which are neither GT/AG, GC/AG, nor AT/AC. Many of these are the result of sequencing errors, or polymorphisms between the DNA donors and the RNA donors.
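
Several of the fields above reduce to simple rules, and a short sketch may make them easier to check by hand. The Python below is illustrative only and is not the code used to build this table: every function name and input shape is hypothetical, and the only values taken from the descriptions above are the 20-base overlap, the 55-base distance, the three stop codons, and the splice-site pairs. Two points are assumptions rather than statements from the text: same-strand overlap is tested before opposite-strand overlap when assigning a category, and the protein size is rounded down to a whole number of codons.

```python
# Illustrative sketch only; all names and input shapes are hypothetical.

STOP_CODONS = {"TAA", "TAG", "TGA"}
KNOWN_SPLICE_ENDS = ("GT/AG", "GC/AG", "AT/AC")


def protein_size(orf_size):
    """ORF size in bases divided by three (rounded down, an assumption)."""
    return orf_size // 3


def has_start_codon(cds):
    """True if the first codon of the coding sequence is ATG."""
    return cds[:3].upper() == "ATG"


def has_end_codon(cds):
    """True if the last codon of the coding sequence is TAA, TAG, or TGA."""
    return cds[-3:].upper() in STOP_CODONS


def classify_intron(donor, acceptor):
    """Label an intron by its first two (donor) and last two (acceptor) bases."""
    ends = f"{donor.upper()}/{acceptor.upper()}"
    return ends if ends in KNOWN_SPLICE_ENDS else "strange"


def count_splices(introns):
    """Return (# of AT/AC introns, # of strange splices).

    `introns` is a list of (donor, acceptor) two-base strings.
    """
    labels = [classify_intron(donor, acceptor) for donor, acceptor in introns]
    return labels.count("AT/AC"), labels.count("strange")


def is_nmd_candidate(stop_codon_end, final_intron_start):
    """True if the final intron is more than 55 bases after the stop codon.

    Both positions are assumed to be measured along the transcript.
    """
    return final_intron_start - stop_codon_end > 55


def category(is_coding, same_strand_overlap, opposite_strand_overlap):
    """Assign coding, nearCoding, antisense, or noncoding.

    The overlaps are the number of bases shared with a coding transcript
    on the same or the opposite strand. Checking the same strand first is
    an assumption; the text does not say which rule wins when both apply.
    """
    if is_coding:
        return "coding"
    if same_strand_overlap >= 20:
        return "nearCoding"
    if opposite_strand_overlap >= 20:
        return "antisense"
    return "noncoding"
```

For example, under these assumptions category(False, 25, 0) gives "nearCoding", and count_splices([("GT", "AG"), ("AT", "AC"), ("CT", "AC")]) gives (1, 1): one AT/AC intron and one strange splice.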