The GENCODE track is composed of all the gene models in the GENCODE VM23 release. By default, only the basic gene set
is displayed, which is a subset of the comprehensive gene set. The basic set represents
transcripts that GENCODE believes will be useful to the majority of users.
The track includes protein-coding genes, non-coding RNA genes, and pseudogenes, though
pseudogenes are not displayed by default.
The following table provides statistics for the VM23 release derived from the GTF file that
contains annotations only on the main chromosomes. More information on how these statistics were
generated can be found on the GENCODE site.
GENCODE VM23 Release Stats
Genes                                         Observed   Transcripts                           Observed
Protein-coding genes                          21,849     Protein-coding transcripts            59,188
Long non-coding RNA genes                     13,201     - full length protein-coding          45,391
Small non-coding RNA genes                    6,108      - partial length protein-coding       13,797
Pseudogenes                                   13,681     Nonsense mediated decay transcripts   7,200
Immunoglobulin/T-cell receptor gene segments  700        Long non-coding RNA loci transcripts  18,339
For more information on the different gene tracks, see our Genes FAQ.
Display Conventions and Configuration
By default, this track displays only the basic GENCODE set, splice variants, and non-coding
genes. It includes options to display the comprehensive GENCODE set and pseudogenes. To customize
these options, the respective boxes can be checked or unchecked at the top of this description
page. Our FAQ includes examples of how to display a single transcript per gene and how to
switch between the basic and comprehensive gene sets.
This track also includes a variety of labels which identify the transcripts when visibility is
set to "full" or "pack". Gene symbols (e.g. NIPA1) are displayed by default, but additional
options include GENCODE Transcript ID (ENSMUST00000052204.5), UCSC Known Gene ID (uc009hdu.2),
and UniProt Display ID (Q8BHK1). Additional information about gene and transcript
names can be found in our FAQ.
This track, in general, follows the display conventions for gene prediction tracks. The exons
for putative non-coding genes and untranslated regions are represented by relatively thin blocks,
while those for coding open reading frames are thicker. The following color key is used:
Black -- feature has a corresponding entry in the Protein Data Bank (PDB)
Dark blue -- transcript has been reviewed or validated by either the RefSeq or SwissProt staff
Medium blue -- other RefSeq transcripts
Light blue -- non-RefSeq transcripts
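As a rough illustration of the thin and thick blocks described above, the following Python sketch splits a transcript's exons at the coding-region boundaries. It assumes genePred-style half-open coordinates with a single coding start and end per transcript; the function name split_thin_thick is hypothetical and is not part of any UCSC tool.

```python
from typing import List, Tuple

Interval = Tuple[int, int]  # half-open [start, end), an assumption of this sketch


def split_thin_thick(
    exons: List[Interval], cds_start: int, cds_end: int
) -> Tuple[List[Interval], List[Interval]]:
    """Split exons into thin (non-coding or UTR) and thick (coding) blocks.

    A transcript with cds_start >= cds_end is treated as non-coding,
    so every exon is returned as a thin block.
    """
    thin: List[Interval] = []
    thick: List[Interval] = []
    for start, end in exons:
        # Clip the exon to the coding region.
        coding_start = max(start, cds_start)
        coding_end = min(end, cds_end)
        if cds_start >= cds_end or coding_start >= coding_end:
            # No coding bases in this exon: draw it thin.
            thin.append((start, end))
            continue
        if start < coding_start:
            thin.append((start, coding_start))  # untranslated part before the CDS
        thick.append((coding_start, coding_end))
        if coding_end < end:
            thin.append((coding_end, end))  # untranslated part after the CDS
    return thin, thick
```

Under these assumptions, an exon that straddles the start or end of the coding region is drawn partly thin and partly thick, while the exons of a non-coding transcript are all thin.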
This track contains an optional codon coloring feature that allows users to quickly validate
and compare gene predictions. There is also an option to display the data as a density graph,
which can be helpful for visualizing the distribution of items over a region.
Methods
All the GENCODE genes in the comprehensive set are downloaded from the GENCODE
website.
Data from other sources are correlated with the GENCODE data to build the knownTo_ tables.
Related Data
The GENCODE Genes transcripts are annotated in numerous tables. These
include tables that link GENCODE Genes transcripts to external datasets (such as
knownToLocusLink, which maps GENCODE Genes transcripts to Entrez identifiers, previously
known as Locus Link identifiers), and tables that detail some property of GENCODE Genes transcript
sequences (such as knownToPfam, which identifies any Pfam domains found in the GENCODE Genes
protein-coding transcripts).
One can see a full list of the associated tables by clicking the
View table schema link at the top of the page, or in the
Table Browser by selecting GENCODE Genes from the track menu;
this list is then available on the table menu. Note that some of these tables refer to
GENCODE Genes by its table name knownGene, sometimes abbreviated as known or
kg. While the complete set of annotation tables is too long to describe, some of the more
important tables are described below.
kgXref identifies the RefSeq, SwissProt, Rfam, or tRNA sequences (if any) which are
associated with each transcript.
knownToRefSeq identifies the RefSeq transcript that each GENCODE Genes transcript is
most closely associated with. That RefSeq transcript is the RefSeq transcript that the GENCODE
Genes transcript overlaps at the most bases.
knownGeneMrna contains the genomic sequence for each of the GENCODE Genes models.
This may not be the same as the actual mRNA used to validate the gene model.
knownGenePep contains the protein sequences derived from the knownGeneMrna transcript
sequences. Any protein-level annotations, such as the contents of the knownToPfam table, are based
on these sequences.
knownIsoforms maps each transcript to a cluster ID, where each cluster groups the isoforms of
the same gene.
knownCanonical identifies the canonical isoform of each cluster (gene), with clusters defined
by Ensembl gene IDs. The canonical transcript is the APPRIS principal transcript when one is
available. If no APPRIS tag exists for any transcript associated with the cluster, then a
transcript in the BASIC set is chosen. If no BASIC transcript exists, then the longest isoform
is used.
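The fallback order described for knownCanonical (APPRIS principal, then BASIC, then longest isoform) can be sketched in Python as follows. This is only an illustration, not the UCSC pipeline: the Transcript class, its field names, and pick_canonical are hypothetical, and breaking ties within a tier by transcript length is an assumption that the description above does not state.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Transcript:
    """Hypothetical minimal record for one isoform in a cluster."""
    transcript_id: str
    length: int
    appris_principal: bool = False  # carries an APPRIS principal tag
    basic: bool = False             # member of the GENCODE BASIC set


def pick_canonical(cluster: List[Transcript]) -> Transcript:
    """Choose one transcript per cluster: APPRIS, else BASIC, else longest."""
    if not cluster:
        raise ValueError("cluster must contain at least one transcript")
    tiers = (
        [t for t in cluster if t.appris_principal],
        [t for t in cluster if t.basic],
        cluster,
    )
    for candidates in tiers:
        if candidates:
            # Assumption: within a tier, prefer the longest transcript.
            return max(candidates, key=lambda t: t.length)
    raise AssertionError("unreachable: the last tier is the whole cluster")
```

With this sketch, a short APPRIS principal transcript is preferred over a longer transcript that lacks the tag, which mirrors the priority order given above.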
Data access
GENCODE Genes and its associated tables can be explored interactively using the
Table Browser or the
Data Integrator. The data are also all available as
downloadable files. For example, if you would like to download the entire GENCODE Genes set
as seen in the
View table schema page, the
knownGene.txt.gz file in the downloads directory contains a compressed version of the data. All
the tables can also be queried directly from our public MySQL servers. Information on accessing
these data through MySQL can be found on our
help page as well as on
our blog.
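For readers who download knownGene.txt.gz, the following Python sketch shows one way to read it. It assumes the file is tab-separated with no header row and that the columns follow the order listed in KNOWN_GENE_COLUMNS; that order is an assumption here and should be checked against the View table schema page. The function names are hypothetical.

```python
import gzip
from typing import Dict, Iterator

# Assumed column order; verify against the View table schema page.
KNOWN_GENE_COLUMNS = [
    "name", "chrom", "strand", "txStart", "txEnd", "cdsStart", "cdsEnd",
    "exonCount", "exonStarts", "exonEnds", "proteinID", "alignID",
]
INT_COLUMNS = ("txStart", "txEnd", "cdsStart", "cdsEnd", "exonCount")
LIST_COLUMNS = ("exonStarts", "exonEnds")


def parse_known_gene_line(line: str) -> Dict[str, object]:
    """Parse one tab-separated row into a dict keyed by column name."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) < len(KNOWN_GENE_COLUMNS):
        raise ValueError(f"expected {len(KNOWN_GENE_COLUMNS)} fields, got {len(fields)}")
    record: Dict[str, object] = dict(zip(KNOWN_GENE_COLUMNS, fields))
    for key in INT_COLUMNS:
        record[key] = int(str(record[key]))
    for key in LIST_COLUMNS:
        # Exon coordinate lists are comma-separated with a trailing comma.
        record[key] = [int(x) for x in str(record[key]).split(",") if x]
    return record


def read_known_gene(path: str) -> Iterator[Dict[str, object]]:
    """Yield one parsed record per non-empty line of a gzip-compressed file."""
    with gzip.open(path, "rt") as handle:
        for line in handle:
            if line.strip():
                yield parse_known_gene_line(line)
```

A caller could then iterate over read_known_gene("knownGene.txt.gz") and filter on fields such as chrom or strand, keeping in mind that the column names above are assumed rather than taken from this page.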
Credits
The GENCODE Genes track was produced at UCSC from the GENCODE comprehensive
gene set using a computational pipeline developed by Jim Kent and Brian Raney.