RNA-seq from ENCODE/Caltech

Track collection: ENCODE RNA-seq

Subtracks by read type and cell line:

Replicate: 1, 2, 3, 4
Read Type: Paired 75 nt (200 bp); Paired 75 nt (400 bp); Single Strand-Specific 75 nt
Cell Line:
GM12878 (Tier 1) 
H1-hESC (Tier 1) 
K562 (Tier 1) 
HeLa-S3 (Tier 2) 
HepG2 (Tier 2) 
HUVEC (Tier 2) 
LHCN-M2 Myoblast (Tier 2) 
LHCN-M2 Myocyte 7 d (Tier 2) 
MCF-7 (Tier 2) 
GM12891 
GM12892 
HCT-116 
HSMM 
NHEK 
NHLF 
Subtracks (only selected/visible):

Cell Line | Read Type | View | Replicate | Track Name | Restricted Until
K562 | Paired 75 nt (200 bp) | Raw Signal | 2 | K562 200 bp paired read RNA-seq Signal Rep 2 from ENCODE/Caltech | 2012-02-11

Assembly: Human Feb. 2009 (GRCh37/hg19)

Description

This track was produced as part of the ENCODE Project. RNA-seq is a method for mapping and quantifying the transcriptome of any organism that has a genomic DNA sequence assembly. An RNA sample was reverse-transcribed into cDNA, which was then sequenced on an Illumina Genome Analyzer (GAI or GAIIx) (Mortazavi et al., 2008). The transcriptome measurements shown on these tracks were performed on polyA-selected RNA from total cellular RNA using two different protocols: one that preserves information about which strand each read comes from and one that does not. Because of the enzymology of library construction, gene and transcript quantification is more accurate with the non-strand-specific protocol. The strand-specific protocol is useful for assigning strandedness but is generally less reliable for quantification.

Non-strand-specific Protocol (deep "reference" transcriptome measurements, 2x75 bp reads)

PolyA-selected RNA was fragmented by magnesium-catalyzed hydrolysis, converted into cDNA by random priming and then amplified. Data were produced in two formats: single reads, each of which came from one end of a cDNA molecule, and paired-end reads, which were obtained as pairs from both ends of cDNAs. This RNA-seq protocol does not specify the coding strand. As a result, there is ambiguity at loci where both strands are transcribed. The "randomly primed" reverse transcription appears not to be fully random. This is inferred from a sequence bias in the first residues of the read population, and the bias likely contributes to the observed unevenness in sequence coverage across transcripts.
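
The sequence bias mentioned above can be looked for by tabulating base composition at each of the first few read positions. The following is a minimal sketch, not the procedure used to produce this track; the function name, the number of positions examined and the toy reads are all illustrative assumptions.

```python
from collections import Counter


def positional_base_frequencies(reads, n_positions=12):
    """For each of the first n_positions read offsets, return a dict
    mapping A/C/G/T to the fraction of reads with that base."""
    counts = [Counter() for _ in range(n_positions)]
    for read in reads:
        for i, base in enumerate(read[:n_positions].upper()):
            counts[i][base] += 1
    freqs = []
    for c in counts:
        total = sum(c.values())
        # Leave the entry empty if no read reaches this position.
        freqs.append({b: c[b] / total for b in "ACGT"} if total else {})
    return freqs


# Toy reads, made up for illustration (not ENCODE data).
toy_reads = ["GATTACA", "GCTTACA", "GATCACA", "CATTACA"]
freqs = positional_base_frequencies(toy_reads, n_positions=3)
for position, f in enumerate(freqs, start=1):
    print(position, f)
```

With unbiased priming, each position would be expected to show roughly the same base composition as the transcriptome as a whole; a strong departure confined to the first positions would be consistent with the bias described here.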

Strand-specific Protocol (1x75 bp reads)

PolyA-selected RNA was fragmented by magnesium-catalyzed hydrolysis. 3' and 5' adapters were ligated to the 3' and 5' ends of the fragments, respectively. The resulting RNA molecules were converted to cDNA and amplified. This RNA-seq protocol does specify the coding strand, because each read is in the same 5'-3' orientation as the original RNA strand. As a result, loci where both strands are transcribed can be disambiguated. However, RNA ligation is an inherently biased process, so sequence coverage across transcripts is more uneven than in the non-strand-specific data, and quantification is less accurate.

Data Analysis

Reads were aligned to the hg19 human reference genome using TopHat (Trapnell et al., 2009), a program specifically designed to align RNA-seq reads and discover splice junctions de novo. Cufflinks (Trapnell et al., 2010), a de novo transcript assembly and quantification software package, was run on the TopHat alignments to discover and quantify novel transcripts and to obtain transcript expression estimates based on the GENCODE annotation. All sequence files, alignments, gene and transcript models and expression estimates files are available for download.

Display Conventions and Configuration

This track is a multi-view composite track that contains multiple data types (views). For each view, there are multiple subtracks that display individually on the browser. Instructions for configuring multi-view tracks are here. The following views are in this track:

Alignments
The Alignments view shows reads aligned to the genome. Alignments are colored by cell type. The optional tags used in the alignment files are NH, XS, CP, NM and CC. The custom 'XS' tag indicates the transcript strand (sense or antisense) assigned to the read. See the Bowtie Manual (Langmead et al., 2009) for more information about the SAM Bowtie output (including other tags) and the SAM Format Specification for more information on the SAM/BAM file format.
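
As an illustration of how these tags appear in a SAM record, the sketch below parses the optional TAG:TYPE:VALUE fields that follow the 11 mandatory columns. The example record is invented for demonstration and does not come from the track's files; the helper name is hypothetical.

```python
def parse_optional_tags(sam_line):
    """Return the optional fields of one SAM alignment line as a dict.

    Optional fields start at column 12 and have the form TAG:TYPE:VALUE.
    """
    fields = sam_line.rstrip("\n").split("\t")
    tags = {}
    for field in fields[11:]:
        tag, type_code, value = field.split(":", 2)
        if type_code == "i":      # integer
            value = int(value)
        elif type_code == "f":    # floating point
            value = float(value)
        tags[tag] = value         # 'A' (character) and 'Z' (string) stay as text
    return tags


# Invented example record (not real data) for a spliced 75 nt read.
example = "\t".join([
    "read1", "0", "chr1", "14363", "3", "30M500N45M", "*", "0", "0",
    "A" * 75, "I" * 75,
    "NM:i:1",          # edit distance to the reference
    "NH:i:2",          # number of reported alignments for this read
    "CC:Z:chr15",      # reference name of the next reported alignment
    "CP:i:102516742",  # leftmost coordinate of the next reported alignment
    "XS:A:-",          # transcript strand assigned to the read
])
tags = parse_optional_tags(example)
print(tags)
```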

For Stranded Data (1x75)

Plus Raw Signal (only for stranded data)
Density graph (wiggle) of signal enrichment based on a normalized aligned read density (Reads Per Million, RPM) for reads aligning to the forward strand.
Minus Raw Signal (only for stranded data)
Density graph (wiggle) of signal enrichment based on a normalized aligned read density (Reads Per Million, RPM) for reads aligning to the reverse strand. Minus-strand scores are multiplied by -1 for display so that plus- and minus-strand data can be viewed around a common axis.
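
The RPM scaling and the sign flip for the minus strand can be sketched as follows. This assumes that the normalization divides raw per-base coverage by the total number of aligned reads in millions; the exact denominator used for these tracks is not stated here, and the function name and the counts are illustrative.

```python
def rpm_signal(raw_counts, total_aligned_reads, minus_strand=False):
    """Scale raw per-base read counts to Reads Per Million (RPM).

    Assumption: the denominator is the total number of aligned reads.
    Minus-strand values are negated so they plot below the axis.
    """
    if total_aligned_reads <= 0:
        raise ValueError("total_aligned_reads must be positive")
    scale = 1_000_000 / total_aligned_reads
    sign = -1.0 if minus_strand else 1.0
    return [sign * count * scale for count in raw_counts]


# Illustrative numbers only: 20 million aligned reads in the sample.
plus = rpm_signal([0, 4, 10], total_aligned_reads=20_000_000)
minus = rpm_signal([0, 4, 10], total_aligned_reads=20_000_000, minus_strand=True)
print(plus)
print(minus)
```

Because each sample is scaled by its own read total, signal heights can be compared across samples of different sequencing depth.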

For Paired-End Non-Stranded Data (2x75)

Raw Signal (only for paired-end data)
Density graph (wiggle) of signal enrichment based on a normalized aligned read density (Reads Per Million, RPM). The RPM measure assists in visualizing the relative amount of a given transcript across multiple samples. Separate tracks for the forward (plus) and reverse (minus) strands are provided for strand-specific data.
Splice Sites
Subset of aligned reads that cross splice junctions.
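
In SAM output, a read that crosses a splice junction carries an N operation (skipped reference region) in its CIGAR string. The sketch below recovers the intron coordinates implied by each N operation; it is an illustration of the SAM convention, not the selection procedure used to build the Splice Sites view, and the function name and coordinates are hypothetical.

```python
import re

CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")


def splice_junctions(pos, cigar):
    """Return (intron_start, intron_end) pairs, 1-based inclusive,
    for every N operation in the CIGAR string of one alignment."""
    junctions = []
    ref_pos = pos  # SAM POS: 1-based leftmost reference coordinate
    for length, op in CIGAR_OP.findall(cigar):
        length = int(length)
        if op == "N":
            junctions.append((ref_pos, ref_pos + length - 1))
        if op in "MDN=X":  # operations that consume reference bases
            ref_pos += length
    return junctions


print(splice_junctions(1000, "30M500N45M"))  # spliced read
print(splice_junctions(1000, "75M"))         # unspliced read
```

A read is "spliced" in this sense whenever the returned list is non-empty.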

Metadata for a particular subtrack can be found by clicking the down arrow in the list of subtracks.

Methods

Experimental Procedures

Cells were grown according to the approved ENCODE cell culture protocols, except for H1-hESC, for which frozen cell pellets were purchased from Cellular Dynamics. Cells were lysed in RLT buffer (Qiagen RNeasy kit) and processed on RNeasy midi columns according to the manufacturer's protocol, with the inclusion of the "on-column" DNase digestion step to remove residual genomic DNA.

A quantity of 75 µg of total RNA was selected twice with oligo-dT beads (Dynal) according to the manufacturer's protocol to isolate mRNA from each of the preparations. For 2x75 bp non-stranded RNA-seq, 100 ng of mRNA was then processed according to the protocol in Mortazavi et al. (2008), and prepared for sequencing on the Genome Analyzer flow cell according to the protocol for the ChIP-seq DNA genomic DNA kit (Illumina). Most paired-end libraries were size-selected around 200 bp (fragment length); a few additional replicates were size-selected at 400 bp specifically to investigate the effect of fragment length on results. Strand-specific RNA-seq libraries were prepared from 100 ng of mRNA from the same preparation following Illumina's Strand-Specific RNA-seq protocol.

Libraries were sequenced with an Illumina Genome Analyzer I or an Illumina Genome Analyzer IIx according to the manufacturer's recommendations. Reads of 75 bp length were obtained, single-end for directional, strand-specific libraries (1x75D) and paired-end for non-strand-specific libraries (2x75).

Data Processing and Analysis

Reads were mapped to the human reference genome (hg19) using TopHat (version 1.0.14). The reference included or excluded the Y chromosome depending on the sex of the cell line, and excluded the random chromosomes and haplotypes in all cases. TopHat was run with default settings, except that an empirically determined mean inner-mate distance was specified. After mapping reads to the genome and identifying splice junctions, the data were further analyzed with the transcript assembly and quantification software Cufflinks (version 0.9.3), using the sequence bias detection and correction option. Cufflinks was used in two modes: 1) expression for genes and individual transcripts was quantified based on the GENCODE annotation, for both versions v3c and v4 of GENCODE GRCh37; 2) Cufflinks was run in de novo transcript assembly and quantification mode to obtain candidate novel transcript and gene models and expression estimates for them.
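
Two of the choices described above can be sketched in a few lines: selecting reference sequences by cell line sex, and the nominal inner-mate distance implied by fragment and read length. This is a sketch under stated assumptions, not the pipeline's code: the function names are hypothetical, the "_random" and "_hap" name patterns follow hg19 sequence naming, the handling of other unplaced sequences is not specified in this description, and the actual inner-mate distance was determined empirically rather than computed this way.

```python
def select_reference_chromosomes(chrom_names, sex):
    """Drop random and haplotype sequences, and chrY for female cell lines.

    Assumption: sequences are identified by hg19-style names.
    """
    keep = []
    for name in chrom_names:
        if "_random" in name or "_hap" in name:
            continue  # excluded in all cases
        if name == "chrY" and sex == "female":
            continue  # excluded only for female cell lines
        keep.append(name)
    return keep


def nominal_inner_mate_distance(fragment_length, read_length):
    """Distance between the inner ends of two mates of equal length,
    assuming fragment_length counts only the cDNA insert."""
    return fragment_length - 2 * read_length


names = ["chr1", "chr6", "chr6_cox_hap2", "chr1_gl000191_random", "chrX", "chrY"]
print(select_reference_chromosomes(names, sex="female"))
print(select_reference_chromosomes(names, sex="male"))
print(nominal_inner_mate_distance(200, 75))
print(nominal_inner_mate_distance(400, 75))
```

For 2x75 reads, the nominal values are 50 bp for the 200 bp libraries and 250 bp for the 400 bp libraries; an empirical estimate from the data can differ from these.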

Downloadable Files

The following files can be found on the downloads page:

.fastq - Raw sequence files in FASTQ format with Phred+33 quality scores.
Junctions.bedRnaElements - A BED file containing TopHat-defined splice junctions.
TranscriptDeNovo.gtf - A GTF file containing transcript models and expression estimates in FPKM (Fragments Per Kilobase per Million reads) produced by Cufflinks in de novo mode.
TranscriptGencV3c.gtf - Expression level estimates at the transcript level for the GENCODE GRCh37.v3c annotation in GTF format.
GenesDeNovo.gtf - Expression estimates for genes defined by Cufflinks in de novo mode in GTF format.
GeneGencV3c.gtf - Expression level estimates at the gene level for the GENCODE GRCh37.v3c annotation in GTF format.
ExonGencV3c.gtf - Expression level estimates for GENCODE GRCh37.v3c exons in GTF format derived by summing the expression levels in FPKM for all transcripts containing a given exon.
TSS.gtf - Expression level estimates for GENCODE GRCh37.v3c transcription start sites (TSS) in GTF format derived by summing the expression levels in FPKM for all transcripts originating from a given TSS.
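
The exon-level and TSS-level values described above are sums of transcript FPKM values over the transcripts that share a feature. The sketch below shows that summation, together with the basic FPKM definition for reference. Cufflinks estimates transcript abundances with a statistical model, so the simple formula here is only the definition of the unit, not the Cufflinks computation; the function names, identifiers and numbers are illustrative.

```python
from collections import defaultdict


def fpkm(fragment_count, transcript_length_bp, total_mapped_fragments):
    """Fragments per kilobase of transcript per million mapped fragments."""
    kilobases = transcript_length_bp / 1_000
    millions = total_mapped_fragments / 1_000_000
    return fragment_count / kilobases / millions


def sum_fpkm_by_feature(transcript_fpkm, transcript_features):
    """Sum transcript FPKM over every feature (exon or TSS) a transcript contains."""
    totals = defaultdict(float)
    for transcript_id, value in transcript_fpkm.items():
        for feature_id in transcript_features.get(transcript_id, ()):
            totals[feature_id] += value
    return dict(totals)


# Illustrative values only.
print(fpkm(500, 2_000, 10_000_000))

transcript_fpkm = {"tx1": 12.5, "tx2": 4.0}
exons_of = {"tx1": ["exonA", "exonB"], "tx2": ["exonB", "exonC"]}
exon_fpkm = sum_fpkm_by_feature(transcript_fpkm, exons_of)
print(exon_fpkm)
```

The same function applies to transcription start sites by mapping each transcript to the single TSS it originates from.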

Verification

  • Known exon maps as displayed on the genome browser are confirmed by the alignment of sequence reads.
  • Known spliced exons are detected at the expected frequency for transcripts of given abundance.
  • Spiked-in RNA transcripts from Arabidopsis and phage lambda are detected in a linear range over 5 orders of magnitude.
  • Endpoint RT-PCR confirms presence of selected 3' UTR extensions.
  • Correlation to published microarray data: r = 0.62.

Release Notes

This is release 4 (August 2012). Fastq files for GM12892, GM12891 and K562 (R1x75) were replaced after errors were found in the GEO submission process.

Credits

Wold Group: Ali Mortazavi, Brian Williams, Georgi Marinov, Diane Trout, Brandon King, Ken McCue, Lorian Schaeffer.

Myers Group: Norma Neff, Florencia Pauli, Fan Zhang, Tim Reddy, Rami Rauch, Chris Partridge.

Illumina gene expression group: Gary Schroth, Shujun Luo, Eric Vermaas.

TopHat/Cufflinks development: Cole Trapnell, Lior Pachter, Steven Salzberg.

Contacts: Georgi Marinov (data coordination/informatics/experimental), Diane Trout (informatics) and Brian Williams (experimental).

References

Langmead B, Trapnell C, Pop M, Salzberg SL. Ultrafast and memory-efficient alignment of short DNA sequences to the human genome. Genome Biol. 2009;10(3):R25.

Mortazavi A, Williams BA, McCue K, Schaeffer L, Wold B. Mapping and quantifying mammalian transcriptomes by RNA-Seq. Nat Methods. 2008 Jul;5(7):621-8.

Trapnell C, Pachter L, Salzberg SL. TopHat: discovering splice junctions with RNA-Seq. Bioinformatics. 2009 May 1;25(9):1105-11.

Trapnell C, Williams BA, Pertea G, Mortazavi A, Kwan G, van Baren MJ, Salzberg SL, Wold BJ, Pachter L. Transcript assembly and quantification by RNA-Seq reveals unannotated transcripts and isoform switching during cell differentiation. Nat Biotechnol. 2010 May;28(5):511-5.

Data Release Policy

Data users may freely use ENCODE data, but may not, without prior consent, submit publications that use an unpublished ENCODE dataset until nine months following the release of the dataset. This date is listed in the Restricted Until column, above. The full data release policy for ENCODE is available here.