5' RNA/GRO-Seq TSS Analysis Tutorial
Sequencing the 5' end of cap-protected RNAs enables the
identification of Transcription Start Sites (TSS) at
nucleotide resolution. Several varieties of this method
exist, including CAGE, TSS-Seq, PRO-Cap, 5'RNA-Seq, and
5'GRO-Seq, and each is designed to collect the 5' ends of
RNA for sequencing but uses a different enzymatic or enrichment
strategy to achieve this goal. 5'GRO-Seq and
PRO-Cap are techniques that perform 5' RNA sequencing on
nascent RNA, allowing the identification of TSS for unstable
transcripts such as eRNAs. These techniques are
particularly powerful for identifying active regulatory
elements (enhancers + promoters) and assessing their
activity in a quantitative manner with relatively low
sequencing coverage. For example, 20-40 million reads
from a 5'GRO-Seq experiment might yield the same number of
reads at enhancers as 200-400 million reads in a 5'RNA-Seq
experiment.
This tutorial will take you through the basic process of
analyzing 5'RNA-Seq data with HOMER. Generally
speaking, the analysis of each 5' RNA sequencing method is
similar. The basic idea is to identify regions with a
high density of 5' RNA sequencing reads, which on the
surface sounds really similar to finding peaks in ChIP-Seq
data (and it is!).
Introduction to Transcriptional Initiation at Metazoan Promoters
To understand the analysis of 5'RNA data, it is
worth taking a moment to highlight that there are multiple
'types' of promoters in living organisms. First of
all, there are different RNA polymerases including RNA
polymerase I (rRNA), II (mRNA, lncRNA, miRNA), III (tRNA),
IV (plant specific), viral polymerases, etc., and each
polymerase has different mechanisms of transcriptional
initiation that may vary between distantly
related organisms. Also be aware that different RNA
polymerases may generate RNAs with different covalent
modifications, and these RNAs may or may not be present in your 5' RNA
sequencing data, depending on how the experiment was
performed. By and large, most researchers are
interested in RNA polymerase II transcripts (mRNA), and as
a result most 5'RNA methods focus on the identification of
RNAs containing a 7-methylguanosine cap protecting their
5' end.
With respect to RNA polymerase II initiation sites, there
are two generally recognized 'types' of TSS. Sharp
(or Focused) TSS initiate transcription from a single
nucleotide (or +/- 2 nt) and resemble the promoters found
in molecular biology textbooks. They often contain
well-defined core-promoter elements such as the TATA box
and usually initiate transcription from a purine preceded
by a pyrimidine (PyPu, e.g. CA, with the A being the
initiating nucleotide).
The other, more common type of TSS is the broad (or dispersed)
TSS. These promoters initiate transcription from
several different sites within a large area (often 50-100
nt in size). These promoters usually lack core
promoter elements (no TATA box), but each individual
initiation site DOES normally still initiate on a purine
preceded by a pyrimidine (PyPu).
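The PyPu rule described above is easy to check by hand. The following is only a minimal sketch: the function name is_pypu is hypothetical (it is not part of HOMER), and it assumes you have already extracted the two genomic nucleotides at positions -1 and +1 relative to a candidate initiation site.

```shell
# Hypothetical helper (not a HOMER command): report whether a dinucleotide
# made of the -1 and +1 positions of a candidate TSS follows the PyPu rule,
# i.e. a pyrimidine (C or T) followed by a purine (A or G).
is_pypu() {
  case "$1" in
    [CTct][AGag]) echo "yes" ;;   # exactly two letters: pyrimidine, then purine
    *)            echo "no"  ;;   # anything else, including wrong lengths
  esac
}

is_pypu CA   # prints "yes"
is_pypu GA   # prints "no"
```

A site that fails this check is not necessarily a false TSS, since the rule describes a tendency rather than a requirement.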
False TSS - be careful of artifacts
A quick note about artifacts in 5'RNA-Seq data:
Most 5' RNA-Seq methodologies work by enriching for 5'
cap-protected RNA, which means that most of the sequence
data describes 5' RNA ends, but a fraction of it may be
noise from random RNA-Seq fragments (again, a lot like
ChIP-Seq). In particular, highly expressed RNAs may
yield "5'RNA-Seq" reads along the whole body of the gene
giving the appearance of alternative TSS which are likely
false positives. Because of this, I would highly
recommend using traditional RNA-Seq as a "background" when
analyzing 5' RNA-Seq data. This approach (described
below) may remove several real TSS from the results, but
it is also likely to remove a large number of false
positives and clean up your analysis.
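To illustrate the idea of using background reads to flag likely false TSS, here is a rough sketch. It is not how HOMER filters internally: the function name filter_by_background, the three-column input (TSS id, 5' read count, background read count), the pseudocount of 1 and the default fold threshold of 2 are all assumptions made for illustration, and the counts are assumed to be already normalized to comparable sequencing depth.

```shell
# Hypothetical helper (not a HOMER command): keep candidate TSS whose
# 5-prime read count exceeds the background count by a minimum fold.
# Input columns: TSS id, 5-prime read count, background read count.
# The first argument is the minimum fold (assumed default: 2).
filter_by_background() {
  awk -v min_fold="${1:-2}" -v pseudo=1 '
    {
      # A pseudocount avoids division by zero when background is empty.
      fold = ($2 + pseudo) / ($3 + pseudo)
      if (fold >= min_fold) print $1
    }'
}

# tssB sits in a highly expressed gene body and is dropped.
printf 'tssA 50 2\ntssB 12 40\ntssC 9 4\n' | filter_by_background 2
```

A stricter threshold removes more gene-body noise at the cost of losing weaker real TSS, which is the same trade-off described above.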
Trans-splicing of transcripts (where the 5' end of one
transcript is added to the front of another) and recapping
(where a transcript is cleaved and a new cap placed on the
truncated product) are two phenomena you may want to think
carefully about when analyzing 5' RNA-Seq data.
Trans-splicing will create false negatives and recapping
will create false positives. In certain organisms,
such as C. elegans, trans-splicing is very common, making
5'GRO-Seq a much better assay for identifying TSS than
5'RNA-Seq (i.e. measuring the 5' RNA ends before they have
a chance to trans-splice). In other organisms (e.g.
mouse, human, fly, etc.) it appears to be rare. The
degree to which transcripts are 'recapped' is a matter
of debate because recapped ends can be hard to distinguish from
true alternative TSS or noise in the 5' RNA-Seq assay.
Preprocessing and Mapping
Depending on the specific method of 5'RNA-Seq
you are analyzing, you may or may not have to think about
processing the reads before you analyze the data with
HOMER. Most techniques simply yield reads that
start at the 5' end of the RNA, and nothing special
needs to be done. CAGE in particular may require you
to remove initial 'G's that may have been added to the 5'
end of the transcript during library construction.
Also, older CAGE protocols may require you to separate the
actual CAGE tags from longer 454 reads - refer to the
author/source of the data for how to deal with the
processing of these reads.
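As an illustration of the 'G' removal step, the sketch below strips a single leading G from each read in a FASTQ stream. It makes several assumptions: the function name trim_leading_g is hypothetical, records use the standard four-line FASTQ layout, and at most one G is removed. It cannot tell an added G from a genuine genomic G, so check the library protocol before applying anything like this.

```shell
# Hypothetical helper (not a HOMER command): remove one leading G from each
# read of a four-line-per-record FASTQ stream, trimming the quality string
# by the same amount so sequence and quality lengths stay equal.
trim_leading_g() {
  awk '
    NR % 4 == 2 {                       # sequence line
      strip = (substr($0, 1, 1) == "G") ? 1 : 0
      print substr($0, strip + 1)
      next
    }
    NR % 4 == 0 {                       # quality line of the same record
      print substr($0, strip + 1)
      next
    }
    { print }                           # header and separator lines
  '
}

# The first read loses its leading G; the second is unchanged.
printf '@r1\nGACGT\n+\nIIIII\n@r2\nACGT\n+\nIIII\n' | trim_leading_g
```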
Mapping 5'RNA-Seq reads to the genome should be done with
a splicing-aware mapper like STAR (see the HOMER documentation on
mapping reads for more details). You could use bowtie or
another DNA-based mapping algorithm for 5'GRO-Seq, since nascent
RNA reads rarely span splice junctions,
although STAR is fine for 5'GRO-Seq too.
Creating Tag Directories and Quality Control
Creation of a 5'RNA-Seq tag directory works the
same way as with ChIP-Seq or RNA-Seq.
Finding TSS from 5'RNA-Seq Data
The basic idea behind identifying TSS from 5'RNA
data is similar to finding peaks in ChIP-Seq data.
Active TSS are likely to generate several reads within a
confined space (<150 bp to cover both broad and focused
promoters). findPeaks is already designed to look
for regions like this, but, unlike for ChIP-Seq, we want
to make sure we search each strand independently.
Also, active TSS are likely to produce several reads from
the same initiation site, giving them the appearance of
clonal/PCR artifacts. However, in this case, we do
not want to penalize clonal reads since they provide the
dynamic range of expression at each nucleotide.
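To make the strand-specific, per-nucleotide view concrete, the sketch below counts read 5' ends at each position on each strand. This is not what findPeaks does internally. It is a minimal illustration that assumes reads are available in BED6 format (0-based start, exclusive end, strand in column 6), and the function name count_5p_ends is hypothetical.

```shell
# Hypothetical helper (not a HOMER command): count read 5-prime ends per
# position and strand from BED6 input. Duplicate reads are kept on purpose,
# since repeated starts at one nucleotide carry the expression signal.
count_5p_ends() {
  awk 'BEGIN { OFS = "\t" }
       {
         # 1-based 5-prime position: start+1 on the plus strand,
         # end on the minus strand.
         pos = ($6 == "-") ? $3 : $2 + 1
         n[$1 OFS pos OFS $6]++
       }
       END { for (k in n) print k, n[k] }' | LC_ALL=C sort -k1,1 -k2,2n -k3,3
}

# Two plus-strand reads share a start site; one minus-strand read stands alone.
printf 'chr1\t100\t150\tr1\t0\t+\nchr1\t100\t140\tr2\t0\t+\nchr1\t120\t200\tr3\t0\t-\n' \
  | count_5p_ends
```

The output columns are chromosome, 1-based position, strand and read count. Keeping the strand in the key is what prevents reads from opposite strands being merged into a single site.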
To find TSS with findPeaks, simply run:
findPeaks IMR90-5GROseq/ -o auto -style tss
If you also performed a traditional, non-5' version of the
assay (i.e. RNA-Seq for 5'RNA-Seq, or GRO-Seq for
5'GRO-Seq), then use that as background:
findPeaks IMR90-5GROseq/ -o auto -style tss -i <background tag directory>
The "-style tss" option automatically configures
findPeaks to work well with 5'RNA-Seq data. It
basically expands to "-C 0 -strand
separate -fragLength 1 -inputFragLength 1 -tbp 0 -inputtbp
0 -size 150". When used with "-o auto", the output
TSS will be placed in a file called 'tss.txt' in the
target tag directory.
Output: Peaks will be centered on the mode of the TSS -
i.e. the highest individual initiation site.
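The notion of the 'mode' of a TSS can be sketched as follows. This is an illustration and not HOMER's implementation: the function name tss_mode is hypothetical, the input is assumed to be one "position count" pair per line for a single TSS cluster on one strand, ties go to the first position seen, and the +/- 2 nt window used for the last column is borrowed from the description of sharp TSS earlier in this tutorial.

```shell
# Hypothetical helper (not a HOMER command): given "position count" lines
# for one TSS cluster, print the mode position, its read count, and the
# fraction of all reads that start within +/- 2 nt of the mode.
tss_mode() {
  awk -v window=2 '
    BEGIN { best = 0 }
    {
      pos[NR] = $1; cnt[NR] = $2; total += $2
      if ($2 > best) { best = $2; mode = $1 }   # first maximum wins ties
    }
    END {
      if (total == 0) exit 1                    # no reads, nothing to report
      for (i = 1; i <= NR; i++) {
        d = pos[i] - mode
        if (d < 0) d = -d
        if (d <= window) near += cnt[i]
      }
      printf "%s\t%d\t%.2f\n", mode, best, near / total
    }'
}

# All reads fall within 2 nt of position 101, as in a sharp TSS.
printf '100 2\n101 16\n102 2\n' | tss_mode
```

A fraction close to 1 is what a sharp (focused) TSS would look like, while a low fraction suggests a broad (dispersed) TSS. Any cutoff between the two is a judgment call.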
Creating UCSC Visualization Files
To visualize 5'RNA-Seq
experiments in the UCSC Genome Browser, we'll run the makeUCSCfile
program. Since
5'GRO-Seq is strand specific, we need to specify options
to ensure it is visualized on separate strands. For
example:
makeUCSCfile IMR90-5GroSeq/ -o auto -style tss
You can also make 'coverage' tracks by extending the
fragments so that they 'pileup'. Instead of
specifying "-style tss", use "-strand separate" and
"-fragLength given" to generate more traditional coverage
tracks, which are better for visualizing the data at
larger intervals (e.g. > 50 kb).
You can also use makeBigWig.pl and makeMultiWigHub.pl if
you have a webserver at your disposal to post the
resulting bigWig files (covered in more depth elsewhere in the
HOMER documentation). Each has an option
called '-cage' that will automatically generate nucleotide-resolution tracks.
Analysis of 5'RNA Data
Almost all of the routines
in HOMER dedicated to ChIP-Seq work well with 5'RNA
methods as well.