Software for motif discovery and next-gen sequencing analysis

Quantifying Data and Motifs and Comparing Peaks/Regions in the Genome

Homer contains a useful, all-in-one program for performing peak annotation called annotatePeaks.pl.  In addition to associating peaks with nearby genes, annotatePeaks.pl can perform Gene Ontology Analysis, genomic feature association analysis (Genome Ontology), associate peaks with gene expression data, calculate ChIP-Seq Tag densities from different experiments, and find motif occurrences in peaks.  annotatePeaks.pl can also be used to create histograms and heatmaps.  Description of the annotation functions are covered here, while quantification of tags, motifs, histograms, etc. are covered below.

Basic usage (see Annotation):

annotatePeaks.pl <peak/BED file> <genome> [options] > <output file>

i.e. annotatePeaks.pl ERpeaks.txt hg18 > outputfile.txt

Everything packed into one program

The great thing about annotatePeaks.pl is that it combines many features into a single location.  It is the primary program to investigate how sequencing reads, sequence motifs, and other information and annotations interact.

Three primary options are available to specify types of data that can be processed by annotatePeaks.pl:

-d <tag directory1> [tag directory 2] ...
-m <motif file 1> [motif file 2] ...
-p <peak/BED file 1> [peak/BED file 2] ...

By default, these data types are processed relative to each peak/region provided in the primary input file.  There are a bunch of options that help fine tune how each type of data is considered by the program covered below.

However, annotatePeaks.pl can take the same input data and do other things, such as make histograms and heatmaps, allowing you to explore the data in a different way.

-hist <# bin size>

Acceptable Input files

annotatePeaks.pl accepts HOMER peak files or BED files:

HOMER peak files should have at minimum 5 columns (separated by TABs, additional columns will be ignored):
  • Column1: Unique Peak ID
  • Column2: chromosome
  • Column3: starting position
  • Column4: ending position
  • Column5: Strand (+/- or 0/1, where 0="+", 1="-")
BED files should have at minimum 6 columns (separated by TABs, additional columns will be ignored)
  • Column1: chromosome
  • Column2: starting position
  • Column3: ending position
  • Column4: Unique Peak ID
  • Column5: not used
  • Column6: Strand (+/- or 0/1, where 0="+", 1="-")
In theory, HOMER will accept BED files with only 4 columns (+/- in the 4th column), and files without unique IDs, but this is NOT recommended.  For one, if you don't have unique IDs for your regions, it's hard to go back and figure out which region contains which peak.

Mac Users: If using a EXCEL to prepare input files, make sure to save files as a "Text (Windows)" if running MacOS - saving as "Tab delimited text" in Mac produces problems for the software.  Otherwise, you can run the script "changeNewLine.pl <filename>" to convert the Mac-formatted text file to a Windows/Dos/Unix formatted text file.

If errors occur, it is likely that the file is not in the correct format, or the first column is not actually populated with unique identifiers.

TSS Mode

Instead of supplying a peak file, HOMER makes it easy to do analysis of features near TSS.  To analyze transcription start sites as if they were peaks, use "tss" as the first argument.

annotatePeaks.pl tss hg18 ...

To restrict the analysis to a subset of TSS promoters, add the option "-list <file>" where the file is a Tab-delimited text file with the first column containing gene Identifiers.  TSS mode also works with custom gene definitions specified with "-gtf <GTF file>".

Specifying the Peak Size - the most important parameter

Pretty much everything explained in the following sections depends heavily on the "-size <#>" parameter.  A couple of quick notes:

-size <#> : will perform analysis on the # bp surrounding the peak centers [example: -size 1000]
-size <#,#> : will perform analysis from # to # relative to peak center [example: -size -200,50]
-size given : will perform analysis on different sized peaks - size given by actual coordinates in peak/BED file [example: -size given]

For example, if you peaks are actually transcription start sites, you might want to specify "-size -500,100" to perform the analysis upstream -500 bp to +100 bp downstream.  If your peaks/regions are actually "transcript" regions, specifying "-size given" will count reads along the entire transcript.  If it doesn't make sense, watch Delta Force I and II back to back.  That should numb the brain enough to get it.

Annotating Individual Peaks

Calculating ChIP-Seq Tag Densities across different experiments

annotatePeaks.pl is useful program for cross-referencing data from multiple experiments.  In order to count the number of tags from different sequencing experiments, you must first create tag directories for each of these experiments.  Once created, tag counts from these directories in the vicinity of your peaks can be added by specifying "-d <tag directory 1> <tag directory 2> ...".  You can specify as many tag directories as you like.  Tag totals for each directory will be placed in new columns starting on column 18.  For example:

annotatePeaks.pl pu1peaks.txt mm8 -size 400 -d Macrophage-PU.1/ Bcell-PU.1/ > output.txt

output.txt, when opened in EXCEL, will look like this:

HOMER automatically normalizes each directory by the total number of mapped tags such that each directory contains 10 million tags.  This total can be changed by specifying "-norm <#>" or by specifying "-noadj", which will skip this normalization step.

The other important parameter when counting tags is to specify the size of the region you would like to count tags in with "-size <#>".  For example, "-size 1000" will count tags in the 1kb region centered on each peak, while "-size 50" will count tags in the 50 bp region centered on the peak (default is 200).  The number of tags is not normalized by the size of the region.

One last thing to keep in mind is that in order to fairly count tags, HOMER will automatically center tags based on their estimated ChIP-fragment lengths.  This is can be overridden by specifying a fixed ChIP-fragment length using "-len <#>" or "-fragLength <#>".  This is important to consider when trying to count RNA tags, or things such as 5' RNA CAGE/TSS-Seq, where you may want to specify "-len 0" so that HOMER doesn't try to move the tags before counting them.

HOMER can also quantify signal in bedGraph and WIG files using "-bedGraph <bedgraph file1> ..." and "-wig <wiggle file1> ...", respectively.

Making Scatter Plots

X-Y scatter plots are a great way to present information, and by counting tag densities from different tag directories, you can visualize the relative levels of different sequencing experiments.  Because of the large range of values, it can be difficult to appreciate the relationship between data sets without log transforming the data (or sqrt to stay Poisson friendly).  Also, due to the digital nature of tag counting, it can be hard to properly assess the data from a X-Y scatter plot since may of the data points will have the same values and overlap.  To assist with these issues, you can specify "-log" or "-sqrt" to transform the data.  These functions will actually report "log(value+1+rand)" and "sqrt(value+rand)", respectively, where rand is a random "fraction of a tag" that adds jitter to your data so that data points with low tag counts will not have exactly the same value. For example, lets look at the distribution of H3K4me1 and H3K4me3 near Oct4 peaks in mouse embryonic stem cells:

annotatePeaks.pl Oct4.peaks.txt mm8 -size 1000 -d H3K4me1-ChIP-Seq/ H3K4me3-ChIP-Seq/ > output.txt

Opening output.txt with EXCEL and plotting the last two columns:

Standard scatter plot

Using EXCEL to take the log(base 2) of the data:

                    digital scatter plot

Now using the "-log" option:
annotatePeaks.pl Oct4.peaks.txt mm8 -size 1000 -log -d H3K4me1-ChIP-Seq/ H3K4me3-ChIP-Seq/ > output.txt

scatter plot with -log

Believe it or not, all of these X-Y plots show the same data.  Interesting, eh?

Finding instances of motifs near peaks

Figuring out which peaks have instances of motifs found with findMotifsGenome.pl is very easy.  Simply use "-m <motif file1> <motif file 2>..." with annotatePeaks.pl.  (Motif files can be concatenated into a single file for ease of use)  This will search for each of these motifs near each peak in your peak file.  Use "-size <#>" to specify the size of the region around the peak center you wish to search.  Found instances of each motif will be reported in additional columns of the output file.  For example:

annotatePeaks.pl pu1peaks.txt mm8 -size 200 -m pu1.motif cebp.motif > output.txt

Opening output.txt with EXCEL:
annotatePeaks.pl motif
                  output examle

Each instance of the motif is specified in the following format (separated by commas):

Distance from Peak Center(Sequence Matching Motif,Strand,Average Conservation)

The average conservation will not be reported unless you specify "-cons".  Also, when finding motifs, the average CpG/GC content will automatically be reported since it has to extract peak sequences from the genome anyway.

There are also a bunch of motif specific options for specialized analysis:

"-norevopp" (only search + strand relative to peak strand for motifs)
"-nmotifs" (just report the total number of motifs per peak)
"-mscore" (report the maximum log-odds score of the motif in each peak)
"-rmrevopp <#>" (tries to avoid double counting reverse opposites within # bp)
"-mdist" (reports distance to closest motif)
"-fm <motif file 1> [motif file 2]" (list of motifs to filter out of results if found)
"-mfasta <filename>" (reports sites in a fasta file - for building new motifs)
"-mbed <filename>" (Output motif positions to a BED file to load at UCSC - see below)
"-matrix <filename>" (outputs a motif co-occurrence matrix with the p-value of co-occurrence assuming instance of each motif are independently distributed amongst the peaks)

Visualizing Motif positions in the UCSC Genome Browser

This feature may seem slightly out of place, but since annotatePeaks.pl is the workhorse of HOMER, you can add "-mbed <filename>" in conjunction with "-m <motif file 1> [motif file 2] ..." to produce a BED file describing motif positions near peaks that can be loaded as a custom track in the UCSC genome browser.  In the example below, you would load motif.bed as a custom track:

annotatePeaks.pl pu1peaks.txt mm8 -size 200 -m pu1.motif cebp.motif -mbed motif.bed > output.txt

                  motif example

Finding the distance to other sets of Peaks

In order to find the nearest peak from another set of peaks, use "-p <peak file 1> [peak file 2] ...".  This will add columns to the output spreadsheet that will specify the nearest peak ID and the distance to that peak.  If all you want is the distance (so you can sort this column), add the option "-pdist" to the command.  Otherwise, if you prefer to count the number of peaks in the peak file found within the indicated regions (i.e. with "-size <#>"), add "-pcount".

Creating Histograms from High-throughput Sequencing data and Motifs

HOMER can be used to make histograms that document sequencing library and motif densities relative to specific positions in the genome.  This can be done near peaks, subsets of peaks, or near promoters, exon junctions or anywhere else you find interesting.  To make histograms, use the annotatePeaks.pl program but add the parameters "-hist <#>" to produce a tab delimited text file that can then be visualized using EXCEL or other data visualization software.

Basic usage:

annotatePeaks.pl <peak file> <genome> -size <#> -hist <#> -d <tag directory 1> [tag directory2] ... -m <motif 1> <motif 2> ... >  <output matrix file>

i.e. annotatePeaks.pl ERpeaks.txt hg18 -size 6000 -hist 25 -d MCF7-H3K4me1/ MCF7-H3K4me2/ MCF7-H3K4me3/ > outputfile.txt

Running this command is very similar to creating annotated peak files - in fact, most of the data can be used to make both types of files - hence the reason for combining this functionality in the same command.  Be default, HOMER normalizes the output histogram such that the resulting units are per bp per peak, on top of the standard total mapped tag normalization of 10 million tags.

Histograms of Tag Directories:

For each tag directory or motif, HOMER will output 3 columns in the histogram.  In the case of tag directories, the first column will indicate ChIP-Fragment Coverage, which is calculated by extending tags by their estimated ChIP-fragment length, and is analogous to the profiles made for the UCSC Genome Browser.  The 2nd and 3rd columns report the density of 5' and 3' aligned tags, and are independent of fragment length.  For example, lets look at H3K4me2 distribution near Androgen Receptor (AR) peaks before and after 16 hours of treatment with testosterone (dht):

annotatePeaks.pl ARpeaks.txt hg18 -size 4000 -hist 10 -d H3K4me2-control/ H3K4me2-dht-16h/ > outputfile.txt

Opening outputfile.txt with EXCEL, we see:

histogram excel

Graphing columns B and E while using column A for the x-coordinates, we get the following:
                H3K4me2 near AR peaks

However, if we graph only the 5' and 3' tags that come from the H3K4me2-dht-16h directory (columns F and G):

                of H3K4me2 tags near AR peaks
Here we can see how the 5' and 3' reads from the H3K4me2 marked nucleosomes are distributed near the AR sites.

Histograms of Motif Densities:

Making histograms out of motif occurrences is very similar to sequencing tag distributions.  Run the annotatePeaks.pl program with "-hist <#>" and "-m <motif file>" (you can also find motif densities and tag densities at the same time):

annotatePeaks.pl ARpeaks.txt hg18 -size 1000 -hist 5 -m are.motif fox.motif ap1.motif > outputfile.txt

Graphing outputfile.txt with EXCEL:

motif distribution example

Centering Peaks on Motifs

One cool analysis strategy is to center peaks on a specific motif.  For example, by centering peak for the Androgen Receptor Transcription Factor on the ARE motif (GNACANNNTGTNC), you can map the spacial relationship between the motif and other sequence features and sequencing reads.

To center peaks on a motif, run annotatePeaks.pl with the following options:

annotatePeaks.pl <peak file> <genome> -size <#> -center <motif file> > newpeakfile.txt

i.e. annotatePeaks.pl ARpeaks.txt hg18r -size 200 -center are.motif > areCenteredPeaks.txt

Now the idea is to use the new peak file to perform analysis:

annotatePeaks.pl areCenteredPeaks.txt hg18 -size 6000 -hist 10 -d H3K4me2-notx/ ... > output.txt

Creating Heatmaps from High-throughput Sequencing data

HOMER is not capable of generating actual Heatmaps per se, but it will generate the data matrix (similar to a gene expression matrix) than can then be visualized using standard gene expression heatmap tools.  For example, I will generate a heatmap data matrix file using HOMER, and then open it with Cluster 3.0 (Micheal Eisen/de Hoon) to cluster it and/or visualize it with Java Tree View (by Alok J. Saldanha).  In reality, you can use any clustering and/or heatmap visualization software (i.e. R).

Basic usage (add "-ghist" when making a histogram):

annotatePeaks.pl <peak file> <genome> -size <#> -hist <#> -ghist -d <tag directory 1> [tag directory2] ... > <output matrix file>

i.e. annotatePeaks.pl ARpeaks.txt hg18 -size 6000 -hist 25 -ghist -d H3K4me2-control/ H3K4me2-dht-16h/ > outputfile.txt

Running this command is very similar to making histograms with annotatePeaks.pl.  In fact, a heatmap isn't really all that different from a histogram - basically, instead of averaging all of the data from each peak, we keep data from each peak separate and visualize it all together in a heatmap.  The key difference when making a heat map or a histogram is that you must add "-ghist" when making a heatmap.

Format of Data Matrix Output File

The resulting file is a tab-delimited text file where the first row is a header file and the remaining rows represent each peak from the input peak file.  The first set of columns will describe the read densities from the first tag directory as a function of distance from the center of the peak, with bin sizes corresponding to the parameter used with "-hist <#>".  After the first block of columns, a second block will start over with read densities from the 2nd tag directory, and so on.  If you would like to cluster this file to help organize the patterns, make sure you only cluster the "genes" (i.e. rows).

Example: annotatePeaks.pl ARpeaks.txt hg18 -size 6000 -hist 25 -ghist -d H3K4me2-control/ H3K4me2-dht-16h/ > outputfile.txt
After creating the file, I loaded into Cluster 3.0 to cluster and clustered the genes using "centered correlation" as the distance metric, then loaded that output into Java Tree View.
heat map example

General Options to Control Data Analysis Behavior

-strand <+|-|both> (only look for motif etc. on + or - strand, default both)
-fragLength <#> (Fragment length, default=auto, might want to set to 0 for RNA)
-pc <#> (maximum number of tags to count per bp, default=0 [no maximum])
-cons (Retrieve conservation information for peaks/sites - creates new column for this information)
-CpG (Calculate CpG/GC content)
-norevopp (do not search for motifs on the opposite strand [works with -center too])
-norm <#> (normalize tags to this tag count, default=1e7, 0=average tag count in all directories)
Use -noadj to disable tag normalization for sequencing depth

Command line options for annotatePeaks.pl

    Usage: annotatePeaks.pl <peak file | tss> <genome version>  [additional options...]

    Available Genomes (required argument): (name,org,directory,default promoter set)

    User defined annotation files (default is UCSC refGene annotation):
        annotatePeaks.pl accepts GTF (gene transfer formatted) files to annotate positions relative
        to custom annotations, such as those from de novo transcript discovery or Gencode.
        -gtf <gtf format file> (-gff and -gff3 can work for those files, but GTF is better)

    Peak vs. tss/tts/rna mode (works with custom GTF file):
        If the first argument is "tss" (i.e. annotatePeaks.pl tss hg18 ...) then a TSS centric
        analysis will be carried out.  Tag counts and motifs will be found relative to the TSS.
        (no position file needed) ["tts" now works too - e.g. 3' end of gene]
        ["rna" specifies gene bodies, will automaticall set "-size given"]
        NOTE: The default TSS peak size is 4000 bp, i.e. +/- 2kb (change with -size option)
        -list <gene id list> (subset of genes to perform analysis [unigene, gene id, accession,
             probe, etc.], default = all promoters)
        -cTSS <promoter position file i.e. peak file> (should be centered on TSS)

    Primary Annotation Options:
        -mask (Masked repeats, can also add 'r' to end of genome name)
        -m <motif file 1> [motif file 2] ... (list of motifs to find in peaks)
            -mscore (reports the highest log-odds score within the peak)
            -nmotifs (reports the number of motifs per peak)
            -mdist (reports distance to closest motif)
            -mfasta <filename> (reports sites in a fasta file - for building new motifs)
            -fm <motif file 1> [motif file 2] (list of motifs to filter from above)
            -rmrevopp <#> (only count sites found within <#> on both strands once, i.e. palindromic)
            -matrix <prefix> (outputs a motif co-occurrence files:
                prefix.count.matrix.txt - number of peaks with motif co-occurrence
                prefix.ratio.matrix.txt - ratio of observed vs. expected  co-occurrence
                prefix.logPvalue.matrix.txt - co-occurrence enrichment
                prefix.stats.txt - table of pair-wise motif co-occurrence statistics
                additional options:
                -matrixMinDist <#> (minimum distance between motif pairs - to avoid overlap)
                -matrixMaxDist <#> (maximum distance between motif pairs)
            -mbed <filename> (Output motif positions to a BED file to load at UCSC (or -mpeak))
        -d <tag directory 1> [tag directory 2] ... (list of experiment directories to show
            tag counts for) NOTE: -dfile <file> where file is a list of directories in first column
        -bedGraph <bedGraph file 1> [bedGraph file 2] ... (read coverage counts from bedGraph files)
        -wig <wiggle file 1> [wiggle file 2] ... (read coverage counts from wiggle files)
        -p <peak file> [peak file 2] ... (to find nearest peaks)
            -pdist to report only distance (-pdist2 gives directional distance)
            -pcount to report number of peaks within region
        -vcf <VCF file> (annotate peaks with genetic variation infomation, one col per individual)
            -editDistance (Computes the # bp changes relative to reference)
            -individuals <name1> [name2] ... (restrict analysis to these individuals)
        -gene <data file> ... (Adds additional data to result based on the closest gene.
            This is useful for adding gene expression data.  The file must have a header,
            and the first column must be a GeneID, Accession number, etc.  If the peak
            cannot be mapped to data in the file then the entry will be left empty.
        -go <output directory> (perform GO analysis using genes near peaks)
        -genomeOntology <output directory> (perform genomeOntology analysis on peaks)
            -gsize <#> (Genome size for genomeOntology analysis, default: 2e9)

    Annotation vs. Histogram mode:
        -hist <bin size in bp> (i.e 1, 2, 5, 10, 20, 50, 100 etc.)
        The -hist option can be used to generate histograms of position dependent features relative
        to the center of peaks.  This is primarily meant to be used with -d and -m options to map
        distribution of motifs and ChIP-Seq tags.  For ChIP-Seq peaks for a Transcription factor
        you might want to use the -center option (below) to center peaks on the known motif
        ** If using "-size given", histogram will be scaled to each region (i.e. 0-100%), with
        the -hist parameter being the number of bins to divide each region into.
            Histogram Mode specific Options:
            -nuc (calculated mononucleotide frequencies at each position,
                Will report by default if extracting sequence for other purposes like motifs)
            -di (calculated dinucleotide frequencies at each position)
            -histNorm <#> (normalize the total tag count for each region to 1, where <#> is the
                minimum tag total per region - use to avoid tag spikes from low coverage
            -ghist (outputs profiles for each gene, for peak shape clustering)
            -rm <#> (remove occurrences of same motif that occur within # bp)

    Peak Centering: (other options are ignored)
        -center <motif file> (This will re-center peaks on the specified motif, or remove peak
            if there is no motif in the peak.  ONLY recentering will be performed, and all other
            options will be ignored.  This will output a new peak file that can then be reanalyzed
            to reveal fine-grain structure in peaks (It is advised to use -size < 200) with this
            to keep peaks from moving too far (-mirror flips the position)
        -multi (returns genomic positions of all sites instead of just the closest to center)

    Advanced Options:
        -len <#> / -fragLength <#> (Fragment length, default=auto, might want to set to 0 for RNA)
        -size <#> (Peak size[from center of peak], default=inferred from peak file)
            -size #,# (i.e. -size -10,50 count tags from -10 bp to +50 bp from center)
            -size "given" (count tags etc. using the actual regions - for variable length regions)
        -log (output tag counts as log2(x+1+rand) values - for scatter plots)
        -sqrt (output tag counts as sqrt(x+rand) values - for scatter plots)
        -strand <+|-|both> (Count tags on specific strands relative to peak, default: both)
        -pc <#> (maximum number of tags to count per bp, default=0 [no maximum])
        -cons (Retrieve conservation information for peaks/sites)
        -CpG (Calculate CpG/GC content)
        -ratio (process tag values as ratios - i.e. chip-seq, or mCpG/CpG)
        -nfr (report nuclesome free region scores instead of tag counts, also -nfrSize <#>)
        -norevopp (do not search for motifs on the opposite strand [works with -center too])
        -noadj (do not adjust the tag counts based on total tags sequenced)
        -norm <#> (normalize tags to this tag count, default=1e7, 0=average tag count in all directories)
        -pdist (only report distance to nearest peak using -p, not peak name)
        -map <mapping file> (mapping between peak IDs and promoter IDs, overrides closest assignment)
        -noann, -nogene (skip genome annotation step, skip TSS annotation)
        -homer1/-homer2 (by default, the new version of homer [-homer2] is used for finding motifs)

Can't figure something out? Questions, comments, concerns, or other feedback: