Software for motif discovery and ChIP-Seq analysis

This is the old version of the documentation: New Version

ChIP-Seq Analysis: Finding Peaks (ChIP-enriched Regions)

Finding peaks is one of the central goals of any ChIP-Seq experiment.  The basic idea is to identify regions in the genome where we find more ChIP-Seq tags than we would expect to see by chance.  There are number of different approaches one can use to find ChIP-Seq peaks, and correspondingly there are many different methods for identifying peaks from ChIP-Seq experiments.  It is not required that you use HOMER for peak finding to use the rest of the tools included in HOMER (see below).

The most important distinction between HOMER and several other peak finding algorithms is that by default HOMER searches for peaks of FIXED SIZE.  This is ideal for achieving maximum sensitivity when finding peaks for Transcription Factors or other proteins that make contact at a single point along the DNA.  Peaks at these locations should contain nearly identical tag distributions (i.e. since ChIP-fragments are only ~150 bp, binding to a specific site should not (in principle) produce peaks wider than the fragment length in either direction).

The default settings in HOMER is not particularly good at identifying "enriched blocks" which are typical of repressive histone modifications such as H3K9me3 and H3K27me3.  HOMER will find regions that are enriched regardless of shape and size, but if your research requires that you accurately identify the boundaries of large enrichment blocks of varying size, try using the option "-region". (See Finding Regions of Variable Length)  Histone marks that are more "focal", such as H3K4me1, H3K4me2, H3K4me3 or H3/H4 acetylation are easily found with HOMER when using a larger peak size (i.e. -size 1000)

The other reason HOMER uses fixed size peaks is that the assumptions needed for Motif Finding are, let's say, less violated, when dealing with regions of constant size.

Finding Transcription Factor Peaks with HOMER

To find peaks for a transcription factor use the findPeaks command:

findPeaks <tag directory> -center -o <output peak file>

i.e. findPeaks ERalpha-ChIP-Seq/ -center -o ERalpha.peaks.txt

Where the first argument must be the tag directory (required).  The "-center" option ensures the the peak is place on the region of maximum ChIP-fragment density, and is recommended when analyzing ChIP-Seq for transcription factors.  It is highly recommended that you develop a strategy for keeping track of these files.  For example, it's usually nice to place peak files in the tag directory they are associated with so that they are easy to find later.  For example:

findPeaks ERalpha-ChIP-Seq/ -center -o ERalpha-ChIP-Seq/peaks.txt

Homer attempts to automatically determine parameters needed for peak finding - most importantly the size of the peak and the length of ChIP-fragments (determined from the tag auto correlation, stored as parameters in the <tag directory>/tagInfo.txt file).  These can be overwritten using the "-size <#>" and "-fragLength <#>" options.  You can also set the minimum distance between peaks using "-minDist <#>" (default is 2.5x the peak size).  Strand specific peak finding is also available ("-strand separate").

Be default, HOMER assumes an effective genome size of 2 billion (fine for human or mouse).  This can be explicitly set using the "-gsize <#>" option.  HOMER determines the threshold of tags needed to call a peak significant by assuming that non-enriched ChIP-fragment concentrations are approximated by the Poisson distribution, which is similar to just about every other method out there.  The cutoff for statistically significant peaks can be specified by "-fdr <#>" or with "-poisson <#>".

Output: the Peak file

HOMER outputs peak information in the form of an text file which is easily opened with EXCEL.  A typical peak file will look something like this:
peak file

The first several rows start with a "#", which contains meta information about the peak finding procedure.  Below this information are the peaks, listed in each row.  Columns contain information about each peak:
  • Column 1: PeakID - a unique name for each peak (very important that peaks have unique names...)
  • Column 2: chr - chromosome where peak is located
  • Column 3: starting position of peak
  • Column 4: ending position of peak
  • Column 5: Strand (+/-)
  • Column 6: Score - number of tags found clustered together (may differ from total tag count depending on how many tags were considered per bp)
  • Column 7: Focus Ratio - fraction of tags found appropriately upstream and downstream of the peak center. Describes the chances that factor binds a single spot on the DNA. (will be set to zero if -center was NOT used)
  • Columns 8+: Statistics and Data from filtering
Two generic tools are available as part of HOMER to convert peak files to BED files and back.  This will allow you to upload your peak files to the UCSC Genome Browser, or convert peak files in BED format from another program into a peak file that can be used by HOMER.  These programs are named pos2bed.pl and bed2pos.pl, which can be used the following way:

pos2bed.pl peakfile.txt > peakfile.bed
bed2pos.pl peakfile.bed > peakfile.txt

How findPeaks works

findPeaks initially works by analyzing tag positions as opposed to ChIP-fragment densities.  This is partially done to keep the program flexible for various tasks.  In order to analyze ChIP-Seq data for peaks, the first step is to "adjust" the positions of tags to the center of their ChIP-fragments so that tag densities can be more accurately calculated.  This tag position adjustment is half of the estimated ChIP-fragment length in the 3' direction relative to the original position of the tag.

The program then sorts tags along each chromosome and efficiently looks at each tag position to calculate the number tags found within the designated peak size.  It then sorts tag positions from those with the highest density of tags through those with the lowest, assigning putative peaks and masking nearby positions within the minimum distance acceptable for neighboring peaks (-minDist).  Positions that are masked essentially represent positions near a better candidate peak, and as the program encounters masked positions, it further masked positions near positions that were already masked to ensure peaks identified come from local maxima.  After cycling through all tag positions, a putative list of peaks is generated representing local maximum in tag distributions.  These are subjected to additional filters based on input sequencing, etc.

Filtering Peaks

The initial step of peak finding is to find non-random clusters of tags, but in many cases these clusters may not be representative of try transcription factor binding events.  To increase the overall quality of peaks identified by HOMER, 3 separate filtering steps can be applied to the initial peaks identified:

Using Input/IgG Sequencing as a Control

To use an Input or IgG sequencing run as a control (HIGHLY RECOMMENDED), you must first create a separate tag directory for the input experiment (see here).  Additionally, you can use other cleaver experiments as a control, such as a ChIP-Seq experiment for the same factor in another cell or in a knockout.  To find peaks using a control, type:

findPeaks <tag directory> -center -i <control tag directory> -o <output peak file>

i.e. findPeaks ERalpha-ChIP-Seq/ -i Input-ChIP-Seq/ -center -o ERalpha.peaks.txt

HOMER offers two ways to filter peaks against a control experiment.  First, if you prefer a simple fold change, HOMER will require that each putative peak be greater than 4-fold (or specify with "-F <#>").  In the case where there are no input tags near the putative peak, HOMER automatically sets these regions to be set to the average input tag coverage to avoid dividing by zero.  Alternatively, HOMER can adhere to more statistically rigorous measures of differential tag enrichment by using the poisson distribution to model peak similarity.  To filter peaks based on their poisson p-value, use "-F <#>" but enter a number between 0 and 1 (i.e. "-F 0.0001").

Filtering Based on Local Signal

Our experience with peak finding is that often putative peaks are identified in regions of genomic duplication, or in regions where the reference genome likely differs from that of the genome being sequenced.  This produces large regions of high tag counts, and if no Input/IgG sample is available, it can be hard to exclude these regions.  Also, it may be advantageous to remove putative peaks that a spread out over larger regions as it may be difficult to pin-point the important regulatory regions within them.

To deal with this, HOMER will filter peaks based on the local tag counts (similar in principle to MACS).  Be default, HOMER requires the tag density at peaks to be 4-fold greater than in the surrounding 10 kb region.  This can be modified using "-L <#>" and "-localSize <#>" to change the fold threshold and size of the local region, respectively.  Alternatively, a poisson p-value threshold can be set using "-L <#>" between 0 and 1.

Filtering Based on Clonal Signal

When we first sifted through peaks identified in ChIP-Seq experiments we noticed there are many peaks near repeat elements that contain odd tag distributions.  These appear to arise from expanded repeats that result in peaks with high numbers of tags from only a small number of unique positions, even when many of the other positions withing the region may be "mappable".  To help remove these peaks, HOMER will compare the number of unique positions containing tags in a peak relative to the expected number of unique positions given the total number of tags in the peak.  If the ratio between the later and the former number gets to high, the peak is discarded.  The fold threshold can be set with the "-C <#>" option (default: "-C 2").  Homer uses the averageTagsPerPosition parameter in the tagInfo.txt file adjust this calculation as to not over-penalize ChIP-Seq experiments that are already highly "clonal".  If analyzing MNase or other restriction enzyme digestion experiments turn this option off ("-C 0");

Disabling Filtering

To disable Input, Local, or Clonal filtering set any combination of "-F 0 -L 0 -C 0".

Finding Peaks of fixed length using Histone Modification Data.

Finding peaks using histone modification data can be a little tricky - largely because we have very little idea what the histone marks actually do.  If you want to find peaks in histone modification data with the purpose of analyzing them for enriched motifs, read this section.  If you are trying to annotate regions of the genome that are covered by a given mark, read the next section.  The problem with histone modification data (and some other types) is that the signal can spread over large distances.  Trying to analyze large, variable length regions for motif enrichment is very difficult and not recommended.

There are two important differences between finding fixed width peaks for transcription factors and histone modifications.  The first is the size of the peaks: Most histone modifications should be analyzed using a peak size in the range of 500-2000 usually (i.e. "-size 1000").  The other is that you should omit the "-center" option.  Since you are looking at a region, you do not necessarily want to center the peak on the specific position with the highest tag density, which may be at the edge of the region.  Besides, in the case of histone modifications at enhancers, the highest signal will usually be found on nucleosomes surrounding the center of the enhancer, which is where the functional sequences and transcription factor binding sites reside.  Typically, adding the -center option moves peaks further away from the functional sequence in these scenarios.  An example for finding peaks:

findPeaks H3K4me1-ChIP-Seq/ -i Input-ChIP-Seq -size 1000 -o H3K4me1-ChIP-Seq/peaks.1kb.txt

One issue with finding histone modification peaks using the defaults in HOMER is that the Local filtering step removes several of the peaks due to the "spreading" nature of many histone modifications.  This can be good and bad.  If you are looking for nice concentrated regions of modified histones, the resulting peaks will be a nice set for further analysis such as motif finding.  However, if you are looking to identify every region in the cell that has an appreciable amount of modified histone, you may want to disable local filtering, or consider using the "-region" option below  e.g.:

findPeaks H3K4me1-ChIP-Seq/ -i Input-ChIP-Seq -size 1000 -L 0 -o H3K4me1-ChIP-Seq/peaks.1kb.txt

Also, if using MNase-treated chromatin (i.e. nucleosome mapping), you may want to use "-C 0" to avoid filtering peaks composed of highly clonal tag positions.  These can arise from well positioned nucleosomes (or sites that are nicely digested by MNase)

Finding Enriched Regions of Variable Length

If the option "-region" is specified, findPeaks will stitch together enriched peaks into regions.  Note that local filtering is turned off when finding regions.  The most important parameters for region finding are the "-size" and "-minDist" (and of course the fragment length).  First of all, "-size" specifies the width of peaks that will form the basic building blocks for extending peaks into regions.  Smaller peak sizes offer better resolution, but  larger peak sizes are usually more sensitive.  By default findPeaks uses the estimated peak size from the autocorrelation analysis, although you may want to use a larger peak size for histone modifications (i.e. 500 or 1000) where the minimum size of peaks observed are usually fairly large anyway. 

The second parameter, "-minDist", is usually used to specify the minimum distance between adjacent peaks.  If "-region" is used, this parameter then specifies the maximum distance between putative peaks that is allowed if they are to be stitched together to form a region.  By default this is 2.5x the peak size.  If you think about histone modifications, the signal is never continuous in enriched regions, with reduced signal due to non-unique sequences (that can't be mapped to) and nucleosome depleted regions.  "-minDist" informs findPeaks how big of a gap in the signal will be tolerated for adjacent peaks to be considered part of the same region.

One thing to note is that you may have to play around with these parameters to get the results you want.  If you look at the examples below, you could make arguments for using each of the tracks given what you're interested in and how you would define a "region".

For example: (in the example below, the default size comes from the autocorrelation estimate for the Macrophage-H3K4me1 dataset)

Default Parameters:
findPeaks Macrophage-H3K4me1/ -i Macrophage-IgG -size 150 -minDist 370 > output.txt (i.e. defaults)

Recommend Parameters for fixed width peaks (i.e. for motif finding):
findPeaks Macrophage-H3K4me1/ -i Macrophage-IgG -size 1000 -minDist 2500 > output.txt

Default Parameters for variable length peaks.
findPeaks Macrophage-H3K4me1/ -i Macrophage-IgG -region -size 150 -minDist 370 > output.txt

Effect on variable length peaks if we increase minDist to 1000.
findPeaks Macrophage-H3K4me1/ -i Macrophage-IgG -region -size 150 -minDist 1000 > output.txt

Recommend Parameters for variable length peaks (H3K4me1 at least).
findPeaks Macrophage-H3K4me1/ -i Macrophage-IgG -region -size 1000 -minDist 2500 > output.txt

Effect on variable peaks if we increase minDist to 10000 (H3K4me1 at least).
findPeaks Macrophage-H3K4me1/ -i Macrophage-IgG -region -size 1000 -minDist 10000 > output.txt

Variable length peaks example

findPeaks can also be used to define transcripts/exons from RNA-Seq and it's variants.  The only catch is that HOMER won't figure out the splicing.  However, for methods such as GRO-Seq (global nuclear run-on sequencing) this can be useful for getting a sense for what the transcripts are since we don't have to worry about introns in this type of data.  If you sequencing is strand specific, make sure to use "-strand specific".  Also, it's recommended to disable clonal filtering when analyzing RNA ("-C 0").

findPeaks GRO-Seq -strand separate -size 200 -fragLength 100 -minDist 2000 -region -C 0 > transcripts.txt

RNA seq example

Peak Centering and Focus Ratios

If the option "-center" is specified, findPeaks will calculate the position within the peak with the maximum ChIP-fragment overlap and calculate a focusRatio for the peak.  This is not always desired (such as with histone modifications).  The focus ratio is defined as the ratio of tags located 5' of the peak center on either strand relative to the total number of tags in the peak.  Peaks that contain tags in the ideal positions are more likely to be centered on a single binding site, and these peaks can be used to help determine what sequences are directly bound by a transcription factor.  Unfocused peaks, or peaks with low (i.e <80%) focusRatios may be the result of several closely spaced binding sites or large complexes that crosslink to multiple positions along the DNA.
focused peak schematic

To isolate focused peaks, you can use the getFocalPeaks.pl tool:

getFocalPeaks.pl <peak file> <focus % threshold> > focalPeaksOutput.txt

i.e. getFocalPeaks.pl ERpeaks.txt 0.90 > ERfocalPeaks.txt

Peak finding and Sequencing Saturation

HOMER does not try to estimate sequencing saturation, which is the estimate of whether or not you have sequenced enough tags to identify all the peaks in a given experiment.  Generally speaking, if you sequence more, you will get more peaks since your sensitivity will increase.  The only real way to assess this is if you can somehow show that all of the "functional" or "real" peaks have high tag counts (i.e. are well above the threshold for identifying a peak), meaning that sequencing more is not likely to identify more "real" peaks.  This generally cannot be determined by simply resampling the data and repeating the peak finding procedure - you need some sort of outside information to assess peak quality, such as motif enrichment or something else.  Simply resampling will likely only tell you if you've gotten to the point where you are simply resequencing the same fragments again - i.e. becoming clonal.

Command line options for findPeaks

        Usage: findPeaks <tag directory> [options]

        Finds peaks in the provided tag directory.  By default, peak list printed to stdout

        General Options:
                -o <output file> (default: stdout)
                -i <input tag directory> (Experiment to use as IgG/Input/Control)
                -size <#> (Peak size, default: auto)
                -minDist <#> (minimum distance between peaks, default: peak size x2.5)
                -gsize <#> (Set effective genome size, default: 2e9)
                -fragLength <#|auto> (Approximate fragment length, default: auto)
                -inputFragLength <#|auto> (Approximate fragment length of input tags, default: auto)
                -tbp <#> (Maximum tags per bp to count, 0 = no limit, default: auto)
                -inputtbp <#> (Maximum tags per bp to count in input, 0 = no limit, default: auto)
                -strand <both|separate> (find peaks using tags on both strands or separate, default:both)
                -norm # (Tag count to normalize to, default 10000000)
        Peak localization/shape options:
                -center (Centers peaks on maximum tag overlap and calculates focus ratios)
                -region (extends start/stop coordinates to cover full region considered "enriched")
        Peak Filtering options: (set -F/-L/-C to 0 to skip)
                -F <#> (fold enrichment over input tag count, default: 4.0)
                        if set between 0 and 1, poisson p-value cutoff will be used, i.e. 0.001
                -L <#> (fold enrichment over local tag count, default: 4.0)
                        if set between 0 and 1, poisson p-value cutoff will be used, i.e. 0.001
                -C <#> (fold enrichment limit of expected unique tag positions, default: 2.0)
                -localSize <#> (region to check for local tag enrichment, default: 10000)
                -fdr <#> (False discovery rate, default = 0.001)
                -poisson <#> (Set poisson p-value cutoff, default: uses fdr)
                -tagThreshold <#> (Set # of tags to define a peak, default: uses fdr)

Links to alternative peak finding software

Lots of quality programs exist for finding peaks in ChIP-Seq data.  Most use slightly different assumptions and peak definitions that result in slightly different sets of peaks.  Many of these programs output peak files in BED format.  To covert these to HOMER peak file format, use the bed2pos.pl program to convert the file:

bed2pos.pl peaks.bed > peaks.txt

Peak finding software links:

Next: Finding Motifs in ChIP-Seq Peaks

Can't figure something out? Questions, comments, concerns, or other feedback: