Software for motif discovery and ChIP-Seq analysis

This is the old version of the documentation; a newer version is available.

ChIP-Seq Analysis: Finding Enriched Motifs in ChIP-Seq Peaks

HOMER was initially developed to automate the process of finding enriched motifs in ChIP-Seq peaks.  More generally, HOMER analyzes any set of genomic positions for enriched motifs, not only ChIP-Seq peaks.  All the user really needs is a file containing genomic coordinates (i.e. a peak file), and HOMER will generally take care of the rest.  To analyze a peak file for motifs, run the following command:

findMotifsGenome.pl <peak file> <genome> <output directory> [options]

e.g. findMotifsGenome.pl ERpeaks.txt hg18r ER_MotifOutput/ -len 8,10,12

A variety of output files will be placed in the <output directory>, including html pages showing the results.

The findMotifsGenome.pl program is a wrapper that helps set up the data for analysis using the HOMER motif discovery algorithm.  By default this will perform de novo motif discovery as well as check the enrichment of known motifs.  If you have not done so already, please look over this page describing how HOMER analyzes sequences for enriched motifs.

An important prerequisite for analyzing genomic motifs is that the appropriate genome must be configured for use with HOMER.

Acceptable Input files

findMotifsGenome.pl accepts files in HOMER's "peak file" format.  The minimum file requirements are as follows (separated by TABs):
  • Column1: Unique Peak ID
  • Column2: chromosome
  • Column3: starting position
  • Column4: ending position
  • Column5: Strand (+/- or 0/1, where 0="+", 1="-")
Additional columns will be ignored.  If starting with a BED file, convert it to a peak file using the bed2pos.pl program.  If using EXCEL on MacOS, make sure to save files as "Text (Windows)".  If errors occur, it is likely that the file is not in the correct format, or that the first column is not actually populated with unique identifiers.
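The column requirements above can be checked before running HOMER.  The following is a minimal Python sketch, not part of HOMER; parse_peak_line is a hypothetical helper name, and the checks cover only what is stated above (five TAB-separated columns, a non-empty ID, numeric positions, and a strand of +/- or 0/1).

```python
def parse_peak_line(line):
    """Parse one TAB-separated peak line into a dict; raise ValueError if malformed."""
    fields = line.rstrip("\r\n").split("\t")
    if len(fields) < 5:
        raise ValueError(f"expected at least 5 tab-separated columns, got {len(fields)}")
    peak_id, chrom, start, end, strand = fields[:5]  # additional columns are ignored
    if not peak_id:
        raise ValueError("peak ID (column 1) is empty")
    if strand not in {"+", "-", "0", "1"}:
        raise ValueError(f"unrecognized strand {strand!r}")
    # 0 means "+", 1 means "-", as in the column description
    strand = {"0": "+", "1": "-"}.get(strand, strand)
    # int() raises ValueError if a position is not a whole number
    return {"id": peak_id, "chr": chrom, "start": int(start),
            "end": int(end), "strand": strand}


example = "peak1\tchr1\t1000\t1200\t0\tignored extra column"
print(parse_peak_line(example))
```

A line separated by spaces instead of TABs fails the column count check, which is a quick way to spot files that were saved in the wrong format.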

!!! COMMON PROBLEM: If this program isn't working, make sure you save your peak files as "text (windows)" from EXCEL when on a Mac.  Run the checkPeakFile.pl program to see if your file is in the correct format, and use changeNewLine.pl if you didn't save your file in "text (windows)" format.
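The underlying issue is most likely the line-ending characters written by EXCEL on a Mac.  The sketch below shows the kind of conversion involved, assuming the file uses carriage returns (CR) or CR+LF where Unix tools expect a plain line feed.  It is an illustration in Python, not the implementation of changeNewLine.pl, and normalize_newlines is a hypothetical name.

```python
def normalize_newlines(data: bytes) -> bytes:
    """Convert CRLF and bare CR line endings to LF."""
    # Replace CRLF first so that it does not turn into two line feeds.
    return data.replace(b"\r\n", b"\n").replace(b"\r", b"\n")


# Two peaks separated by bare carriage returns (assumed problem case)
raw = b"peak1\tchr1\t100\t300\t+\rpeak2\tchr1\t500\t700\t-\r"
print(normalize_newlines(raw).decode().splitlines())
```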

!!! OK - an even MORE common PROBLEM, particularly if you use a different peak finding program: make SURE you use UNIQUE peak IDs.  If you think about it, the point of having a peak ID is so that you can tell the peaks apart, so having duplicates is a horrible idea!!!  Repeated peak IDs will cause older versions of HOMER to crash!!  The program renamePeaks.pl is now included to rename the peaks if you need help with this.
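To see whether a file has this problem, the first column can be checked for repeats.  Below is a minimal Python sketch with hypothetical helper names; the suffix scheme (-2, -3, ...) is an assumption made for illustration and is not necessarily how renamePeaks.pl names peaks.

```python
from collections import Counter


def find_duplicate_ids(peak_ids):
    """Return the sorted list of IDs that occur more than once."""
    counts = Counter(peak_ids)
    return sorted(pid for pid, n in counts.items() if n > 1)


def make_unique(peak_ids):
    """Keep the first occurrence of each ID; append -2, -3, ... to repeats."""
    seen = Counter()   # how many times each original ID has been bumped
    taken = set()      # IDs already assigned in the output
    result = []
    for pid in peak_ids:
        candidate = pid
        while candidate in taken:
            seen[pid] += 1
            candidate = f"{pid}-{seen[pid] + 1}"
        taken.add(candidate)
        result.append(candidate)
    return result


ids = ["peakA", "peakB", "peakA", "peakA"]
print(find_duplicate_ids(ids))
print(make_unique(ids))
```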

Important motif finding parameters

Region Size ("-size <#>") - this specifies the size of the region (centered on the peak centers) in which to look for motifs.  I'd recommend 50 bp for establishing the primary motif bound by a given transcription factor, 200 bp for finding "co-enriched" motifs for a transcription factor, and 1000 bp for searching H3K4me or H3/H4 acetylated regions.
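To make "centered on the peak centers" concrete, here is a small Python sketch of resizing a peak to a fixed width around its midpoint.  The function name resize_peak is hypothetical, and the exact coordinate convention and rounding HOMER uses are not specified here, so treat the arithmetic as an illustration only.

```python
def resize_peak(start, end, size):
    """Return (new_start, new_end) of width `size` centered on the peak midpoint."""
    center = (start + end) // 2
    new_start = center - size // 2
    # No clamping at chromosome ends is done in this sketch.
    return new_start, new_start + size


# A 200 bp peak narrowed to 50 bp for primary motif finding
print(resize_peak(1000, 1200, 50))
```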

Motif length ("-len <#>" or "-len <#>,<#>,...") - specify the length of motifs to be found.  HOMER will find motifs of each size separately and then combine the results at the end.  The length of time it takes to find motifs increases greatly with increasing size.  In general, it's best to try out enrichment with shorter lengths (e.g. 8 and/or 10) before trying longer lengths (e.g. 12 or 14).  Much longer motifs can be found with HOMER, but it's best to use smaller sets of sequence when trying to find long motifs (e.g. use "-len 20 -size 50"), otherwise it may take way too long (or take too much memory).

Number of motifs to find ("-S <#>") - specifies the number of motifs of each length to find. (recommend "-S 50" or "-S 100").

Normalize GC% content instead of CpG% content ("-gc"), or disable GC/CpG normalization ("-noweight").

Use custom background regions ("-bg <peak file of background regions>") - these will still be normalized for CpG% or GC% content, just like randomly chosen sequences.

By default, findMotifsGenome.pl uses the binomial distribution to score motifs.  This works well when the number of background sequences greatly outnumbers the target sequences.  However, if you are using the "-bg" option above and the number of background sequences is smaller than the number of target sequences, it is a good idea to use the hypergeometric distribution instead ("-h").
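The difference between the two scoring schemes can be illustrated with the textbook forms of both tests.  The Python sketch below is not HOMER's internal code and the counts are made up for illustration.  The binomial test treats the background motif frequency as a fixed rate, while the hypergeometric test accounts for drawing the targets without replacement from a finite pool, which matters when the background is small.

```python
from math import comb


def binomial_pvalue(k, n, p):
    """P(X >= k) for X ~ Binomial(n, p): n targets, background motif rate p."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k, n + 1))


def hypergeometric_pvalue(k, n, K, N):
    """P(X >= k) when n targets are drawn without replacement from N sequences,
    K of which contain the motif."""
    upper = min(n, K)
    hits = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1))
    return hits / comb(N, n)


# Made-up counts: 100 targets (30 with the motif), 80 background (8 with the motif)
n_target, k_target = 100, 30
n_bg, k_bg = 80, 8

p_binom = binomial_pvalue(k_target, n_target, k_bg / n_bg)
p_hyper = hypergeometric_pvalue(k_target, n_target, k_target + k_bg, n_target + n_bg)
print(f"binomial: {p_binom:.3g}  hypergeometric: {p_hyper:.3g}")
```

With a background this small, the two p-values can differ noticeably; as the background grows relative to the targets, sampling without replacement behaves more like sampling with a fixed rate and the two tests converge.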

Find enrichment of individual oligos ("-oligo").  This creates output files in the output directory named oligo.length.txt.

Force findMotifsGenome.pl to re-preparse genome for the given region size ("-preparse").
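When running many analyses, it can help to assemble the command line in a script.  The following Python sketch builds an argument list using only the options described above ("-size", "-len", "-S", "-bg", "-h"); build_command is a hypothetical helper, and the file names are the ones from the example at the top of this page.

```python
import shlex


def build_command(peak_file, genome, out_dir, size=None, lengths=None,
                  n_motifs=None, background=None, hypergeometric=False):
    """Return the findMotifsGenome.pl argument list for the given settings."""
    cmd = ["findMotifsGenome.pl", peak_file, genome, out_dir]
    if size is not None:
        cmd += ["-size", str(size)]
    if lengths:
        cmd += ["-len", ",".join(str(n) for n in lengths)]
    if n_motifs is not None:
        cmd += ["-S", str(n_motifs)]
    if background:
        cmd += ["-bg", background]
    if hypergeometric:
        cmd.append("-h")
    return cmd


cmd = build_command("ERpeaks.txt", "hg18r", "ER_MotifOutput/",
                    size=200, lengths=[8, 10, 12], n_motifs=50)
print(shlex.join(cmd))
# prints: findMotifsGenome.pl ERpeaks.txt hg18r ER_MotifOutput/ -size 200 -len 8,10,12 -S 50
```

The list can then be passed to subprocess.run(cmd, check=True) on a machine where HOMER is installed and the genome is configured.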

How findMotifsGenome.pl works

The program goes through a series of steps to find quality motifs:
  1. Extract sequences from the genome corresponding to the peaks in the input file
  2. Remove sequences with >70% Ns
  3. Calculate the CpG/GC content of input sequences
  4. (If not done during a previous run) Preparse genome for control fragments of the specified size
  5. Randomly select background sequences matching CpG characteristics of input sequences
  6. Perform de novo motif finding
  7. Generate output files for de novo motif finding
  8. Check enrichment of known motifs
  9. Generate output files for known motif enrichment
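Steps 2 and 3 above can be sketched in a few lines of Python.  This is an illustration only: the helper names are hypothetical, and the definitions used here (GC content as the fraction of G and C among unambiguous bases, CpG content as CG dinucleotides per dinucleotide position) are assumptions that may differ from what HOMER computes internally.

```python
def fraction_n(seq):
    """Fraction of the sequence made up of N; empty sequences count as all N."""
    return seq.upper().count("N") / len(seq) if seq else 1.0


def gc_content(seq):
    """Fraction of G and C among the unambiguous bases (assumed definition)."""
    s = seq.upper()
    acgt = sum(s.count(base) for base in "ACGT")
    return (s.count("G") + s.count("C")) / acgt if acgt else 0.0


def cpg_content(seq):
    """CG dinucleotides per dinucleotide position (assumed definition)."""
    s = seq.upper()
    return s.count("CG") / (len(s) - 1) if len(s) > 1 else 0.0


seqs = {"peak1": "ACGCGTTAGC", "peak2": "NNNNNNNNAC"}

# Step 2: drop sequences that are more than 70% N
kept = {name: s for name, s in seqs.items() if fraction_n(s) <= 0.7}

# Step 3: sequence content of what remains
for name, s in kept.items():
    print(name, round(gc_content(s), 3), round(cpg_content(s), 3))
```

Background sequences would then be chosen so that their content values follow the same distribution as the target sequences (step 5), which this sketch does not attempt.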

Interpreting motif finding results

The format of the output files generated by findMotifsGenome.pl is identical to that of the files generated by the promoter-based version findMotifs.pl (description).

In general, when analyzing ChIP-Seq / ChIP-Chip peaks you should expect to see strong enrichment for a motif resembling the site recognized by the DNA binding domain of the factor you are studying.  Enrichment p-values reported by HOMER should be highly significant (i.e. << 1e-50).  If this is not the case, there is a strong possibility that the experiment may have failed in one way or another.  For example, the peaks could be of low quality because the factor is not expressed at a high level.

Practical Tips for Motif finding

Command-line options for findMotifsGenome.pl

    Program will find de novo and known motifs in regions in the genome

    Usage: findMotifsGenome.pl <pos file> <genome> <output directory> [additional options]
    Example: findMotifsGenome.pl peaks.txt mm8r peakAnalysis -size 200 -len 8

    Basic options:
        -bg <background position file> (genomic positions to be used as background, default=automatic)
        -len <#>[,<#>,<#>...] (motif length, default=10) [NOTE: values greater 12 may cause the program
            to run out of memory - in these cases decrease the number of sequences analyzed (-N)]
        -size <#> (fragment size to use for motif finding, default=200)
        -S <#> (Number of motifs to optimize, default: 100)
        -mis <#> (global optimization: searches for strings with # mismatches, default: 2)
        -depth [low|med|high|allnight] (time spent on local optimization default: med)

    Scanning sequence for motifs
        -find <motif file> (This will cause the program to only scan for motifs)

    Known Motif Options
        -mcheck <motif file> (known motifs to check against de novo motifs,
            default: /bioinformatics/homer/data/knownTFs/all.motifs
        -mknown <motif file> (known motifs to check for enrichment,
            default: /bioinformatics/homer/data/knownTFs/known.motifs

    Sequence normalization options:
        -tss (normalize based on distance from TSS)
        -cgtss (normalize based on CpG content and distance from TSS)
        -cg DEFAULT (normalize based on CpG content)
        -noweight (no CG correction)

    Advanced options:
        -h (use hypergeometric for p-values, binomial is default)
        -N <#> (Number of sequences to use for motif finding, default=max(50k, 2x input)
        -noforce (will attempt to reuse sequence files etc. that are already in output directory)
        -local <#> (use local background, # of equal size regions around peaks to use i.e. 2)
        -gc (use GC% instead of CpG% for sequence content normalization [NOT WORKING...]
        -noknown (don't search for known motif enrichment)
        -nocheck (don't search for de novo vs. known motif similarity)
        -nomotif (don't search for de novo motif enrichment)
        -norevopp (don't search reverse strand for motifs)
        -redundant <#> (Remove redundant sequences matching greater than # percent, i.e. -redundant 0.5)
        -float (allow Homer to adjust the degeneracy threshold for known motifs to get best p-value[dangerous])
        -mask <motif file1> [motif file 2]... (motifs to mask before motif finding)
        -refine <motif file1> (motif to optimize)
        -rand (randomize target and background sequences labels)
        -ref <peak file> (use file for target and background - first argument is list of peak ids for targets)
        -oligo (perform analysis of individual oligo enrichment)
        -dumpFasta (Dump fasta files for target and background sequences for use with other programs)
        -preparse (force new background files to be created)
