Software for motif discovery and ChIP-Seq analysis

This is the old version of the documentation: New Version

Alignment of High-throughput Sequencing Data

Homer does not perform alignment - this is something that must be done before running homer.  Several quality tools are available for alignment of short reads to large genomes.  Check out this link for a list of programs that do short read alignment.  BLAST, BLAT, and other traditional alignment programs, while great at what they do, are not practical for alignment of these types of data.

If you need help deciding on a program to use, I'll recommend bowtie (it's nice and fast).

If you have a core that maps your data for you, don't worry about this step.  However, in many cases there is public data available that hasn't been mapped to the genome or mapped to a different version of the genome or mapped with different parameters.  In these cases it is nice to be able to map data yourself to keep a nice, consistent set of data for analysis.

Which reference genome (version) should I map my reads to?

Both the organism and the exact version (i.e. hg18, hg19) are very important when mapping sequencing reads.  Reads mapped to one version are NOT interchangeable with reads mapped to a different version. I would follow this recommendation list when choosing a genome (Obviously try to match species or sub species when selecting a genome):
  1. Do you have a favorite genome in the lab that already has a bunch of experiments mapped to it?  Use that one.
  2. Do any of your collaborators have a favorite genome?
  3. Use the latest stable release - I would recommend using genomes curated at UCSC so that you can easily visualize your data later using the UCSC Genome Browser.  (i.e. mm9, hg18)

Should I trim my reads when mapping to the genome?

Depends.  In the old days, the read quality dropped off quite a bit past ~30 bp, but these days even the end of sequencing reads are pretty high quality.  In the end, I would recommend mapping ~32 bp reads with up the 3 mismatches, using only the uniquely alignable reads for downstream analysis.  That will give you access to probably 80-90% of what is interesting in your data set.

Example - Alignment with bowtie:

Step 1 - Build Index (takes a while, but only do this once):

After installing bowtie, the reference genome must first be "indexed" so that reads may be quickly aligned.  You can download pre-made indecies from the bowtie website (check for those here first).  Otherwise, to perform make your own from FASTA files, do the following:
  1. Download FASTA files for the unmasked genome of interest if you haven't already (i.e. from UCSC)
  2. From the directory containing the FASTA files, run the "bowtie-build" command.  For example, for hg18:
    • /path-to-bowtie-programs/bowtie-build chr1.fa,chr2.fa,chr3.fa,...chrY.fa,chrM.fa hg18
    • Where ... are the rest of the *.fa files.  This command will take a long time to run, but will produce several files named hg18.*.ebwt
  3. Copy the *.ebwt files to the bowtie indexes directory so that bowtie knows where to find them later:
    • cp *.ebwt /path-to-bowtie-programs/indexes/

Step 2 - Align sequences with bowtie (perform for each experiment):

The most common output format for high-throughput sequencing is FASTQ format, which contains information about the sequence (A,C,G,Ts) and quality information which describes how certain the sequencer is of the base calls that were made.  In the case of Illumina sequencing, the output is usually a "s_1_sequence.txt" file.  In addition, much of the data available in the SRA, the primary archive of high-throughput sequencing data, is in this format.  To map this data, run the following command:

/path-to-bowtie-programs/bowtie -q --best -m 1 -p <# cpu> <genome> <fastq file> <output filename>
Where <genome> would be hg18 from the index made above, <fastq file> could be "s_1_sequence.txt", and <output filename> something like "s_1_sequence.hg18.alignment.txt"

The parameters "--best" and "-m 1" are needed to make sure bowtie outputs only unique alignments.  There are many options and many different ways to perform alignments, with different trade-offs for different types of projects - well beyond the scope of what I am describing here.

NOTE: HOMER contains automated parsing for uniquely aligned reads from output files generated with bowtie in this fashion.  Homer also accepts *eland_result.txt and *_export.txt formats from the Illumina pipeline.  If different programs are used, or special parsing of output files are needed, please parse/reformat alignment files to general BED format, which is also accepted by HOMER.

Back to ChIP-Seq Analysis 

Can't figure something out? Questions, comments, concerns, or other feedback: