Software for motif discovery and ChIP-Seq analysis
This is the old version of the documentation (a newer version is available).
Alignment of High-throughput Sequencing Data
HOMER does not perform alignment - this is something that must be done before running HOMER. Several quality tools are available for alignment of short reads to large genomes. Check out this link for a list of programs that do short read alignment. BLAST, BLAT, and other traditional alignment programs, while great at what they do, are not practical for alignment of these types of data.
If you need help deciding on a program to use, I'll recommend bowtie (it's nice and fast).
If you have a core that maps your data for you, don't worry about this step. However, in many cases there is public data available that hasn't been mapped to the genome, or that was mapped to a different version of the genome or with different parameters. In these cases it is nice to be able to map the data yourself so that you keep a consistent set of data for analysis.
Which reference genome (version) should I map my reads to?
Both the organism and the exact version (e.g. hg18, hg19) are very important when mapping sequencing reads. Reads mapped to one version are NOT interchangeable with reads mapped to a different version. I would follow this recommendation list when choosing a genome (obviously, try to match the species or subspecies when selecting a genome):
Should I trim my reads when mapping to the genome?
Depends. In the old days, read quality dropped off quite a bit past ~30 bp, but these days even the ends of sequencing reads are pretty high quality. In the end, I would recommend mapping ~32 bp reads with up to 3 mismatches, using only the uniquely alignable reads for downstream analysis. That will give you access to probably 80-90% of what is interesting in your data set.
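If your reads are longer than ~32 bp, one way to trim them before mapping is a small awk filter. This is only a sketch: trim_fastq is a made-up helper name (not part of HOMER or bowtie), the file names are placeholders, and it assumes a plain FASTQ file with exactly 4 lines per read.

```shell
# trim_fastq: hypothetical helper, not part of HOMER or bowtie.
# Keeps the first N bases of each read and the matching N quality values.
# Assumes exactly 4 lines per FASTQ record (no wrapped sequence lines).
#   $1 = number of bases to keep, $2 = input FASTQ file
trim_fastq() {
    awk -v n="$1" '
        NR % 4 == 2 || NR % 4 == 0 { print substr($0, 1, n); next }  # sequence and quality lines
        { print }                                                     # "@" header and "+" lines
    ' "$2"
}

# Example usage (placeholder file names):
#   trim_fastq 32 s_1_sequence.txt > s_1_sequence.32bp.txt
```

bowtie 1 also has its own trimming and mismatch options (--trim3 and -v), so it is worth checking the manual for your version before trimming by hand.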
Example - Alignment with bowtie:
Step 1 - Build Index (takes a while, but only do this once):
After installing bowtie, the reference genome must first be "indexed" so that reads may be quickly aligned. You can download pre-made indexes from the bowtie website (check for those here first). Otherwise, to make your own from FASTA files, do the following:
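As a rough sketch of the indexing step: bowtie-build takes a comma-separated list of FASTA files followed by a base name for the index. The directory hg18_fasta/ and the helper join_fasta_list below are assumptions made up for illustration; the index name hg18 matches the one used in the alignment step.

```shell
# join_fasta_list: hypothetical helper that joins file names with commas,
# the form bowtie-build expects when given several reference files.
join_fasta_list() {
    printf '%s\n' "$@" | paste -s -d , -
}

# Example usage (placeholder paths; assumes one FASTA file per chromosome):
#   fasta_list=$(join_fasta_list hg18_fasta/*.fa)
#   /path-to-bowtie-programs/bowtie-build "$fasta_list" hg18
```

The second argument to bowtie-build (hg18 here) is the name you later pass to bowtie as the genome.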
Step 2 - Align sequences with bowtie (perform for each experiment):
The most common output format for high-throughput sequencing is FASTQ format, which contains information about the sequence (A,C,G,Ts) and quality information which describes how certain the sequencer is of the base calls that were made. In the case of Illumina sequencing, the output is usually a "s_1_sequence.txt" file. In addition, much of the data available in the SRA, the primary archive of high-throughput sequencing data, is in this format. To map this data, run the following command:
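For orientation, each read in a FASTQ file normally takes up 4 lines: a header starting with "@", the sequence, a "+" separator line, and the quality string. A quick sanity check before mapping is to count the reads; count_fastq_reads is a made-up helper name, and the sketch assumes the simple 4-lines-per-read layout.

```shell
# A FASTQ record (illustrative values only):
#   @read_1
#   ACGTACGTACGT
#   +
#   IIIIIIIIIIII

# count_fastq_reads: hypothetical helper, assumes exactly 4 lines per read.
#   $1 = FASTQ file
count_fastq_reads() {
    awk 'END { print int(NR / 4) }' "$1"
}

# Example usage (placeholder file name):
#   count_fastq_reads s_1_sequence.txt
```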
/path-to-bowtie-programs/bowtie -q --best -m 1 -p <# cpu> <genome> <fastq file> <output filename>
Where <genome> would be hg18 from the index made above, <fastq file> could be "s_1_sequence.txt", and <output filename> something like "s_1_sequence.hg18.alignment.txt"
The parameters "--best" and "-m 1" are needed to make sure bowtie outputs only unique alignments. There are many options and many different ways to perform alignments, with different trade-offs for different types of projects - well beyond the scope of what I am describing here.
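If you map many experiments, it can help to wrap the command in a small function so that every sample is aligned with the same parameters. This is a sketch only: align_unique and the BOWTIE variable are names invented here, and the file names in the usage line are placeholders.

```shell
# Path to the bowtie executable; override it if bowtie is not on your PATH,
# e.g. BOWTIE=/path-to-bowtie-programs/bowtie
BOWTIE="${BOWTIE:-bowtie}"

# align_unique: hypothetical wrapper around the command shown above.
#   $1 = index base name (e.g. hg18)   $2 = FASTQ file
#   $3 = output file                   $4 = number of CPUs (defaults to 1)
align_unique() {
    "$BOWTIE" -q --best -m 1 -p "${4:-1}" "$1" "$2" "$3"
}

# Example usage (placeholder file names):
#   align_unique hg18 s_1_sequence.txt s_1_sequence.hg18.alignment.txt 4
```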
NOTE: HOMER contains automated parsing for uniquely aligned reads from output files generated with bowtie in this fashion. HOMER also accepts *eland_result.txt and *_export.txt formats from the Illumina pipeline. If different programs are used, or special parsing of output files is needed, please parse/reformat alignment files to general BED format, which is also accepted by HOMER.
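To give an idea of what such reformatting looks like, here is a sketch that turns bowtie's default tab-delimited output into 6-column BED. HOMER already reads bowtie output directly, so this is purely an illustration; for other aligners the column numbers will differ. bowtie_to_bed is a made-up helper name, and the sketch assumes the bowtie 1 default layout (read name, strand, chromosome, 0-based start, sequence, ...).

```shell
# bowtie_to_bed: hypothetical helper, not part of HOMER or bowtie.
# Assumes bowtie 1 default output columns:
#   1 = read name, 2 = strand, 3 = reference name,
#   4 = 0-based leftmost position, 5 = read sequence
# Writes BED columns: chr, start, end, name, score (fixed at 0), strand
#   $1 = bowtie alignment file
bowtie_to_bed() {
    awk 'BEGIN { FS = OFS = "\t" }
         { print $3, $4, $4 + length($5), $1, 0, $2 }' "$1"
}

# Example usage (placeholder file names):
#   bowtie_to_bed s_1_sequence.hg18.alignment.txt > s_1_sequence.hg18.bed
```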