Software for motif discovery and next-gen sequencing analysis

Also check out this page in the Basic NGS Tutorial (It's more up-to-date)

Alignment of High-throughput Sequencing Data

Homer does not perform alignment - this is something that must be done before running homer.  Several quality tools are available for alignment of short reads to large genomes.  Check out this link for a list of programs that do short read alignment.  BLAST, BLAT, and other traditional alignment programs, while great at what they do, are not practical for alignment of these types of data.

If you need help deciding on a program to use, I'll recommend Bowtie (it's nice and fast).

If you have a core that maps your data for you, don't worry about this step.  However, in many cases there is public data available that hasn't been mapped to the genome or mapped to a different version of the genome or mapped with different parameters.  In these cases it is nice to be able to map data yourself to keep a nice, consistent set of data for analysis.

Most types of ChIP-Seq/DNase-Seq/MNase-Seq and GRO-Seq simply need to be mapped to the genome, as they represent the sequencing of genomic DNA (or nascent RNA, which should not be spliced yet).  If analyzing RNA-Seq, you may be throwing away interesting information about splicing if you simply align the data to the genome.  If aligning RNA, I'd recommend sticking to the formal wear and trying Tophat, which does a good job of identifying splice junctions in your data.

Which reference genome (version) should I map my reads to?

Both the organism and the exact version (i.e. hg18, hg19) are very important when mapping sequencing reads.  Reads mapped to one version are NOT interchangeable with reads mapped to a different version. I would follow this recommendation list when choosing a genome (Obviously try to match species or sub species when selecting a genome):
    1. Do you have a favorite genome in the lab that already has a bunch of experiments mapped to it?  Use that one.
    2. Do any of your collaborators have a favorite genome?
    3. Use the latest stable release - I would recommend using genomes curated at UCSC so that you can easily visualize your data later using the UCSC Genome Browser.  (i.e. mm9, hg18)

Q: I'm changing genome versions, can I just "liftover" my data using UCSC liftover tool, or do I need to remap it to the new genome version? 

If you want to do it right, you need to remap it.  This is because some regions of the genome that are considered "unique" in one version may suddenly be found multiple times in the new version and vice versa, so using the liftover tool will yield different results from remapping.  However, liftover is fine if you're looking for a quick and dirty solution. If you fell like cheating, as Chuck often does, try convertCoordinates.pl. - it's a wrapper that uses the "liftOver" program to migrate peak files and whole Tag Directores.

Should I trim my reads when mapping to the genome?

Depends.  In the old days, the read quality dropped off quite a bit past ~30 bp, but these days even the end of sequencing reads are pretty high quality.  In the end, I would recommend mapping ~32 bp reads with up the 3 mismatches, using only the uniquely alignable reads for downstream analysis.  That will give you access to probably 80-90% of what is interesting in your data set.

I have barcodes and/or adpater sequences in my reads.  Should I remove them first or just map them?

You should definitely remove the adapter sequences or other "non-biological" sequences before mapping.  Various tools can accomplish this.  You can check out homerTools for trimming sequences and dealing with adaptersGalaxy also has a nice variety of tools for accomplishing this type of stuff.

Example - Alignment with bowtie:

Step 1 - Build Index (takes a while, but only do this once):

After installing bowtie, the reference genome must first be "indexed" so that reads may be quickly aligned.  You can download pre-made indices from the bowtie website (check for those here first).  Otherwise, to perform make your own from FASTA files, do the following:
  1. Download FASTA files for the unmasked genome of interest if you haven't already (i.e. from UCSC)
  2. From the directory containing the FASTA files, run the "bowtie-build" command.  For example, for hg18:
    • /path-to-bowtie-programs/bowtie-build chr1.fa,chr2.fa,chr3.fa,...chrY.fa,chrM.fa hg18
    • Where ... are the rest of the *.fa files.  This command will take a long time to run, but will produce several files named hg18.*.ebwt
  3. Copy the *.ebwt files to the bowtie indexes directory so that bowtie knows where to find them later:
    • cp *.ebwt /path-to-bowtie-programs/indexes/

Step 2 - Align sequences with bowtie (perform for each experiment):

The most common output format for high-throughput sequencing is FASTQ format, which contains information about the sequence (A,C,G,Ts) and quality information which describes how certain the sequencer is of the base calls that were made.  In the case of Illumina sequencing, the output is usually a "s_1_sequence.txt" file.  In addition, much of the data available in the SRA, the primary archive of high-throughput sequencing data, is in this format.  To map this data, run the following command:

/path-to-bowtie-programs/bowtie -q --best -m 1 -p <# cpu> <genome> <fastq file> <output filename>
Where <genome> would be hg18 from the index made above, <fastq file> could be "s_1_sequence.txt", and <output filename> something like "s_1_sequence.hg18.alignment.txt"

The parameters "--best" and "-m 1" are needed to make sure bowtie outputs only unique alignments.  There are many options and many different ways to perform alignments, with different trade-offs for different types of projects - well beyond the scope of what I am describing here.

NOTE: HOMER contains automated parsing for uniquely aligned reads from output files generated with bowtie in this fashion.  Homer also accepts *eland_result.txt and *_export.txt formats from the Illumina pipeline.  If different programs are used, or special parsing of output files are needed, please parse/reformat alignment files to general BED format, which is also accepted by HOMER.   HOMER also accepts SAM formatted file.  If using BAM files, use "samtools view input.bam > output.sam" to convert to a SAM file.

Can't figure something out? Questions, comments, concerns, or other feedback: