News and updates

    New releases and related tools will be announced through the Bowtie mailing list.

Related Tools

  • Cufflinks: Isoform assembly and quantitation for RNA-Seq
  • Bowtie: Ultrafast short read alignment
  • TopHat-Fusion: An algorithm for Discovery of Novel Fusion Transcripts
  • CummeRbund: Visualization of RNA-Seq differential analysis




Frequently Asked Questions

How can I control read alignment in terms of the number of mismatches, gap length, etc.?

You can use three options: --read-mismatches, --read-gap-length and --read-edit-dist. For instance, if you want read alignments with at most 2 base mismatches and no gaps, you can specify:
  --read-mismatches 2 --read-gap-length 0 --read-edit-dist 2
Or, if you want read alignments with at most 2 base mismatches and a total indel (alignment gap) length of at most 3 bp, you can use:
  --read-mismatches 2 --read-gap-length 3 --read-edit-dist 3
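
To make the interaction of the three limits concrete, here is a minimal Python sketch. It is not TopHat's code: the function name passes_filters is hypothetical, and the sketch assumes that the edit distance of an alignment can be approximated as its mismatch count plus its total gap length.

```python
def passes_filters(mismatches, gap_length, *, max_mismatches,
                   max_gap_length, max_edit_dist):
    """Return True if an alignment satisfies all three limits.

    Assumption (not taken from TopHat's source): the edit distance is
    approximated as mismatches + total gap length.
    """
    edit_dist = mismatches + gap_length
    return (mismatches <= max_mismatches
            and gap_length <= max_gap_length
            and edit_dist <= max_edit_dist)


# Limits equivalent to:
#   --read-mismatches 2 --read-gap-length 3 --read-edit-dist 3
limits = dict(max_mismatches=2, max_gap_length=3, max_edit_dist=3)
print(passes_filters(1, 2, **limits))  # True: 1 mismatch + 2 bp gap = 3
print(passes_filters(2, 3, **limits))  # False: 2 + 3 = 5 exceeds 3
```

Under this assumption, the edit distance limit caps the combined total, so an alignment can pass the mismatch and gap limits individually and still be rejected.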

How can I maximize the accuracy of spliced mapping in TopHat?

In real RNA-seq samples, we found that during TopHat's genome mapping step a large fraction of reads spanning several exons can be incorrectly aligned to processed pseudogenes that are rarely, if ever, transcribed or expressed, instead of to the genes from which they originate. You can use either of the options below to improve the accuracy of spliced mapping in TopHat:
  • If a good gene annotation is available (as is the case for the human genome), use it with the -G option.
  • For poorly annotated genomes, consider using the "--read-realign-edit-dist 0" option. The realignment option lets you remap some (or all) of the mapped reads: any read whose mapping edit distance is equal to or above the user-specified "remapping" edit distance (see the --read-realign-edit-dist option) is realigned. Setting "--read-realign-edit-dist 0" maps every read against the transcriptome, the genome, and the splice variants (or splice junctions) detected by TopHat, regardless of whether the read was already mapped in an earlier step. This remapping strategy handles the "pseudogene" problem effectively. If your genome contains processed pseudogenes and you cannot provide a good gene annotation to TopHat, consider this option for more accurate mapping results.
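
The choice between the two options can be sketched as a small Python helper that assembles a TopHat command line. This is only an illustration: the function name tophat_command and all file names are hypothetical, and the sketch assumes the usual invocation order of options, then the index, then the read files.

```python
def tophat_command(index, reads, annotation_gtf=None):
    """Build a TopHat argument list for one of the two strategies.

    With an annotation file, pass it via -G; otherwise fall back to
    realigning every read with --read-realign-edit-dist 0.
    """
    cmd = ["tophat"]
    if annotation_gtf is not None:
        cmd += ["-G", annotation_gtf]
    else:
        cmd += ["--read-realign-edit-dist", "0"]
    cmd.append(index)
    cmd.extend(reads)
    return cmd


# Hypothetical file names, for illustration only.
reads = ["reads_1.fq", "reads_2.fq"]
print(" ".join(tophat_command("genome_index", reads, "genes.gtf")))
print(" ".join(tophat_command("genome_index", reads)))
```

The resulting list can be passed to subprocess.run once the paths point at real files.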

I don't know the mate inner distance (-r/--mate-inner-dist) for my paired reads. What value should I use?

The default value should work fine in most cases for typical paired-end RNA-Seq experiments, because TopHat internally allows some variance in this distance. TopHat uses the mate inner distance in several places, for instance when finding splice sites and fusion break points. It is also taken into account when choosing the best candidate alignments for paired reads in the final stage of TopHat (tophat_reports).

If you want a good approximation of this distance for your reads, you can run Bowtie2 on a small subset of the paired reads (both mates) and look at their mapped positions (we hope to add automatic fragment length detection in a future version of TopHat). The SAM output of Bowtie2 is especially helpful here: the 9th field of each alignment line shows the estimated fragment length, from which you subtract twice the read length to get the "inner distance" to use with the -r parameter. Ignore large absolute values in that field, because this estimate should only consider mates aligned to the same exon.
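
The estimate described above can be sketched in Python as follows. This is a minimal sketch under stated assumptions, not part of TopHat or Bowtie2: the function name estimate_inner_distance, the max_fragment cutoff of 1000 bp used to discard pairs that likely span introns, and the use of the median are all illustrative choices.

```python
import statistics


def estimate_inner_distance(sam_lines, read_length, max_fragment=1000):
    """Estimate the mate inner distance from SAM alignment lines.

    Uses the 9th SAM field (fragment length). Only positive values are
    kept so each pair is counted once; values above max_fragment are
    ignored as likely intron-spanning pairs (assumed cutoff).
    """
    lengths = []
    for line in sam_lines:
        if line.startswith("@"):  # skip SAM header lines
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 9:
            continue
        fragment = int(fields[8])
        if 0 < fragment <= max_fragment:
            lengths.append(fragment)
    if not lengths:
        raise ValueError("no usable paired alignments found")
    return round(statistics.median(lengths)) - 2 * read_length


# Tiny made-up example with 75 bp reads.
example = [
    "@HD\tVN:1.0",
    "r1\t99\tchr1\t100\t42\t75M\t=\t275\t250\t*\t*",
    "r1\t147\tchr1\t275\t42\t75M\t=\t100\t-250\t*\t*",
    "r2\t99\tchr1\t500\t42\t75M\t=\t685\t260\t*\t*",
    "r3\t99\tchr1\t900\t42\t75M\t=\t1065\t240\t*\t*",
    "r4\t99\tchr1\t2000\t42\t75M\t=\t51925\t50000\t*\t*",
]
print(estimate_inner_distance(example, read_length=75))  # 100
```

In the example, the usable fragment lengths are 250, 260 and 240, so the median is 250 and the inner distance is 250 - 2 x 75 = 100.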

I am not sure which library type to use (fr-firststrand or fr-secondstrand). What should I do?

One way to determine the correct library type is to run TopHat twice on a small subset of the reads (e.g., 1M):

  1. Run TopHat with fr-firststrand and count the number of junctions in junctions.bed (one of the output files from TopHat).
  2. Run TopHat with fr-secondstrand and count the number of junctions in junctions.bed.

Because TopHat's splice junction finding algorithm makes use of library-type information (if provided), one of the two runs should yield many more splice junctions than the other. Use the library type that gives more junctions. If neither run clearly yields more junctions, TopHat might not work well with your sequencing protocol. Please let us know more details about your protocol so we can add support for new library types.
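
The comparison of the two runs can be sketched in Python as below. The function names are hypothetical, and the min_ratio threshold of 2.0 is an arbitrary assumption standing in for "many more"; the sketch also assumes that any line starting with "track", "browser" or "#" in junctions.bed is a header rather than a junction.

```python
def count_junctions(bed_path):
    """Count junction records in a BED file, skipping header lines."""
    count = 0
    with open(bed_path) as handle:
        for line in handle:
            if not line.strip():
                continue
            if line.startswith(("track", "browser", "#")):
                continue
            count += 1
    return count


def pick_library_type(n_first, n_second, min_ratio=2.0):
    """Pick the library type whose run found clearly more junctions.

    Returns None when neither count exceeds the other by min_ratio
    (an assumed threshold), i.e. when the result is inconclusive.
    """
    if n_first >= min_ratio * max(n_second, 1):
        return "fr-firststrand"
    if n_second >= min_ratio * max(n_first, 1):
        return "fr-secondstrand"
    return None


print(pick_library_type(12000, 900))   # fr-firststrand
print(pick_library_type(1000, 1100))   # None
```

In practice, the two counts would come from calling count_junctions on the junctions.bed file in each run's output directory.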

What should I do if I see a message like "Too many open files"?

This usually happens when the "-p" option is used with a large value (many threads). TopHat may produce many intermediate files, in a number proportional to this value, and that number can exceed the maximum number of files a process is allowed to open. The solution is to raise the limit to a higher number (e.g. 10000). On a Mac, you can change this with the command "sudo sysctl -w kern.maxfiles=10240".
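
On Unix-like systems, the per-process limit can also be inspected and raised from a Python wrapper script before launching TopHat, similar to what "ulimit -n" does in a shell. The following is a minimal sketch using Python's standard resource module; the function name raise_open_file_limit is hypothetical, and the sketch can only raise the soft limit up to the existing hard limit.

```python
import resource


def raise_open_file_limit(target=10000):
    """Raise the soft limit on open files toward target.

    The soft limit cannot exceed the hard limit, so target is capped
    there. Returns the soft limit in effect afterwards.
    """
    soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    if hard != resource.RLIM_INFINITY:
        target = min(target, hard)
    # Only raise the limit; never lower an already sufficient one.
    if soft != resource.RLIM_INFINITY and soft < target:
        resource.setrlimit(resource.RLIMIT_NOFILE, (target, hard))
    return resource.getrlimit(resource.RLIMIT_NOFILE)[0]
```

The new limit applies to the calling process and is inherited by child processes, so it only helps if TopHat is started from the same script (for example via subprocess.run) after the call.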