News and updates
New releases and related tools will be announced through the Bowtie mailing list.
Langmead B, Trapnell C, Pop M, Salzberg SL. Ultrafast and memory-efficient alignment of short DNA sequences to the human genome. Genome Biology 2009, 10:R25.
Kim D and Salzberg SL. TopHat-Fusion: an algorithm for discovery of novel fusion transcripts. Genome Biology 2011, 12:R72.
Kim D, Pertea G, Trapnell C, Pimentel H, Kelley R, Salzberg SL. TopHat2: accurate alignment of transcriptomes in the presence of insertions, deletions and gene fusions. Genome Biology 2013, 14:R36.
Frequently Asked Questions
- How can I control the alignment of reads in terms of number of mismatches, gap length, etc.?
- How can I maximize the accuracy of spliced mapping in TopHat?
- I don't know the mate inner distance (-r/--mate-inner-dist option) for my paired reads, what value should I use?
- I am not sure which library type to use (fr-firststrand or fr-secondstrand), what should I do?
- What should I do if I see a message like "Too many open files"?
How can I control the alignment of reads in terms of number of mismatches, gap length, etc.?
You can use three options: --read-mismatches, --read-gap-length and --read-edit-dist. For instance, if you want read alignments with at most 2 base mismatches and no gaps then you can specify:
--read-mismatches 2 --read-gap-length 0 --read-edit-dist 2
Or, if you want read alignments with a total indel (alignment gap) length of at most 3 bp and at most 2 base mismatches, you can use these options:
--read-mismatches 2 --read-gap-length 3 --read-edit-dist 3
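When TopHat is launched from a script, the three options can be assembled programmatically. The sketch below is only an illustration: `alignment_flags` is a hypothetical helper, not part of TopHat, and the default of setting the edit distance to the larger of the other two limits is an assumption inferred from the two examples above rather than a documented rule.

```python
def alignment_flags(max_mismatches, max_gap_length, max_edit_dist=None):
    """Build the three TopHat read-alignment options as command-line tokens.

    Hypothetical helper for illustration; only the option names come from TopHat.
    """
    if max_edit_dist is None:
        # Assumption: use the smallest edit distance that does not undercut
        # either of the other two limits (this matches both examples above).
        max_edit_dist = max(max_mismatches, max_gap_length)
    return [
        "--read-mismatches", str(max_mismatches),
        "--read-gap-length", str(max_gap_length),
        "--read-edit-dist", str(max_edit_dist),
    ]


# At most 2 mismatches and no gaps
print(" ".join(alignment_flags(2, 0)))
# At most 2 mismatches and at most 3 bp of indels
print(" ".join(alignment_flags(2, 3)))
```

The returned list can be appended to the rest of a TopHat argument list before the command is run.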
How can I maximize the accuracy of spliced mapping in TopHat?
Based on real RNA-seq samples, we found that in the genome mapping step of TopHat a high proportion of reads spanning several exons can be incorrectly aligned to processed pseudogenes, which are rarely (if ever) transcribed or expressed, instead of to the genes they originate from. You can use either of the options below to improve the accuracy of spliced mapping in TopHat:
- If a good gene annotation is available (as is the case with the human genome), use it with the -G option.
- For poorly annotated genomes you might want to consider using the "--read-realign-edit-dist 0" option.
With the realignment option, users can choose to remap some (or all) of the
mapped reads whose mapping edit distance is equal to or above a user-specified
"remapping" edit distance (see the --read-realign-edit-dist option). Setting
"--read-realign-edit-dist 0" maps every read against the transcriptome, the
genome, and the splice variants (or splice junctions) detected by TopHat,
regardless of whether the read was already mapped in an earlier mapping step.
This remapping strategy handles the "pseudogene" problem effectively. If your
genome contains processed pseudogenes and you cannot provide a good gene
annotation to TopHat, consider using this option for accurate mapping results.
I don't know the mate inner distance (-r/--mate-inner-dist) for my paired reads, what value should I use?
The default value should work fine in most cases for typical paired-end
RNA-Seq experiments, because TopHat allows some variance for this distance.
TopHat makes use of the mate inner distance in several places, for instance
when finding splice sites and fusion break points. It is also taken into
account when choosing the best candidate alignments for paired reads in the
final stage of TopHat (tophat_reports). If you want a good approximation of
this distance for your reads, you can run Bowtie2 on a small subset of the
paired reads (both mates) and look at their mapped positions (we hope to add
automatic fragment length detection in a future version of TopHat). The SAM
output of Bowtie2 for paired reads is especially helpful: the 9th field of each
SAM alignment line shows the estimated fragment length, from which you subtract
twice the read length to get the "inner distance" value to use with the -r
parameter. Ignore large absolute values in that field, because for this
estimate we only want to consider mates aligned to the same exon.
I am not sure which library type to use (fr-firststrand or fr-secondstrand), what should I do?
One possible way to figure out the correct library-type is to run TopHat with a small subset of the reads (e.g., 1M) as follows.
- run TopHat with fr-firststrand and count the number of junctions in junctions.bed (one of the output files from TopHat)
- run TopHat with fr-secondstrand and count the number of junctions in junctions.bed
Since the splice junction finding algorithm of TopHat makes use of library-type information (if provided), one of the two TopHat runs should yield many more splice junctions than the other. You can then use the library type that gives more junctions. If this is not the case, TopHat might not work well with your sequencing protocol. Please let us know more details about your protocol so we can add support for new library types.
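The comparison of the two runs can be scripted. The following is a minimal sketch, assuming that each junction occupies one line of junctions.bed and that header lines begin with "track", "browser" or "#". The function names and the `min_ratio` threshold of 2.0 (a stand-in for "many more") are hypothetical choices made for illustration.

```python
def count_junctions(bed_lines):
    """Count junction records in the lines of a junctions.bed file."""
    count = 0
    for line in bed_lines:
        line = line.strip()
        # Skip blank lines and BED header/comment lines.
        if not line or line.startswith(("track", "browser", "#")):
            continue
        count += 1
    return count


def pick_library_type(firststrand_count, secondstrand_count, min_ratio=2.0):
    """Return the library type with clearly more junctions, or None.

    min_ratio is an assumed threshold, not a value defined by TopHat.
    """
    high = max(firststrand_count, secondstrand_count)
    low = min(firststrand_count, secondstrand_count)
    if high == 0 or high < min_ratio * low:
        return None  # no clear winner; the protocol may not be supported
    if firststrand_count > secondstrand_count:
        return "fr-firststrand"
    return "fr-secondstrand"
```

To use it, open the junctions.bed file from each of the two output directories, pass each file handle to `count_junctions`, and give the two counts to `pick_library_type`. A result of `None` corresponds to the case where neither run stands out.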
What should I do if I see a message like "Too many open files"?
This usually happens when the "-p" option is used with a large value (many threads). TopHat may produce many intermediate files, the number of which is proportional to this value, and that number can exceed the maximum number of files a process is allowed to open. The solution is to raise the limit to a higher number (e.g. 10000). On a Mac, you can change it with the command "sudo sysctl -w kern.maxfiles=10240".
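If TopHat is started from a Python wrapper script on a Unix-like system, the per-process limit can be inspected, and raised up to the hard limit, with the standard `resource` module. This is a sketch under that assumption: the helper names are hypothetical, and the change only affects the current process and the child processes it launches afterwards, not a TopHat run started elsewhere.

```python
import resource


def open_file_limits():
    """Return the (soft, hard) limits on open files for this process."""
    return resource.getrlimit(resource.RLIMIT_NOFILE)


def raise_soft_limit(target=10000):
    """Raise the soft open-file limit toward target, capped by the hard limit.

    Returns the soft limit in effect afterwards.
    """
    soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    if hard != resource.RLIM_INFINITY:
        target = min(target, hard)  # cannot exceed the hard limit without privileges
    if soft != resource.RLIM_INFINITY and soft < target:
        resource.setrlimit(resource.RLIMIT_NOFILE, (target, hard))
    return resource.getrlimit(resource.RLIMIT_NOFILE)[0]


print(open_file_limits())
```

Calling `raise_soft_limit()` before launching TopHat as a subprocess lets the child inherit the higher limit. If the hard limit itself is too low, it has to be raised at the system level.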