Getting started
Downloading annotations
TopHat-Fusion requires several files:
- Download Bowtie indexes (either Bowtie1 or Bowtie2), refGene.txt, and ensGene.txt from the links provided in the table below
and extract them if necessary.
- Download human_genomic*, other_genomic*, and nt* from the BLAST database
(see the tutorial below for more details).
Running TopHat-Fusion
TopHat-Fusion is built on TopHat,
so it inherits all of TopHat's options and output formats
(refer to the TopHat website for installation and basic information).
The TopHat-Fusion algorithm is described in our poster at the CSHL Biology of Genomes conference.
TopHat-Fusion consists of two sub-programs (tophat and tophat-fusion-post).
Using RNA-Seq data from the MCF7 breast cancer cell line from Edgren et al. (Genome Biology 2011),
the following tutorial demonstrates how to use TopHat-Fusion to identify fusion genes
including three known fusions (BCAS4-BCAS3, ARFGEF2-SULF2, RPS6KB1-TMEM49).
MCF7 RNA-Seq data (paired-end 50-bp) along with BT474, SKBR3, and KPL4 can be downloaded from the table below.
Sample | Reads | -r and --mate-std-dev values |
BT474 | BT474_mix | -r 50 --mate-std-dev 80 |
SKBR3 | SKBR3_mix | -r 50 --mate-std-dev 80 |
KPL4 | SRR064287 | -r 0 --mate-std-dev 80 |
MCF7 | SRR064286 | -r 0 --mate-std-dev 80 |
To run TopHat-Fusion:
tophat -o tophat_MCF7 -p 8 --fusion-search --keep-fasta-order --bowtie1 \
--no-coverage-search -r 0 --mate-std-dev 80 --max-intron-length 100000 \
--fusion-min-dist 100000 --fusion-anchor-length 13 --fusion-ignore-chromosomes chrM \
/path/to/h_sapiens/bowtie_index SRR064286_1.fastq SRR064286_2.fastq
- Make a (top_dir) directory and run the above command under (top_dir); see the required directory structure. If you have multiple samples, you can run them all under (top_dir).
- Use tophat_(sample_name) for the output directory ("-o" option), such as tophat_MCF7.
The name after the tophat_ prefix (MCF7) will be used later for annotation purposes.
- You can change the number of threads using the "-p" option.
- Turn on the fusion algorithm (--fusion-search) and use Bowtie1 (--bowtie1).
- Turn off "coverage-search", which takes lots of memory and is slow.
- The mean fragment length of the data is 100 bp, so the inner mate distance is 0 (= 100 - 50 * 2).
In this example, we use a larger standard deviation (80 bp) for the inner mate distance
because TopHat-Fusion uses the region (mate_inner_dist ± std_dev) to discover fusions.
- In addition to inter-chromosomal fusions,
TopHat-Fusion tries to identify intra-chromosomal fusions
caused by rearrangements within a chromosome, where the two fusion points are separated by at least --fusion-min-dist.
- A read supports a fusion if it maps to both sides of the fusion by at least --fusion-anchor-length bases.
- In addition to the outputs from TopHat,
TopHat-Fusion outputs a list of potential fusions (fusions.out - the first 2,000 out of 68,168 fusions) and
a modified SAM alignment file that allows "fusion" alignments using the 'F' CIGAR operator,
although this operator is not supported by SAM tools.
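The inner mate distance arithmetic and the option list explained above can be sketched in Python. This is only an illustration under stated assumptions: `inner_mate_dist` and `tophat_fusion_args` are hypothetical helper names, not part of TopHat-Fusion, and the options simply mirror the example command for MCF7.

```python
def inner_mate_dist(mean_fragment_len, read_len):
    """Inner mate distance for paired-end reads: fragment length minus both reads."""
    return mean_fragment_len - 2 * read_len


def tophat_fusion_args(sample, index, reads_1, reads_2,
                       mate_inner_dist, mate_std_dev=80, threads=8):
    """Assemble the example tophat command as an argument list (hypothetical helper)."""
    return [
        "tophat", "-o", f"tophat_{sample}", "-p", str(threads),
        "--fusion-search", "--keep-fasta-order", "--bowtie1",
        "--no-coverage-search",
        "-r", str(mate_inner_dist), "--mate-std-dev", str(mate_std_dev),
        "--max-intron-length", "100000",
        "--fusion-min-dist", "100000",
        "--fusion-anchor-length", "13",
        "--fusion-ignore-chromosomes", "chrM",
        index, reads_1, reads_2,
    ]


# MCF7: mean fragment length 100 bp, paired-end reads of 50 bp each.
r = inner_mate_dist(100, 50)
args = tophat_fusion_args("MCF7", "/path/to/h_sapiens/bowtie_index",
                          "SRR064286_1.fastq", "SRR064286_2.fastq", r)
print(" ".join(args))
```

Assuming tophat is on the PATH, the resulting list could be passed to `subprocess.run(args, check=True)` from within (top_dir); building the list once per sample keeps the tophat_(sample_name) naming consistent.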
tophat-fusion-post -p 8 --num-fusion-reads 1 --num-fusion-pairs 2 --num-fusion-both 5 /path/to/h_sapiens/bowtie_index
- TopHat-Fusion uses BLAST search results to filter out false fusions caused by highly similar sequences or pseudogenes.
The search results can also be used for annotation when there are no known genes in the provided annotation files.
The 50-bp sequence on the left side of a fusion and the 50-bp sequence on the right side are combined into a 100-bp sequence, which is then BLASTed against the BLAST database.
If match length (range: 0 to 100) + identity percent (0 to 100) is greater than 160, the fusion is filtered out.
This BLAST step is usually done for a few hundred fusions that remain after prior filtering steps.
Thus, it is highly recommended to install BLAST and download the BLAST database as follows.
- Install BLAST binaries (blastall and blastn).
- Make a (top_dir)/blast directory,
download human_genomic*, other_genomic*, and nt* from the BLAST database, and
extract them under (top_dir)/blast.
- Use the --non-human option for genomes other than the human genome.
- The final list of fusion candidates is given in (top_dir)/tophatfusion_out/result.html.
- You may want to repeat the filtering process with different filtering parameters such as --num-fusion-reads and --num-fusion-pairs
without deleting (top_dir)/tophatfusion_out, which tophat-fusion-post uses internally as a database for fast computation.
- This program requires Bowtie1 and its index files, as it uses Bowtie1 internally, mostly for filtering purposes.
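The BLAST filtering rule described above (match length plus identity percent greater than 160) can be sketched as follows. This is a minimal illustration of the stated rule only, not the code tophat-fusion-post actually runs; the function names `junction_query` and `is_filtered_by_blast` are hypothetical.

```python
def junction_query(left_50bp, right_50bp):
    """Combine the 50-bp flanks of a fusion into the 100-bp BLAST query."""
    if len(left_50bp) != 50 or len(right_50bp) != 50:
        raise ValueError("each flank must be exactly 50 bp")
    return left_50bp + right_50bp


def is_filtered_by_blast(match_length, identity_percent, threshold=160):
    """Return True if a BLAST hit would cause the fusion to be filtered out.

    match_length: aligned length of the hit, 0 to 100
    identity_percent: percent identity of the hit, 0 to 100
    """
    if not 0 <= match_length <= 100:
        raise ValueError("match_length must be between 0 and 100")
    if not 0 <= identity_percent <= 100:
        raise ValueError("identity_percent must be between 0 and 100")
    # The rule is strictly "greater than", so a sum of exactly 160 is kept.
    return match_length + identity_percent > threshold


print(is_filtered_by_blast(95, 98))   # long, near-identical hit
print(is_filtered_by_blast(60, 100))  # sum is exactly 160
```

A hit that covers most of the 100-bp junction sequence at high identity suggests the "fusion" is really a single contiguous sequence elsewhere, which is why such candidates are dropped.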
Required directory structure
- (top_dir)
  - tophat_sample_1 - the output directory of tophat; you may want to run tophat on several samples.
  - tophat_sample_2
  - ...
  - tophat_sample_n
  - tophatfusion_out - the output directory of tophat-fusion-post
  - ensGene.txt
  - refGene.txt
  - blast - BLAST database
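Before running tophat-fusion-post, it can help to confirm that (top_dir) matches this layout. The following is a small sketch using only the Python standard library; `check_layout` is a hypothetical helper, and it treats the blast directory as expected even though the text above describes BLAST as highly recommended rather than mandatory. The tophatfusion_out directory is not checked because tophat-fusion-post writes it.

```python
from pathlib import Path

# Annotation files expected directly under (top_dir).
REQUIRED_FILES = ("ensGene.txt", "refGene.txt")


def check_layout(top_dir):
    """Return a list of problems found in the (top_dir) layout; empty if none."""
    top = Path(top_dir)
    problems = []
    for name in REQUIRED_FILES:
        if not (top / name).is_file():
            problems.append(f"missing file: {name}")
    if not (top / "blast").is_dir():
        problems.append("missing directory: blast")
    # tophat output directories follow the tophat_(sample_name) convention.
    samples = sorted(p.name for p in top.glob("tophat_*") if p.is_dir())
    if not samples:
        problems.append("no tophat_(sample_name) output directories found")
    return problems


if __name__ == "__main__":
    for problem in check_layout("."):
        print(problem)
```

Run from within (top_dir), the script prints one line per missing item and nothing when the layout is complete.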
Examining your output
TopHat-Fusion produces several files.
|