TopHat-Fusion is an enhanced version of TopHat with the ability to align reads across fusion points, which result from the breakage and re-joining of two different chromosomes or from rearrangements within a chromosome.

Open Source Software

Getting started

Downloading annotations

TopHat-Fusion requires several files:

  • Download Bowtie indexes (either Bowtie1 or Bowtie2), refGene.txt, and ensGene.txt from the links provided in the table below and extract them if necessary.
  • Download human_genomic*, other_genomic*, and nt* from the BLAST database (see the tutorial below for more details).

Organism               Data source  Version  Bowtie indexes     Annotation
Homo sapiens           UCSC         hg18     Bowtie1 / Bowtie2  refGene.txt / ensGene.txt
Homo sapiens           UCSC         hg19     Bowtie1 / Bowtie2  refGene.txt / ensGene.txt
Homo sapiens           UCSC         hg38     -                  refGene.txt / ensGene.txt
Mus musculus           UCSC         mm9      Bowtie1 / Bowtie2  refGene.txt / ensGene.txt
Rattus norvegicus      UCSC         rn4      Bowtie1            refGene.txt / ensGene.txt
Canis familiaris       UCSC         canFam2  Bowtie1            refGene.txt / ensGene.txt
Gallus gallus          UCSC         galGal3  Bowtie1            refGene.txt / ensGene.txt
Oryctolagus cuniculus  UCSC         oryCun2  Bowtie1            refGene.txt / ensGene.txt

Running TopHat-Fusion

TopHat-Fusion is built on TopHat, so it inherits all of TopHat's options and output formats (refer to the TopHat website for installation and basic information). The TopHat-Fusion algorithm is described in our poster at the CSHL Biology of Genomes conference. TopHat-Fusion consists of two sub-programs, tophat and tophat-fusion-post. Using RNA-Seq data from the breast cancer cell line MCF7 (Edgren et al., Genome Biology 2011), the following tutorial demonstrates how to use TopHat-Fusion to identify fusion genes, including three known fusions (BCAS4-BCAS3, ARFGEF2-SULF2, RPS6KB1-TMEM49).

The MCF7 RNA-Seq data (paired-end, 50 bp), along with data for BT474, SKBR3, and KPL4, can be downloaded from the table below.
Sample  Reads      -r and --mate-std-dev values
BT474   BT474_mix  -r 50 --mate-std-dev 80
SKBR3   SKBR3_mix  -r 50 --mate-std-dev 80
KPL4    SRR064287  -r 0 --mate-std-dev 80
MCF7    SRR064286  -r 0 --mate-std-dev 80

To run TopHat-Fusion:

tophat -o tophat_MCF7 -p 8 --fusion-search --keep-fasta-order --bowtie1 --no-coverage-search -r 0 --mate-std-dev 80 --max-intron-length 100000 --fusion-min-dist 100000 --fusion-anchor-length 13 --fusion-ignore-chromosomes chrM /path/to/h_sapiens/bowtie_index SRR064286_1.fastq SRR064286_2.fastq

  • Create a (top_dir) directory and run the above command from within it (see the required directory structure). If you have multiple samples, run them all under the same (top_dir).
  • Use tophat_(sample_name) as the output directory ("-o" option), for example tophat_MCF7. The sample name (MCF7) is used later for annotation purposes.
  • You can change the number of threads using the "-p" option.
  • Turn on the fusion algorithm (--fusion-search) and use Bowtie1 (--bowtie1).
  • Turn off coverage search (--no-coverage-search), which is slow and uses a lot of memory.
  • The mean fragment length of the data is 100 bp, so the inner mate distance is 0 (= 100 - 50 * 2). In this example, we use a larger standard deviation (80 bp) for the inner mate distance because TopHat-Fusion uses the region (mate_inner_dist ± std_dev) to discover fusions.
  • In addition to inter-chromosomal fusions, TopHat-Fusion tries to identify intra-chromosomal fusions, caused by rearrangements within a chromosome, whose two sides are separated by at least --fusion-min-dist.
  • A read supports a fusion if it maps to both sides of the fusion by at least --fusion-anchor-length bases.
  • In addition to the outputs from TopHat, TopHat-Fusion outputs a list of potential fusions (fusions.out - the first 2,000 out of 68,168 fusions) and a modified SAM alignment that represents "fusion" alignments using the 'F' CIGAR operator, although this operator is not supported by SAM tools.
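
The value passed to -r follows from the fragment-length arithmetic in the notes above. As a minimal sketch, assuming you know the mean fragment length and the read length of your library, the helper below computes it. The name inner_mate_dist is hypothetical and is not part of TopHat.

```shell
# Hypothetical helper (not part of TopHat): inner mate distance for paired-end
# reads, i.e. the mean fragment length minus the length of both mates.
inner_mate_dist() {
    fragment_len=$1
    read_len=$2
    echo $(( fragment_len - 2 * read_len ))
}

# MCF7 example from the text: 100-bp fragments, 50-bp reads.
inner_mate_dist 100 50   # prints 0
```

The result can be passed to tophat as -r "$(inner_mate_dist 100 50)". A negative result means the two mates overlap.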

tophat-fusion-post -p 8 --num-fusion-reads 1 --num-fusion-pairs 2 --num-fusion-both 5 /path/to/h_sapiens/bowtie_index

  • TopHat-Fusion uses BLAST search results to filter out false fusions caused by highly similar sequences or pseudogenes. The search results can also be used for annotation when there are no known genes in the provided annotation files. The 50-bp sequence on the left side of a fusion and the 50-bp sequence on the right side are combined into a 100-bp sequence, which is then BLASTed against the BLAST database. If match length (range: 0 to 100) + identity percent (0 to 100) is greater than 160, the fusion is filtered out. This BLAST step is usually done for the few hundred fusions that remain after the prior filtering steps. It is therefore highly recommended to install BLAST and download the BLAST database as follows.
    • Install BLAST binaries (blastall and blastn).
    • Create the (top_dir)/blast directory, download human_genomic*, other_genomic*, and nt* from the BLAST database, and extract them under (top_dir)/blast.
    • Use the --non-human option for genomes other than the human genome.
  • The final list of fusion candidates is given in (top_dir)/tophatfusion_out/result.html.
  • You may want to repeat the filtering process with different filtering parameters, such as --num-fusion-reads and --num-fusion-pairs, without deleting (top_dir)/tophatfusion_out, which tophat-fusion-post uses internally as a database for fast computation.
  • This program requires Bowtie1 and its index files, as it uses Bowtie1 internally, mostly for filtering purposes.
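
The BLAST filtering rule described above (match length + identity percent greater than 160) can be illustrated with a small sketch. This is not the tophat-fusion-post implementation; the function name is_blast_filtered is hypothetical, and integer inputs are assumed for simplicity.

```shell
# Hypothetical helper illustrating the filtering rule in the text; it is not
# the tophat-fusion-post implementation. Assumes integer inputs:
#   $1 = match length (0 to 100), $2 = identity percent (0 to 100).
# Returns success (exit status 0) when the fusion would be filtered out.
is_blast_filtered() {
    match_len=$1
    identity=$2
    [ $(( match_len + identity )) -gt 160 ]
}

if is_blast_filtered 95 98; then
    echo "filtered out"      # 95 + 98 = 193, which is greater than 160
fi
if ! is_blast_filtered 60 100; then
    echo "kept"              # 60 + 100 = 160, which is not greater than 160
fi
```

Note that the comparison is strict: a combined score of exactly 160 does not trigger the filter under the rule as stated.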

Required directory structure

  • (top_dir)
    • tophat_sample_1 - an output directory created by tophat; you may want to run tophat on several samples.
    • tophat_sample_2
    • ...
    • tophat_sample_n

    • tophatfusion_out - the output directory created by tophat-fusion-post

    • ensGene.txt
    • refGene.txt
    • blast - BLAST database
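
Before running tophat-fusion-post, it can help to confirm that (top_dir) matches the layout listed above. The following is a minimal sketch of such a check; check_layout is a hypothetical helper, not a TopHat-Fusion command, and it only tests that the expected files and directories exist, not that their contents are valid.

```shell
# Hypothetical helper (not part of TopHat-Fusion): verify that a top-level
# directory contains the annotation files, the blast directory, and at least
# one tophat_<sample> output directory. Problems are reported on stderr.
check_layout() {
    top_dir=$1
    status=0
    for f in refGene.txt ensGene.txt; do
        if [ ! -f "$top_dir/$f" ]; then
            echo "missing annotation file: $top_dir/$f" >&2
            status=1
        fi
    done
    if [ ! -d "$top_dir/blast" ]; then
        echo "missing BLAST database directory: $top_dir/blast" >&2
        status=1
    fi
    # Look for at least one tophat_<sample> directory produced by tophat.
    found=0
    for d in "$top_dir"/tophat_*/; do
        if [ -d "$d" ]; then
            found=1
        fi
    done
    if [ "$found" -eq 0 ]; then
        echo "no tophat_<sample> output directory under $top_dir" >&2
        status=1
    fi
    return "$status"
}
```

Calling check_layout . from within (top_dir) returns a non-zero status and lists anything missing. The tophatfusion_out directory is deliberately not checked, since it is an output of tophat-fusion-post.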

Examining your output

TopHat-Fusion produces several files.