Site Map

Getting started

Downloading annotations
Running TopHat-Fusion
Examining your output

Downloading annotations

TopHat-Fusion requires several files:

Download Bowtie indexes (either Bowtie1 or Bowtie2), refGene.txt, and ensGene.txt from the links provided in the table below and extract them if necessary.
Download human_genomic*, other_genomic*, and nt* from the BLAST database (see the tutorial below for more details).

Organism	Data source	Version	Bowtie indexes	Annotation
Homo sapiens	UCSC	hg18	Bowtie1 / Bowtie2	refGene.txt / ensGene.txt
		hg19	Bowtie1 / Bowtie2	refGene.txt / ensGene.txt
		hg38		refGene.txt / ensGene.txt
Mus musculus	UCSC	mm9	Bowtie1 / Bowtie2	refGene.txt / ensGene.txt
Rattus norvegicus	UCSC	rn4	Bowtie1	refGene.txt / ensGene.txt
Canis familiaris	UCSC	canFam2	Bowtie1	refGene.txt / ensGene.txt
Gallus gallus	UCSC	galGal3	Bowtie1	refGene.txt / ensGene.txt
Oryctolagus cuniculus	UCSC	oryCun2	Bowtie1	refGene.txt / ensGene.txt

Running TopHat-Fusion

TopHat-Fusion is built on TopHat, so it inherits all of TopHat's options and output formats (refer to the TopHat website for installation and basic information). The TopHat-Fusion algorithm is described in our poster at the CSHL Biology of Genomes conference. TopHat-Fusion consists of two sub-programs, tophat and tophat-fusion-post. Using RNA-Seq data from the breast cancer cell line MCF7 (Edgren et al., Genome Biology 2011), the following tutorial demonstrates how to use TopHat-Fusion to identify fusion genes, including three known fusions (BCAS4-BCAS3, ARFGEF2-SULF2, RPS6KB1-TMEM49).

MCF7 RNA-Seq data (paired-end 50-bp) along with BT474, SKBR3, and KPL4 can be downloaded from the table below.

Sample	Reads	-r and --mate-std-dev values
BT474	BT474_mix	-r 50 --mate-std-dev 80
SKBR3	SKBR3_mix	-r 50 --mate-std-dev 80
KPL4	SRR064287	-r 0 --mate-std-dev 80
MCF7	SRR064286	-r 0 --mate-std-dev 80

To run TopHat-Fusion:

tophat -o tophat_MCF7 -p 8 --fusion-search --keep-fasta-order --bowtie1 --no-coverage-search -r 0 --mate-std-dev 80 --max-intron-length 100000 --fusion-min-dist 100000 --fusion-anchor-length 13 --fusion-ignore-chromosomes chrM /path/to/h_sapiens/bowtie_index SRR064286_1.fastq SRR064286_2.fastq

Make a (top_dir) directory and run the above command under (top_dir); see the required directory structure. If you have multiple samples, you can run them all under (top_dir).

Use tophat_(sample_name) for the output directory ("-o" option) such as tophat_MCF7. The directory name (MCF7) will be used later for annotation purposes.

You can change the number of threads using the "-p" option.

Turn on the fusion algorithm (--fusion-search) and use Bowtie1 (--bowtie1).

Turn off "coverage-search", which takes lots of memory and is slow.

The mean fragment length of the data is 100 bp, so the inner mate distance is 0 (= 100 - 50 * 2). In this example, we use a larger standard deviation (80 bp) for the inner mate distance because TopHat-Fusion uses the region (mate_inner_dist ± std_dev) to discover fusions.

In addition to inter-chromosomal fusions, TopHat-Fusion tries to identify intra-chromosomal fusions, caused by rearrangements within a chromosome, whose two sides are separated by at least --fusion-min-dist.

A read supports a fusion if it maps to both sides of the fusion by at least --fusion-anchor-length.
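The inner mate distance arithmetic from the notes above can be sketched in a few lines of shell. The function names are hypothetical helpers and are not part of TopHat-Fusion; the second function only prints the bounds of the (mate_inner_dist ± std_dev) region and makes no claim about how TopHat-Fusion uses it internally.

```shell
# Inner mate distance = mean fragment length - 2 * read length (all in bp).
# Hypothetical helper, not a TopHat-Fusion command.
inner_mate_dist() {
    fragment_len=$1
    read_len=$2
    echo $(( fragment_len - 2 * read_len ))
}

# Lower and upper bound of the region (mate_inner_dist +/- std_dev).
inner_dist_window() {
    dist=$1
    std_dev=$2
    echo "$(( dist - std_dev )) $(( dist + std_dev ))"
}

# MCF7 example from the text: 100-bp fragments, paired-end 50-bp reads.
r=$(inner_mate_dist 100 50)
echo "-r $r --mate-std-dev 80"
inner_dist_window "$r" 80
```

The first value is what goes into "-r", and the second call shows how wide the region becomes with a standard deviation of 80.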

In addition to the outputs from TopHat, TopHat-Fusion outputs a list of potential fusions (fusions.out - the first 2,000 out of 68,168 fusions) and a modified SAM alignment that allows "fusion" alignments using an 'F' CIGAR operator, although this operator is not supported by SAMtools.
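If you run several samples under (top_dir), the per-sample "-r" values from the table above can be combined with the options of the MCF7 command. The sketch below only prints the commands; it does not run tophat. The helper name is hypothetical, and the (reads)_1.fastq / (reads)_2.fastq file naming is an assumption carried over from the MCF7 example that may not match the downloaded BT474_mix and SKBR3_mix files. The remaining options are copied from the MCF7 command and may need tuning for other data.

```shell
# Assumption: the same Bowtie1 index path as in the MCF7 example.
BOWTIE_INDEX=/path/to/h_sapiens/bowtie_index

# Print (do not run) a tophat command for one sample.
# Arguments: sample name, reads prefix, inner mate distance (-r).
print_tophat_cmd() {
    sample=$1
    reads=$2
    inner_dist=$3
    echo "tophat -o tophat_${sample} -p 8 --fusion-search --keep-fasta-order --bowtie1 --no-coverage-search -r ${inner_dist} --mate-std-dev 80 --max-intron-length 100000 --fusion-min-dist 100000 --fusion-anchor-length 13 --fusion-ignore-chromosomes chrM ${BOWTIE_INDEX} ${reads}_1.fastq ${reads}_2.fastq"
}

# Columns: sample, reads, -r value (taken from the table above).
while read -r sample reads inner_dist; do
    print_tophat_cmd "$sample" "$reads" "$inner_dist"
done <<EOF
BT474 BT474_mix 50
SKBR3 SKBR3_mix 50
KPL4 SRR064287 0
MCF7 SRR064286 0
EOF
```

Review the printed commands before running them under (top_dir), so that each sample ends up in its own tophat_(sample_name) directory.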

tophat-fusion-post -p 8 --num-fusion-reads 1 --num-fusion-pairs 2 --num-fusion-both 5 /path/to/h_sapiens/bowtie_index

TopHat-Fusion uses BLAST search results to filter out false fusions caused by highly similar sequences or pseudogenes. The search results can also be used for annotation when there are no known genes in the provided annotation files. The 50-bp sequence on the left side of a fusion and the 50-bp sequence on the right side are combined into a 100-bp sequence, which is then BLASTed against the BLAST database. If match length (range: 0 to 100) + identity percent (0 to 100) is greater than 160, the fusion is filtered out. This BLAST step is usually done for a few hundred fusions that remain after prior filtering steps. Thus, it is highly recommended to install BLAST and download the BLAST database as follows.

Install BLAST binaries (blastall and blastn).

Make a (top_dir)/blast directory, download human_genomic*, other_genomic*, and nt* from the BLAST database, and extract them under (top_dir)/blast.

Use the --non-human option for genomes other than the human genome.
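The BLAST filtering rule described above (match length + identity percent greater than 160) can be written as a small shell check. This is an illustrative sketch, not the code tophat-fusion-post runs; the function name and argument order are hypothetical, and awk is used only so that a fractional identity percent is handled.

```shell
# Succeeds (exit status 0) when a fusion would be filtered out under the
# rule in the text: match length + identity percent > 160.
# Hypothetical helper, not part of TopHat-Fusion.
is_filtered_by_blast() {
    match_len=$1      # 0 to 100
    identity_pct=$2   # 0 to 100, may be fractional
    awk -v m="$match_len" -v i="$identity_pct" 'BEGIN { exit !((m + i) > 160) }'
}

# 95 + 98.5 = 193.5, which is greater than 160.
if is_filtered_by_blast 95 98.5; then echo "filtered out"; else echo "kept"; fi

# 60 + 100 = 160, which is not greater than 160.
if is_filtered_by_blast 60 100; then echo "filtered out"; else echo "kept"; fi
```

Because the comparison is strict, a sum of exactly 160 is kept.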

The final list of fusion candidates is given in (top_dir)/tophatfusion_out/result.html.

You may want to repeat the filtering process with various filtering parameters such as --num-fusion-reads and --num-fusion-pairs without deleting (top_dir)/tophatfusion_out, which is a database tophat-fusion-post internally uses for fast computation.
This program requires Bowtie1 and the index files for Bowtie1, as it uses Bowtie1 internally mostly for filtering purposes.
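The notes above place the tophat output directories, the annotation files, and the BLAST database under one (top_dir). Before running tophat-fusion-post, a quick check of that layout might look like the sketch below. The function name is hypothetical, and treating a missing blast directory as a note rather than an error is an assumption based on BLAST being described as highly recommended.

```shell
# Print one line per missing item under the given (top_dir) and return
# non-zero if an annotation file or every tophat_(sample_name) directory
# is missing. Hypothetical helper, not part of TopHat-Fusion.
check_top_dir() {
    top_dir=$1
    missing=0
    for f in refGene.txt ensGene.txt; do
        [ -f "$top_dir/$f" ] || { echo "missing file: $f"; missing=1; }
    done
    # At least one tophat_(sample_name) output directory should exist.
    found=0
    for d in "$top_dir"/tophat_*/; do
        if [ -d "$d" ]; then found=1; fi
    done
    [ "$found" -eq 1 ] || { echo "missing directory: tophat_(sample_name)"; missing=1; }
    # The BLAST database is recommended, so its absence is only noted.
    [ -d "$top_dir/blast" ] || echo "note: no blast directory found"
    return "$missing"
}

check_top_dir . || echo "(top_dir) is not ready for tophat-fusion-post"
```

Run it from inside (top_dir), or pass the path to (top_dir) as the argument.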

Required directory structure

(top_dir)
    tophat_sample_1 - the output directory by tophat and you may want to run it on several samples.
    tophat_sample_2
    ...
    tophat_sample_n
    tophatfusion_out - the output directory by tophat-fusion-post
    ensGene.txt
    refGene.txt
    blast - BLAST database

Examining your output

TopHat-Fusion produces several files.

This research was supported in part by NIH grants R01-LM06845 and R01-GM083873.
Administrator: Daehwan Kim. Design by David Herreman

TopHat-Fusion

An algorithm for Discovery of Novel Fusion Transcripts
