Rascaf - improving genome assembly with RNA-seq data
Rascaf (RnA-seq SCAFfolder) is a fast and
efficient tool that leverages the long-range continuity information from
intron-spanning RNA-seq read pairs to detect new contig connections and
improve the assembly, particularly in gene regions.
Rascaf is described in:
Song, L., D. Shankar, L. Florea (2016). Rascaf: Improving genome assembly with RNA-seq data, The Plant Genome, doi: 10.3835/plantgenome2016.03.0027, In press [Full text]
Download Rascaf
here. Genome data for the Fragaria genomes (F. iinumae, F. orientalis, F. nipponica, F. nubicola and F. ananassa) after application of Rascaf can be downloaded here.
What is Rascaf?
Rascaf uses continuity and order information from
paired-end RNA-seq reads to improve a draft assembly, particularly in the
gene regions. It takes as input an assembly and one or several RNA-seq
data sets aligned to the genome and recruits additional contigs into
the assembly, potentially adjusting some scaffolds to better fit the data
and to create longer gene models. Rascaf works in three stages.
In stage 1, implemented in the executable rascaf, it
computes a set of candidate contig connections from the raw (original)
assembly that are supported by the RNA-seq data. The
user can choose to validate and filter the connections
by searching the merged gene sequences against public sequence
databases, in an optional stage 2. Finally, in stage 3, Rascaf uses these connections to select and/or
re-arrange additional contigs within scaffolds and chromosomes, an algorithm
implemented in the executable rascaf-join. When
run with multiple RNA-seq data sets, the program first generates a set
of connections for each set independently, and then reconciles all
connections during a 'join' step that detects and resolves any conflicts.
Download and installation procedure
Clone the GitHub repository, for instance:
git clone https://github.com/mourisl/rascaf.git
Follow the instructions in the README file for compiling.
A guide to using Rascaf's command line options
Rascaf consists of two executables, rascaf and
rascaf-join. The rascaf executable identifies the connections from a single
RNA-seq data set. The rascaf-join executable uses the connections found by
rascaf to build the scaffolds and, if applicable, to combine different data sets.
SYNOPSIS
rascaf [-OPTIONS]
OPTIONS
-b STRING    path to the BAM file for the alignment (required)
-f STRING    path to the raw assembly fasta file (recommended)
-o STRING    prefix of the output file (default: rascaf)
-ms INT      minimum support for connecting two contigs (default: 2)
-ml INT      minimum exonic length if no intron (default: 200)
-k INT       size of k-mer (≤ 32; default: 21)
-cs          output the genomic sequence involved in connections (default: not used)
-v           verbose mode (default: false)
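As an illustration of how the options listed for rascaf combine, the following sketch assembles a rascaf command line with explicit values for -ms, -ml and -k, and rejects a k-mer size above the documented limit of 32. The function name rascaf_cmd and its positional arguments are hypothetical helpers, not part of Rascaf. The function only prints the command instead of running it, and it assumes file names without spaces.

```shell
# Hypothetical helper (not part of Rascaf): print a rascaf command line.
# Usage: rascaf_cmd BAM ASSEMBLY PREFIX [MIN_SUPPORT] [MIN_EXON_LEN] [KMER]
rascaf_cmd() {
    bam=$1
    assembly=$2
    prefix=$3
    ms=${4:-2}      # documented default for -ms
    ml=${5:-200}    # documented default for -ml
    k=${6:-21}      # documented default for -k

    # The option list states that the k-mer size must be at most 32.
    if [ "$k" -gt 32 ]; then
        echo "error: -k must be at most 32 (got $k)" >&2
        return 1
    fi

    echo "./rascaf -b $bam -f $assembly -o $prefix -ms $ms -ml $ml -k $k"
}

# Example: require 3 supporting alignments and use 25-mers.
rascaf_cmd A.bam assembly.fa A 3 200 25
```

Printing the command first makes it easy to inspect the chosen settings before launching a run on a large alignment file.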
SYNOPSIS
rascaf-join [-OPTIONS]
OPTIONS
-r STRING    path to the rascaf connection file; repeat -r to specify multiple connection files (required)
-o STRING    prefix of the output file (default: rascaf_scaffold)
-ms INT      minimum support alignments for the connection (default: 2)
-ignoreGap   ignore the gap size, i.e. do not consider the number of Ns between contigs (default: not used)
EXAMPLE
Input files A.bam and B.bam contain RNA-seq alignments, and the raw assembly is in file assembly.fa. The new scaffolds will be reported in the file assembly_scaffold.fa.
./rascaf -b A.bam -f assembly.fa -o A
./rascaf -b B.bam -f assembly.fa -o B
./rascaf-join -r A.out -r B.out -o assembly_scaffold
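The same pattern extends to any number of RNA-seq data sets: one rascaf run per BAM file, followed by a single rascaf-join run that receives one -r option per connection file. The sketch below prints the corresponding commands. The function name rascaf_commands is hypothetical, and the PREFIX.out naming of the connection files is inferred from the example above rather than from a formal specification. The function only prints commands and assumes file names without spaces.

```shell
# Hypothetical helper (not part of Rascaf): print the commands for a
# multi-data-set run.
# Usage: rascaf_commands ASSEMBLY.fa OUT_PREFIX FILE1.bam [FILE2.bam ...]
rascaf_commands() {
    assembly=$1
    out_prefix=$2
    shift 2

    join_args=""
    for bam in "$@"; do
        # Use the BAM file name without its extension as the output prefix.
        prefix=$(basename "$bam" .bam)
        echo "./rascaf -b $bam -f $assembly -o $prefix"
        # Assumption taken from the example: rascaf -o PREFIX writes PREFIX.out.
        join_args="$join_args -r $prefix.out"
    done

    # One join step reconciles the connections from all data sets.
    echo "./rascaf-join$join_args -o $out_prefix"
}

rascaf_commands assembly.fa assembly_scaffold A.bam B.bam
```

Called with A.bam and B.bam, the function prints the three commands of the example. Once the printed commands look right, they can be saved to a file and executed with sh.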
This work was supported in part by NSF grant IOS-1339134 to Liliana Florea.