Rascaf - improving genome assembly with RNA-seq data


Rascaf (RnA-seq SCAFfolder) is a fast and efficient tool that leverages the long-range continuity information from intron-spanning RNA-seq read pairs to detect new contig connections and improve the assembly, particularly in gene regions.


Rascaf is described in:
Song, L., D. Shankar, and L. Florea (2016). Rascaf: Improving genome assembly with RNA-seq data. The Plant Genome, doi: 10.3835/plantgenome2016.03.0027 (in press).


Download Rascaf here. Genome data for the Fragaria genomes (F. iinumae, F. orientalis, F. nipponica, F. nubicola and F. ananassa) after application of Rascaf can be downloaded here.


What is Rascaf?


Rascaf uses continuity and order information from paired-end RNA-seq reads to improve a draft assembly, particularly in the gene regions. It takes as input an assembly and one or several RNA-seq data sets aligned to the genome and recruits additional contigs into the assembly, potentially adjusting some scaffolds to better fit the data and to create longer gene models. Rascaf works in three stages. In stage 1, implemented in the executable rascaf, it computes a set of candidate contig connections from the raw (original) assembly that are supported by the RNA-seq data. The user can choose to validate and filter the connections by searching the merged gene sequences against public sequence databases, in an optional stage 2. Finally, in stage 3, Rascaf uses these connections to select and/or re-arrange additional contigs within scaffolds and chromosomes, an algorithm implemented in the executable rascaf-join. When run with multiple RNA-seq data sets, the program first generates a set of connections for each set independently, and then reconciles all connections during a 'join' step that detects and resolves any conflicts.



Download and installation procedure


Clone the GitHub repository, for instance:

git clone https://github.com/mourisl/rascaf.git


Follow the instructions in the README file for compiling.



A guide to using Rascaf's command line options


Rascaf consists of two executables, rascaf and rascaf-join. The rascaf executable identifies the connections from a single RNA-seq data set. The rascaf-join executable uses the connections found by rascaf to build the scaffolds and, if applicable, to combine connections from different data sets.


SYNOPSIS

rascaf [-OPTIONS]


OPTIONS

-b STRING   path to the BAM file for the alignment (required)
-f STRING   path to the raw assembly fasta file (recommended)
-o STRING   prefix of the output file (default: rascaf)
-ms INT   minimum support for connecting two contigs (default: 2)
-ml INT   minimum exonic length if no intron (default: 200)
-k INT   size of k-mer (≤ 32; default: 21)
-cs   output the genomic sequence involved in connections (default: not used)
-v   verbose mode (default: false)

SYNOPSIS

rascaf-join [-OPTIONS]


OPTIONS

-r STRING   path to the rascaf connection file. Can use multiple -r to specify multiple connection files (required)
-o STRING   prefix of the output file (default: rascaf_scaffold)
-ms INT   minimum support alignments for the connection (default: 2)
-ignoreGap   ignore the gap size, i.e., do not consider the number of Ns between contigs (default: not used)

EXAMPLE


Input files A.bam and B.bam contain RNA-seq alignments, and the raw assembly is in file assembly.fa. The new scaffolds will be reported in the file assembly_scaffold.fa.

./rascaf -b A.bam -f assembly.fa -o A
./rascaf -b B.bam -f assembly.fa -o B
./rascaf-join -r A.out -r B.out -o assembly_scaffold
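
The same pattern extends to any number of RNA-seq data sets. The following is a minimal sketch, not part of the Rascaf distribution: the function names rascaf_pipeline and run, and the variables RASCAF_DIR and DRY_RUN, are hypothetical helpers introduced here for illustration. It assumes, as in the example above, that running rascaf with -o PREFIX produces a connection file named PREFIX.out.

```shell
#!/bin/sh
# Sketch only: rascaf_pipeline, run, RASCAF_DIR and DRY_RUN are hypothetical
# names used for illustration; they are not part of Rascaf itself.

RASCAF_DIR="${RASCAF_DIR:-.}"   # directory holding rascaf and rascaf-join
DRY_RUN="${DRY_RUN:-1}"         # 1 = only print the commands, 0 = execute them

# Print the command in dry-run mode, otherwise execute it.
run() {
    if [ "$DRY_RUN" = "1" ]; then
        printf '%s\n' "$*"
    else
        "$@"
    fi
}

# Usage: rascaf_pipeline ASSEMBLY OUT_PREFIX BAM [BAM...]
rascaf_pipeline() {
    if [ "$#" -lt 3 ]; then
        echo "usage: rascaf_pipeline ASSEMBLY OUT_PREFIX BAM [BAM...]" >&2
        return 1
    fi
    assembly="$1"
    out_prefix="$2"
    shift 2

    # Find connections for each BAM file independently (one rascaf run per
    # data set), replacing each BAM argument by the matching "-r PREFIX.out".
    n="$#"
    i=0
    while [ "$i" -lt "$n" ]; do
        bam="$1"
        shift
        prefix="$(basename "$bam" .bam)"
        run "$RASCAF_DIR/rascaf" -b "$bam" -f "$assembly" -o "$prefix" || return 1
        # Assumption: the connection file is named PREFIX.out, as in the example.
        set -- "$@" -r "$prefix.out"
        i=$((i + 1))
    done

    # Reconcile all connection files in a single rascaf-join run.
    run "$RASCAF_DIR/rascaf-join" "$@" -o "$out_prefix"
}
```

With the default DRY_RUN=1, calling `rascaf_pipeline assembly.fa assembly_scaffold A.bam B.bam` only prints the three commands of the example above, which makes it easy to check the command lines before setting DRY_RUN=0 to execute them.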


This work was supported in part by NSF grant IOS-1339134 to Liliana Florea.