Rascaf (RnA-seq SCAFfolder) is a fast and efficient tool that leverages the long-range continuity information from intron-spanning RNA-seq read pairs to detect new contig connections and improve the assembly, particularly in gene regions.
Rascaf is described in:
Song, L., D. Shankar, L. Florea (2016). Rascaf: Improving genome assembly with RNA-seq data, The Plant Genome, doi: 10.3835/plantgenome2016.03.0027, In press [Full text]
Rascaf uses continuity and order information from paired-end RNA-seq reads to improve a draft assembly, particularly in the gene regions. It takes as input an assembly and one or several RNA-seq data sets aligned to the genome and recruits additional contigs into the assembly, potentially adjusting some scaffolds to better fit the data and to create longer gene models. Rascaf works in three stages. In stage 1, implemented in the executable rascaf, it computes a set of candidate contig connections from the raw (original) assembly that are supported by the RNA-seq data. The user can choose to validate and filter the connections by searching the merged gene sequences against public sequence databases, in an optional stage 2. Finally, in stage 3, Rascaf uses these connections to select and/or re-arrange additional contigs within scaffolds and chromosomes, an algorithm implemented in the executable rascaf-join. When run with multiple RNA-seq data sets, the program first generates a set of connections for each set independently, and then reconciles all connections during a 'join' step that detects and resolves any conflicts.
Clone the GitHub repository, for instance:
git clone https://github.com/mourisl/rascaf.git
Follow the instructions in the README file for compiling.
Rascaf consists of two executables, rascaf and rascaf-join. The rascaf executable identifies the connections supported by a single RNA-seq data set. The rascaf-join executable uses the connections found by rascaf to build the scaffolds and, if applicable, to combine the connections from different data sets.
Options for rascaf:
-b STRING   path to the BAM file for the alignment (required)
-f STRING   path to the raw assembly fasta file (recommended)
-o STRING   prefix of the output file (default: rascaf)
-ms INT     minimum support for connecting two contigs (default: 2)
-ml INT     minimum exonic length if no intron (default: 200)
-k INT      size of k-mer (≤ 32; default: 21)
-cs         output the genomic sequence involved in connections (default: not used)
-v          verbose mode (default: false)
Options for rascaf-join:
-r STRING   path to the rascaf connection file; repeat -r to specify multiple connection files (required)
-o STRING   prefix of the output file (default: rascaf_scaffold)
-ms INT     minimum number of supporting alignments for a connection (default: 2)
-ignoreGap  ignore the gap size, i.e., do not consider the number of Ns between contigs (default: not used)
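As an illustration of the rascaf options, the sketch below assembles a command line with a non-default minimum support and k-mer size, and rejects a k-mer size above the documented limit of 32 before anything is run. The helpers check_kmer and rascaf_cmd are hypothetical and not part of Rascaf, the lower bound of 1 is an assumption, and the file names are placeholders. The script only prints the command.

```shell
#!/bin/sh
# Sketch: build a rascaf command line with non-default -ms and -k values.
# check_kmer and rascaf_cmd are hypothetical helpers, not part of Rascaf.
set -eu

# Accept only integers from 1 to 32. The upper bound is the documented
# limit for -k; the lower bound of 1 is an assumption.
check_kmer() {
    case "$1" in
        ''|*[!0-9]*) return 1 ;;   # empty or contains a non-digit
    esac
    [ "$1" -ge 1 ] && [ "$1" -le 32 ]
}

# Arguments: BAM file, output prefix, minimum support, k-mer size.
# Prints the command instead of executing it.
rascaf_cmd() {
    if ! check_kmer "$4"; then
        echo "k-mer size must be an integer between 1 and 32: $4" >&2
        return 1
    fi
    echo "./rascaf -b $1 -f assembly.fa -o $2 -ms $3 -k $4"
}

rascaf_cmd A.bam A 3 25
```

Raising -ms asks for more supporting evidence before two contigs are connected, so it is the natural option to adjust when the default of 2 yields connections that look doubtful.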
Input files A.bam and B.bam contain RNA-seq alignments, and the raw assembly is in file assembly.fa. The new scaffolds will be reported in the file assembly_scaffold.fa.
./rascaf -b A.bam -f assembly.fa -o A
./rascaf -b B.bam -f assembly.fa -o B
./rascaf-join -r A.out -r B.out -o assembly_scaffold
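The same pattern extends to any number of RNA-seq data sets: one rascaf run per BAM file, followed by a single rascaf-join run that receives one -r per connection file. The following is a minimal sketch, assuming the executables are in the current directory and that rascaf writes its connections to PREFIX.out, as in the example above. The function scaffold_with_rascaf, the helper run, and the DRY_RUN variable are hypothetical names introduced here. By default the script only prints the commands.

```shell
#!/bin/sh
# Sketch: run rascaf once per BAM file, then combine all connection
# files in a single rascaf-join call.
# Assumptions: ./rascaf and ./rascaf-join are in the current directory,
# and rascaf writes its connections to <prefix>.out.
set -eu

# Hypothetical switch: 1 (default) prints the commands, 0 executes them.
DRY_RUN="${DRY_RUN:-1}"

run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Arguments: assembly fasta file, then one or more BAM files.
scaffold_with_rascaf() {
    assembly="$1"
    shift
    # The loop list is expanded once, so the positional parameters can be
    # rewritten inside the loop: each BAM file is dropped from the front
    # and its "-r <prefix>.out" pair is appended at the end.
    for bam in "$@"; do
        prefix=$(basename "$bam" .bam)
        run ./rascaf -b "$bam" -f "$assembly" -o "$prefix"
        set -- "$@" -r "$prefix.out"
        shift
    done
    # "$@" now holds only the -r arguments for rascaf-join.
    run ./rascaf-join "$@" -o assembly_scaffold
}

scaffold_with_rascaf assembly.fa A.bam B.bam
```

Running the script with DRY_RUN=0 executes the commands instead of printing them. Because the output prefix is derived from the BAM file name, two inputs with the same base name in different directories would overwrite each other's connection files.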
This work was supported in part by NSF grant IOS-1339134 to Liliana Florea.