Rascaf

Rascaf - improving genome assembly with RNA-seq data

Rascaf (RnA-seq SCAFfolder) is a fast and efficient tool that uses the long-range continuity information in intron-spanning RNA-seq read pairs to detect new contig connections and improve the assembly, particularly in gene regions.

Rascaf is described in:
Song, L., D. Shankar, L. Florea (2016). Rascaf: Improving genome assembly with RNA-seq data, The Plant Genome, doi: 10.3835/plantgenome2016.03.0027, In press [Full text]

What is Rascaf?
Download and installation procedure
A guide to using Rascaf's command line options

Download Rascaf here. Genome data for the Fragaria genomes (F. iinumae, F. orientalis, F. nipponica, F. nubicola and F. ananassa) after application of Rascaf can be downloaded here.

What is Rascaf?

Rascaf uses continuity and order information from paired-end RNA-seq reads to improve a draft assembly, particularly in the gene regions. It takes as input an assembly and one or several RNA-seq data sets aligned to the genome and recruits additional contigs into the assembly, potentially adjusting some scaffolds to better fit the data and to create longer gene models. Rascaf works in three stages. In stage 1, implemented in the executable rascaf, it computes a set of candidate contig connections from the raw (original) assembly that are supported by the RNA-seq data. The user can choose to validate and filter the connections by searching the merged gene sequences against public sequence databases, in an optional stage 2. Finally, in stage 3, Rascaf uses these connections to select and/or re-arrange additional contigs within scaffolds and chromosomes, an algorithm implemented in the executable rascaf-join. When run with multiple RNA-seq data sets, the program first generates a set of connections for each set independently, and then reconciles all connections during a 'join' step that detects and resolves any conflicts.

Download and installation procedure

Clone the GitHub repository, for instance:

git clone https://github.com/mourisl/rascaf.git

Follow the instructions in the README file for compiling.

A guide to using Rascaf's command line options

Rascaf consists of two executables, rascaf and rascaf-join. The rascaf executable identifies the connections supported by a single RNA-seq data set. The rascaf-join executable uses the connections found by rascaf to build the scaffolds and, when several data sets are given, to combine their connections.

SYNOPSIS

rascaf [-OPTIONS]

OPTIONS

-b STRING path to the BAM file for the alignment (required)

-f STRING path to the raw assembly fasta file (recommended)

-o STRING prefix of the output file (default: rascaf)

-ms INT minimum support for connecting two contigs (default: 2)

-ml INT minimum exonic length if no intron (default: 200)

-k INT size of k-mer (≤ 32; default: 21)

-cs output the genomic sequence involved in connections (default: not used)

-v verbose mode (default: false)
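
The only option above with a stated upper bound is -k (at most 32). A minimal sketch of a pre-flight check that could be run before launching rascaf is shown below. The function name valid_kmer is hypothetical and not part of Rascaf, and the lower bound of 1 is an assumption. The script only prints the command it would run.

```shell
#!/bin/sh
# Hypothetical helper, not part of Rascaf: accept a k-mer size only if it is
# an integer in the range 1..32 (the lower bound of 1 is an assumption).
valid_kmer() {
    case $1 in
        ''|*[!0-9]*) return 1 ;;   # reject empty or non-numeric input
    esac
    [ "$1" -ge 1 ] && [ "$1" -le 32 ]
}

k=25
if valid_kmer "$k"; then
    # Print, rather than run, a rascaf call that overrides some defaults.
    echo "./rascaf -b A.bam -f assembly.fa -o A -ms 3 -k $k -cs"
else
    echo "k-mer size must be an integer between 1 and 32" >&2
fi
```

Here -ms 3 raises the minimum support from its default of 2, and -cs additionally requests the genomic sequence involved in each connection.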

SYNOPSIS

rascaf-join [-OPTIONS]

OPTIONS

-r STRING path to the rascaf connection file. Can use multiple -r to specify multiple connection files (required)
-o STRING prefix of the output file (default: rascaf_scaffold)
-ms INT minimum number of supporting alignments for a connection (default: 2)
-ignoreGap ignore the gap size, that is, do not consider the number of Ns between contigs (default: not used)

EXAMPLE

Input files A.bam and B.bam contain RNA-seq alignments, and the raw assembly is in file assembly.fa. The new scaffolds will be reported in the file assembly_scaffold.fa.

./rascaf -b A.bam -f assembly.fa -o A

./rascaf -b B.bam -f assembly.fa -o B

./rascaf-join -r A.out -r B.out -o assembly_scaffold
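
With more than two alignment files, the same pattern can be generated instead of typed by hand. The sketch below is a hypothetical helper (rascaf_commands is not part of Rascaf) that only prints the commands. It assumes that a prefix passed to -o yields a connection file named prefix.out, as in the example above, and that file names contain no spaces.

```shell
#!/bin/sh
# Hypothetical helper, not part of Rascaf: print one rascaf command per BAM
# file, followed by a single rascaf-join command that merges all connections.
rascaf_commands() {
    assembly=$1
    shift
    join_args=""
    for bam in "$@"; do
        # Use the BAM file name without its extension as the output prefix.
        prefix=$(basename "$bam" .bam)
        echo "./rascaf -b $bam -f $assembly -o $prefix"
        # Assumption taken from the example above: the connections for a
        # given prefix are written to <prefix>.out.
        join_args="$join_args -r $prefix.out"
    done
    echo "./rascaf-join$join_args -o assembly_scaffold"
}

rascaf_commands assembly.fa A.bam B.bam
```

Called with A.bam and B.bam, the function prints the three commands of the example above. Piping its output to sh would execute them.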

This work was supported in part by NSF grant IOS-1339134 to Liliana Florea.
