CCB » CBCC » Quake

Manual

0. Quick Start

For a microbial sized genome, the easiest way to run Quake is with the Python script quake.py. Create a file listing the fastq files that you want to correct, one file per line for nonpaired reads and two files per line separated by whitespace for paired reads.

Command:
quake.py -f [fastq file list] -k [k-mer size] -p 4

For descriptions of additional options, type quake.py -h.

Alternatively, if you encounter problems or are analyzing a large dataset, you can run the 3 steps individually as described below.

1. Count q-mers

Included in the package is a program called count-qmers, which is a variation of standard k-mer counting that computes an approximation of the expected count of each k-mer using the read quality values. count-qmers is single-threaded which is sufficient for microbial sized genomes, but ill-suited for large genomes. Therefore, we recommend using additional software called JELLYFISH, an extremely fast, parallel k-mer counter, which Quake now looks for as the default. Alternatively, you could run on a Hadoop cluster. We include a program reduce-qmers to aid in doing so. If you have your own favorite k-mer counting program, feel free to plug it in here by having it produce an output file that lists "kmer \t count" one per line.

Command:
cat [fastq files] | count-qmers -k [k-mer size] > [counts file]

Type count-qmers -h for additional options.

2. Coverage cutoff

Next, using a histogram of the k-mer counts, we must find the coverage at which we will differentiate true and error k-mers. cov_model.py will make the histogram and call an R script cov_model_qmer.r that will output the threshold. Note that this R script requires the library VGAM which can be installed with the install.packages command in R. If you counted k-mers rather than q-mers, use the option --int. The --ratio options specifies how conservative we should be in choosing the cutoff. The default is 200 which says to choose the cutoff such that the probability that the coverage came from the error distribution is 200 times greater than the probability that the coverage came from the true distribution. Generally we set it >> 1 because misclassifying true k-mers as errors leads to reads from low coverage areas of the genome being removed.

Command:
cov_model.py [counts file]

Type cov_model.py -h for additional options.

3. Correct reads

To correct reads and output a set of corrected read fastq files, run the C++ program correct. Create a file listing the fastq files that you want to correct, one file per line for nonpaired reads and two files per line separated by whitespace for paired reads. The program uses openMP to parallelize the computation, and the -p option specifies the number of threads to use. Trusted k-mers are stored in the Boost library's dynamic_bitset (so you need Boost to compile the program), which requires 4^k bits- 128 Mb for k=15 but 32 Gb for k=19.

Command:
correct -f [fastq list file] -k [k-mer size] -c [cutoff] -m [counts file] -p 4

Type correct -h for additional options.

4. Output

Running Quake on paired ends read files A.fastq and B.fastq will produce the following output in which files in rows maintain the read pairs.

A.cor.fastq B.cor.fastq Validated and corrected reads
A.cor_single.fastq B.err_single.fastq Validated and corrected reads from A, error reads from B
A.err_single.fastq B.cor_single.fastq Validated and corrected reads from B, error reads from A
A.err.fastq B.err.fastq Error reads