Kraken is a system for assigning taxonomic labels to short DNA sequences, usually obtained through metagenomic studies. Earlier programs for this task often relied on sequence alignment or machine learning techniques that were quite slow, which led to the development of less sensitive but much faster abundance estimation programs. Kraken aims to achieve both high sensitivity and high speed by using exact alignments of k-mers and a novel classification algorithm.
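To give a rough sense of what classifying a read by exact k-mer matches can look like, here is a minimal Python sketch. It is not Kraken's algorithm: the toy genomes, the taxon names, the value k = 5, and the simple majority vote are all assumptions made for illustration, and Kraken's own classification algorithm is described in the paper cited on this page.

```python
from collections import Counter

K = 5  # illustrative k-mer length (assumption; chosen small for the toy data)

def kmers(seq, k=K):
    """Yield every overlapping k-mer in seq."""
    for i in range(len(seq) - k + 1):
        yield seq[i:i + k]

def build_db(genomes, k=K):
    """Map each k-mer to the set of taxa whose genome contains it."""
    db = {}
    for taxon, genome in genomes.items():
        for kmer in kmers(genome, k):
            db.setdefault(kmer, set()).add(taxon)
    return db

def classify(read, db, k=K):
    """Return the taxon with the most exact k-mer hits, or None if no k-mer matches."""
    votes = Counter()
    for kmer in kmers(read, k):
        taxa = db.get(kmer)
        # Count only k-mers that point to exactly one taxon (a simplification).
        if taxa is not None and len(taxa) == 1:
            votes[next(iter(taxa))] += 1
    if not votes:
        return None
    return votes.most_common(1)[0][0]

# Hypothetical reference "genomes", far shorter than anything realistic.
genomes = {
    "taxon_A": "ACGTACGGTTCAGGCTA",
    "taxon_B": "TTGACCATGGACCTGAA",
}
db = build_db(genomes)

print(classify("CGGTTCAGG", db))  # read drawn from taxon_A
print(classify("CCATGGACC", db))  # read drawn from taxon_B
print(classify("GGGGGGGGG", db))  # no k-mer in the database: unclassified
```

Because every lookup is an exact dictionary hit, a read costs roughly one hash lookup per k-mer, which is the intuition behind the speed of exact-match approaches. A read with no matching k-mer is left unclassified rather than forced into a label.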
In its fastest mode of operation, on a simulated metagenome of 100 bp reads, Kraken processed over 4 million reads per minute on a single core, which is over 900 times faster than Megablast and over 11 times faster than the abundance estimation program MetaPhlAn. Kraken's accuracy is comparable to that of Megablast, with slightly lower sensitivity and very high precision.
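The Megablast comparison can be checked against the speed figures in the results table under "Accuracy and speed". The short Python sketch below does that arithmetic; the dictionary and the `speedup` helper are illustrative names, and only the two numbers are taken from the table. The table lists no speed for MetaPhlAn, so the 11-fold figure cannot be checked the same way.

```python
# Speeds in reads per minute, copied from the results table on this page.
speeds = {
    "Megablast w/ best hit": 4511,
    "Kraken (quick operation)": 4101162,
}

def speedup(fast, slow):
    """Return how many times faster `fast` is than `slow`."""
    return speeds[fast] / speeds[slow]

ratio = speedup("Kraken (quick operation)", "Megablast w/ best hit")
print(f"{ratio:.0f}x")  # prints "909x"
```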
Kraken is written in C++ and Perl, and is designed for use with the Linux operating system.
- Source (50 KB): Kraken's source code, installer, and README. The current version of Kraken is v0.10.4-beta (released Mar. 30, 2014). The source code is also available in a GitHub repository.
- MiniKraken DB (2.7 GB): A pre-built 4 GB database constructed from complete bacterial, archaeal, and viral genomes in RefSeq (as of Dec. 8, 2014). It is intended for users who lack the computational resources needed to build a Kraken database.
- README: Kraken's operating manual.
- Accuracy data (1.8 MB): The data used to evaluate the accuracy of Kraken (and other classifiers); contains three FASTA files and instructions for obtaining the source taxonomic IDs for each sequence. These simulated metagenomic samples (each 10,000 reads) were also used to evaluate the speed of the non-Kraken classifiers.
- Timing data (1.8 GB): The data used to evaluate the speed of Kraken (and MetaPhlAn); contains three FASTA files, each containing 10,000,000 reads.
Accuracy and speed
Although we tested Kraken on real sequence data from isolate genomes, the biggest challenge for an exact alignment approach is maintaining sensitivity when reads diverge strongly from the training data (in this case, Kraken's genomic library). To assess Kraken's sensitivity on such sequences, we created a simulated metagenomic dataset of 100 bp reads with high sequencing error (2.1% SNP rate, 1.1% indel rate). Below are the results of running various classifiers on this dataset, with accuracy evaluated on a per-read basis (these results used January 2013 data to build each classifier's reference library):
|Classifier|Precision (%)|Sensitivity (%)|Speed (reads/min)|
|---|---|---|---|
|Naïve Bayes Classifier|97.64|97.64|7|
|PhymmBL (conf. > 0.65)|99.08|95.45|76|
|Megablast w/ best hit|96.93|93.67|4511|
|Kraken (quick operation)|99.92|89.54|4101162|
|MiniKraken (Kraken w/ 4GB DB)|99.95|65.87|1441476|
|MiniKraken (quick operation)|99.98|65.31|2693119|
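For readers who want to compute comparable per-read figures on their own data, the following Python sketch shows one common way to do it. It assumes that precision is the share of correct labels among the reads a classifier chose to label, and that sensitivity is the share of correct labels among all reads; the function name, the read IDs, and the taxon names are hypothetical, and the exact evaluation procedure used for the table is described in the paper cited on this page.

```python
def precision_and_sensitivity(truth, predicted):
    """Return (precision, sensitivity) as percentages.

    truth:     dict mapping read ID -> true taxon
    predicted: dict mapping read ID -> assigned taxon, or None if unclassified
    """
    total = len(truth)
    # Reads the classifier chose to label at all.
    classified = [r for r in truth if predicted.get(r) is not None]
    # Labeled reads whose label matches the truth.
    correct = sum(1 for r in classified if predicted[r] == truth[r])
    precision = 100.0 * correct / len(classified) if classified else 0.0
    sensitivity = 100.0 * correct / total if total else 0.0
    return precision, sensitivity

# Hypothetical example: four reads, one left unclassified, one mislabeled.
truth = {"r1": "A", "r2": "A", "r3": "B", "r4": "B"}
predicted = {"r1": "A", "r2": "B", "r3": "B", "r4": None}

precision, sensitivity = precision_and_sensitivity(truth, predicted)
print(f"precision={precision:.2f} sensitivity={sensitivity:.2f}")
```

Under these definitions, a classifier that labels every read has equal precision and sensitivity, which is consistent with the Naïve Bayes Classifier row above. A classifier that declines to label uncertain reads can raise its precision at the cost of sensitivity.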
Wood DE, Salzberg SL: Kraken: ultrafast metagenomic sequence classification using exact alignments. Genome Biology 2014, 15:R46.
We encourage users to share their questions and experiences with Kraken. We have established a users group (Kraken-users on Google Groups) for discussion, as well as an email address for direct questions (firstname.lastname@example.org).