About Kraken
Kraken is a system for assigning taxonomic labels to short DNA sequences, usually obtained through metagenomic studies. Previous attempts by other bioinformatics software to accomplish this task have often used sequence alignment or machine learning techniques that were quite slow, leading to the development of less sensitive but much faster abundance estimation programs. Kraken aims to achieve high sensitivity and high speed by utilizing exact alignments of k-mers and a novel classification algorithm.
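To make the idea of exact k-mer matching concrete, here is a toy sketch in Python. It is not Kraken's implementation: the function names, the tiny reference sequences, the small `k`, and the simple majority vote are all assumptions made for illustration, and the vote stands in for Kraken's own classification algorithm, which is described in the reference paper.

```python
from collections import Counter


def kmers(seq, k):
    """Return every overlapping substring of length k in seq."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def build_index(references, k):
    """Map each k-mer to the set of labels whose reference contains it."""
    index = {}
    for label, seq in references.items():
        for kmer in kmers(seq, k):
            index.setdefault(kmer, set()).add(label)
    return index


def classify(read, index, k):
    """Label a read by majority vote over its unambiguous exact k-mer hits.

    Returns None when no k-mer of the read is found in the index.
    """
    votes = Counter()
    for kmer in kmers(read, k):
        labels = index.get(kmer, ())
        if len(labels) == 1:  # count only k-mers unique to one label
            votes[next(iter(labels))] += 1
    return votes.most_common(1)[0][0] if votes else None


# Hypothetical toy data; real databases use much longer k-mers and genomes.
references = {"taxon_A": "ACGTACGTTAGC", "taxon_B": "TTGACCGATGCA"}
index = build_index(references, k=5)
label = classify("CGTTAGC", index, k=5)
```

Because lookups are exact dictionary hits rather than alignments, each k-mer costs a single lookup, which is the property that makes this family of methods fast.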
In its fastest mode of operation, for a simulated metagenome of 100 bp reads, Kraken processed over 4 million reads per minute on a single core, over 900 times faster than Megablast and over 11 times faster than the abundance estimation program MetaPhlAn. Kraken's accuracy is comparable with Megablast, with slightly lower sensitivity and very high precision.
Kraken is written in C++ and Perl, and is designed for use with the Linux operating system. We have also successfully compiled and run it under Mac OS.
2022/09/29 Update: As of September 29, Kraken 1 is no longer supported. Please use KrakenUniq or Kraken 2. For guidance on which software version to choose, see Choosing a Metagenomics Classification Tool.
- Kraken 1 remains available via the Kraken 1 GitHub page.
- KrakenUniq is an improved version of Kraken 1, with the same ultra-low false-positive (FP) rate, which adds features described in a newer paper, here, and on the KrakenUniq GitHub page.
- Kraken 2 is a newer implementation of Kraken that uses much less memory but has a higher FP rate than Kraken 1/KrakenUniq. Kraken 2 now also includes the k-mer-counting features of KrakenUniq (see Kraken 2's webpage for additional details).
Downloads and Documents
- Kraken 2 source code release: The current version of Kraken (v2) can be found in its GitHub repository.
- The previous version of Kraken (v1) is still available in its own repository.
Note: the databases below were built for Kraken v1.
- MiniKraken DB_4GB (2.9 GB): A pre-built 4 GB database constructed from complete bacterial, archaeal, and viral genomes in RefSeq (as of Oct. 18, 2017). This can be used by users without the computational resources needed to build a Kraken database. However, it contains only 2.7% of the k-mers from the original database.
- DustMasked MiniKraken DB 4GB (2.9 GB): A 4 GB database constructed from dustmasked bacterial, archaeal, and viral genomes in RefSeq (as of Oct. 18, 2017).
- Bracken files for this database can be found at https://ccb.jhu.edu/software/bracken/
- seqid2taxid.map (11 MB)
- MiniKraken DB_8GB (6.0 GB): A pre-built 8 GB database constructed from complete bacterial, archaeal, and viral genomes in RefSeq (as of Oct. 18, 2017). This can be used by users without the computational resources needed to build a Kraken database. It contains around 5% of the k-mers from the original standard database.
- DustMasked MiniKraken DB 8GB (6.0 GB): An 8 GB database constructed from dustmasked bacterial, archaeal, and viral genomes in RefSeq (as of Oct. 18, 2017).
- Bracken files for this database can be found at https://ccb.jhu.edu/software/bracken/
- seqid2taxid.map (11 MB)
- Kraken's operating manual (html). Please use this guide for installing and running Kraken.
- Accuracy data (1.8 MB): The data used to evaluate the accuracy of Kraken (and other classifiers); contains three FASTA files and instructions for obtaining the source taxonomic IDs for each sequence. These simulated metagenomic samples (each 10,000 reads) were also used to evaluate the speed of the non-Kraken classifiers.
- Timing data (1.8 GB): The data used to evaluate the speed of Kraken (and MetaPhlAn); contains three FASTA files, each containing 10,000,000 reads.
Accuracy and speed
Although we tested Kraken on real sequence data from isolated genomes, the biggest challenge for an exact alignment approach is that of maintaining sensitivity in the face of high divergence from the training data (in this case, Kraken's genomic library). To address the concern of Kraken's sensitivity with such sequences, we created a simulated metagenomic dataset containing simulated 100 bp reads with high sequencing error (2.1% SNP rate, 1.1% indel rate). Below are the results of using various classifiers on this dataset, with accuracy evaluated on a per-read basis (these results used January 2013 data to build each classifier's reference library):
Classifier | Genus precision (%) | Genus sensitivity (%) | Speed (reads/min) |
---|---|---|---|
Naïve Bayes Classifier | 97.64 | 97.64 | 7 |
PhymmBL | 96.11 | 96.11 | 76 |
PhymmBL (conf. > 0.65) | 99.08 | 95.45 | 76 |
Megablast w/ best hit | 96.93 | 93.67 | 4511 |
Kraken | 99.90 | 91.25 | 1307161 |
Kraken (quick operation) | 99.92 | 89.54 | 4101162 |
MiniKraken 2014 (Kraken w/ 4GB DB) | 99.95 | 65.87 | 1441476 |
MiniKraken 2014 (quick operation) | 99.98 | 65.31 | 2693119 |
MetaPhlAn | n/a | n/a | 370770 |
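
The speed comparisons quoted in the overview can be checked directly against the table, and the two accuracy columns can be read as simple per-read ratios. The Python sketch below does both. The speeds are copied from the table; the metric definitions (precision as correct over all classified reads, sensitivity as correct over all reads) are the usual ones and are assumed here rather than taken from this page, and the read counts in the example are hypothetical.

```python
# Speeds in reads/min, copied from the table above.
SPEED = {
    "Megablast w/ best hit": 4511,
    "Kraken": 1307161,
    "Kraken (quick operation)": 4101162,
    "MetaPhlAn": 370770,
}


def speedup(fast, slow):
    """Return how many times faster `fast` is than `slow`."""
    return SPEED[fast] / SPEED[slow]


def precision_and_sensitivity(n_correct, n_incorrect, n_total):
    """Per-read metrics in percent, under the assumed definitions.

    precision   = correct / (correct + incorrect)  (classified reads only)
    sensitivity = correct / all reads              (unclassified reads count)
    """
    n_classified = n_correct + n_incorrect
    precision = 100.0 * n_correct / n_classified if n_classified else 0.0
    sensitivity = 100.0 * n_correct / n_total
    return precision, sensitivity


vs_megablast = speedup("Kraken (quick operation)", "Megablast w/ best hit")
vs_metaphlan = speedup("Kraken (quick operation)", "MetaPhlAn")

# Hypothetical counts: 10,000 reads, 9,000 correct, 10 wrong, 990 unclassified.
precision, sensitivity = precision_and_sensitivity(9000, 10, 10000)
```

Under these definitions, a classifier that leaves uncertain reads unclassified loses sensitivity but not precision, which is the pattern the Kraken and MiniKraken rows show.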
Removing low-complexity sequences
When analyzing a metagenomics sample using a large Kraken database -- including the standard DB described in the manual -- the primary source of false positive hits is low-complexity sequences in the genomes themselves; e.g., a string of 31 or more consecutive A's. These can largely be eliminated by first running the 'dust' program on all genomes and then building the database from these 'dusted' genomes. We strongly recommend running this program, which requires a custom database build, as described in the manual. DUST is included with the BLAST program from NCBI and is described in Morgulis et al. 2006 (www.ncbi.nlm.nih.gov/pubmed/16796549).
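As a quick illustration of the simplest case named above, a run of 31 or more identical bases, the following Python sketch scans a sequence for such runs. It is not a substitute for DUST, which detects low-complexity regions more generally; the function name and the `min_len` parameter are assumptions introduced for this example.

```python
import re


def find_homopolymer_runs(seq, min_len=31):
    """Return (start, end, base) for each run of one base of length >= min_len.

    Positions are 0-based and `end` is exclusive, as in Python slicing.
    """
    # One base followed by at least (min_len - 1) repeats of that same base.
    pattern = re.compile(r"([ACGT])\1{%d,}" % (min_len - 1), re.IGNORECASE)
    return [(m.start(), m.end(), m.group(1).upper())
            for m in pattern.finditer(seq)]


# Hypothetical sequence with one run of exactly 31 A's after a 4-base prefix.
runs = find_homopolymer_runs("ACGT" + "A" * 31 + "CG")
```

A check like this can be a useful sanity test on reference genomes before a database build, but the masking itself should still be done with DUST as the manual describes.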
Kraken and other tools
Bracken allows users to perform abundance estimation with Kraken results. Bracken uses a Bayesian formula to estimate species- or genus-level abundance from Kraken classification results.
Pavian has also been developed as a comprehensive visualization program that can compare Kraken classifications across multiple samples.
KrakenTools is a suite of scripts designed to assist with downstream analysis of Kraken results. KrakenTools is an ongoing project led by Jennifer Lu.
Reference
Wood DE, Salzberg SL: Kraken: ultrafast metagenomic sequence classification using exact alignments. Genome Biology 2014, 15:R46.