How to Choose Your Metagenomics Classification Tool

Introduction

Authors: Jennifer Lu (JL), Florian P. Breitwieser (FB), Derrick E. Wood (DW), Li Song (LS), Daehwan Kim (DK), Ben Langmead (BL), Christopher Pockrandt (CP), Steven L. Salzberg (SLS)

From 2014 to 2018, the Center for Computational Biology released four metagenomics classification software packages: Kraken, KrakenUniq, Kraken 2, and Centrifuge. This page describes:

The history of each software package
The differences between each software package
The best software package for users
Additional software provided for post-processing/analyzing classification results.

#1) Introduction
#2) Software Packages
#3) Links to Software Websites & Papers
#4) General Comparison Table
#5) How to Choose
#6) About the Authors

Page Updated: 2022/09/29 by Jennifer Lu
( jlu26 jhmi edu )

Software Packages: A Brief Description

Kraken: Kraken is the first taxonomic classification software released by these authors. When released, Kraken introduced exact k-mer matching and a novel classification algorithm to achieve higher sensitivity and dramatically greater speed than previous classification programs. Kraken 1 is no longer supported and has been replaced by KrakenUniq.
KrakenUniq: KrakenUniq, released in 2018, is based on Kraken 1, using the same databases and classification algorithm. However, while Kraken provides only the read counts, KrakenUniq also determines the k-mer coverage for each taxonomic classification. This metric is very valuable in filtering out false positive reads.
KrakenUniq is only compatible with Kraken 1 databases, not Kraken 2.
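
As a minimal sketch of how k-mer counts can help with filtering, the snippet below drops taxa whose reads are supported by few distinct k-mers. The record layout (`name`, `reads`, `kmers`) and both thresholds are hypothetical and chosen only for illustration; they are not KrakenUniq's report format or recommended cutoffs.

```python
from dataclasses import dataclass


@dataclass
class TaxonCall:
    name: str   # taxon name (hypothetical field)
    reads: int  # reads assigned to the taxon
    kmers: int  # distinct k-mers observed across those reads


def filter_calls(calls, min_reads=10, min_kmers_per_read=5.0):
    """Keep taxa supported by enough reads AND enough distinct k-mers per read."""
    kept = []
    for call in calls:
        if call.reads < min_reads:
            continue  # too few reads to trust
        if call.kmers / call.reads < min_kmers_per_read:
            continue  # many reads but the same few k-mers: likely spurious
        kept.append(call)
    return kept


calls = [
    TaxonCall("Taxon A", reads=500, kmers=12000),  # many distinct k-mers
    TaxonCall("Taxon B", reads=500, kmers=40),     # same k-mers seen repeatedly
    TaxonCall("Taxon C", reads=3, kmers=90),       # too few reads
]
kept_names = [c.name for c in filter_calls(calls)]
print(kept_names)
```

In this toy input, Taxon B has as many reads as Taxon A, yet its reads cover very few distinct k-mers, which is the pattern a read count alone cannot reveal.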

Kraken 1 and KrakenUniq use very large databases, often hundreds of gigabytes in size. Until May 2022, this required a computer with enough RAM to hold the entire database. In May 2022 we released a new version (v0.7+) of KrakenUniq that can "chunk" the database into pieces that fit into the RAM of any computer, even a laptop. We strongly recommend this version, which you can install using BioConda at https://anaconda.org/bioconda/krakenuniq.
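
The idea behind chunking can be illustrated with a toy lookup table: the database is processed one slice at a time, so only one slice needs to be in memory at once. This is a conceptual sketch only, not KrakenUniq's implementation, and the function and variable names are hypothetical.

```python
def classify_in_chunks(database_items, queries, chunk_size):
    """Look up every query k-mer while holding only `chunk_size` database entries at a time."""
    results = {}
    for start in range(0, len(database_items), chunk_size):
        # Only this slice is "loaded"; a real tool would read it from disk here.
        chunk = dict(database_items[start:start + chunk_size])
        for kmer in queries:
            if kmer in chunk:
                results[kmer] = chunk[kmer]
    return results


# Toy database of (k-mer, taxon) pairs; real databases are vastly larger.
database_items = [
    ("AAAC", "taxon1"),
    ("ACGT", "taxon2"),
    ("CGTA", "taxon2"),
    ("TTGA", "taxon3"),
]
queries = ["ACGT", "TTGA", "GGGG"]

chunked = classify_in_chunks(database_items, queries, chunk_size=2)
whole = classify_in_chunks(database_items, queries, chunk_size=len(database_items))
print(chunked)
```

In this sketch the chunked and whole-database runs give the same answers; the cost of smaller chunks is one extra pass over the queries per chunk.
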
Kraken 2: Kraken 2, released in 2018, has significant memory and speed improvements while maintaining the original Kraken classification algorithm and k-mer based classification. For normal metagenomics analysis, Kraken 2 is generally superior to Kraken 1, using much less memory (RAM). It also has the unique capability to do 16S analysis. However, it has a tiny false-positive rate (erroneously assigning a read to a species) that KrakenUniq does not have. This means that for diagnosis of infections, where the goal is to identify a very small number of reads (often just a few dozen out of millions), users should use KrakenUniq.
Centrifuge: Centrifuge, released in 2016, is the first software written to address the memory issues of Kraken. Written as entirely new software by Daehwan Kim, Li Song, and Florian P Breitwieser, Centrifuge creates significantly smaller databases based on an FM-index and compression of within-species genomes. However, Centrifuge also uses an entirely different classification algorithm (described in further detail below).
Bracken: Bracken is compatible with Kraken 1, Kraken 2, and KrakenUniq as a post-processing script.
- The tools described above are taxonomic classification programs. While these programs attempt to assign each read a specific label, some reads may match k-mers shared between two distantly related taxa, causing the read to be labeled at a "higher" taxonomic level (such as Bacteria).
- For a more comprehensive picture of a sample's composition, Bracken is provided as post-processing software that estimates abundance at any desired taxonomic level (e.g. species or genus). This allows users to see the estimated composition of their sample at a particular taxonomic level.
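
To illustrate the idea of pushing reads down to a chosen level, the sketch below redistributes reads labeled only at a genus among that genus's species, in proportion to the reads already assigned to each species. This proportional rule is a deliberate simplification chosen for illustration and is not Bracken's actual statistical model; all names and counts are made up.

```python
def redistribute(genus_reads, species_reads):
    """Split genus-level reads among species in proportion to their existing read counts."""
    total = sum(species_reads.values())
    if total == 0:
        return dict(species_reads)  # nothing to base the proportions on
    return {
        species: count + genus_reads * count / total
        for species, count in species_reads.items()
    }


# Hypothetical counts: 200 reads stopped at the genus, 400 reached a species.
species_reads = {"species X": 300, "species Y": 100}
estimates = redistribute(genus_reads=200, species_reads=species_reads)
print(estimates)
```

The total number of reads is preserved; only the level at which they are reported changes.
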
Pavian: All of the above software provides per-sample output files in text format. Pavian was developed in 2016 to allow Kraken and Centrifuge users to both visualize the classification results and compare between samples.
KrakenTools: KrakenTools is an ongoing project developed as a set of tools to help pre- or post-process Kraken classification information. All tools are designed to work with any of Kraken 1, KrakenUniq, Kraken 2, or Bracken.

Links to Software Websites & Papers

For the most comprehensive understanding of each software package, please refer to the individual websites and papers:

Kraken: Kraken Website & 2014 Genome Biology Paper
KrakenUniq: KrakenUniq Github & 2018 Genome Biology Paper
Kraken 2: Kraken 2 Website & 2019 Genome Biology Paper
Centrifuge: Centrifuge Website & 2016 Genome Research Paper
Bracken [Kraken Abundance Estimation]: Bracken Website & 2017 PeerJ Paper
Pavian [Classification Visualization]: Pavian Website & Pavian Preprint Paper
KrakenTools [Pre/Post-Processing Tools]: KrakenTools Website

On September 28th, 2022, a Nature Protocols paper, "Metagenome analysis using the Kraken software suite", was published describing how the Kraken suite (Kraken 2, KrakenUniq, Bracken, and KrakenTools) can be used for 1) microbiome analysis and 2) pathogen identification.

General Comparison

	Kraken	KrakenUniq	Kraken 2	Centrifuge
First Release Date (yyyy/mm/dd)	2014/01/04	2018/05/30	2018/06/26	2016/10/04
Latest Release Date (yyyy/mm/dd)	2017/12/05	2022/09/09	2021/09/10	2021/08/16
Paper Date	2014/03/03	2018/11/18	2019/11/28	2016/10/17
Original Authors	DW/SLS	FB/SLS	DW/JL/BL	DK/LS/FB
Currently Supported?	No	Yes, FB	Yes, DW/JL	Yes, LS
Memory^A	240.8 GB	240.8 GB	34.7 GB	25.2 GB
Database Build Time^A	16 hours	16 hours	4 hours	17 hours
Processing Time (per 10 Million reads)^A	60 sec	55 sec	13 sec	70 sec
Abundance Estimation	Bracken	Bracken	Bracken	Built-in
Supported Databases	Refseq, GRCh38	Refseq, GRCh38, microbial nt	Refseq, GRCh38, nt, 16S Greengenes, 16S Silva, 16S RDP, nr protein (translated search)	Refseq, GRCh38, nt

^A Memory and times were measured for databases containing GRCh38 and Refseq bacterial/archaeal/viral sequences downloaded in Sept 2018. Database build speed was measured using 32 threads on a 48-core machine with 512 GB of memory. Processing speed was measured using 16 threads during classification on the same machine. Memory and speed were measured using each program's defaults (including the default k-mer size).
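
The relative differences implied by the table can be computed directly. The snippet below hard-codes the memory, build-time, and processing-time values from the table above and expresses each tool's figures relative to Kraken 2; the helper name `ratios_vs` is hypothetical.

```python
# Values copied from the comparison table:
# (memory in GB, database build time in hours, seconds per 10 million reads)
benchmarks = {
    "Kraken": (240.8, 16, 60),
    "KrakenUniq": (240.8, 16, 55),
    "Kraken 2": (34.7, 4, 13),
    "Centrifuge": (25.2, 17, 70),
}


def ratios_vs(baseline, table):
    """Divide every tool's values by the baseline tool's values, rounded to one decimal."""
    base = table[baseline]
    return {
        tool: tuple(round(value / ref, 1) for value, ref in zip(values, base))
        for tool, values in table.items()
    }


relative = ratios_vs("Kraken 2", benchmarks)
for tool, (mem, build, proc) in relative.items():
    print(f"{tool}: {mem}x memory, {build}x build time, {proc}x processing time")
```

Values above 1 mean the tool needs more of that resource than Kraken 2 under the conditions in the footnote; values below 1 mean it needs less.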

How to Choose

Kraken 1 is no longer supported:

While many continue to use this software, we encourage all Kraken users to upgrade to either KrakenUniq or Kraken 2.

KrakenUniq and Kraken 2 are uniquely useful depending on the project goal:

In cases where false positives can be detrimental to the overall interpretation of the results (e.g. in pathogen identification/diagnoses), KrakenUniq is best suited to help filter false positives and validate classification.
However, in cases where users are limited by speed/memory, we suggest Kraken 2. Based on the comparison table above, Kraken 2 uses 6-7x less memory than Kraken 1, builds databases 4x faster than Kraken 1 and KrakenUniq, and processes sample data in 4-5x less time.
Kraken 2 also provides additional support for the entire nt database, the 16S RDP, Greengenes, and SILVA databases, and protein databases like nr (with translated search). As of 2022, Kraken 2 also supports KrakenUniq-style k-mer counting.

Kraken 2 and Centrifuge are distinctly different, each with its own advantages:

With default settings, Centrifuge will use slightly less memory than Kraken 2. Kraken 2 relies on a probabilistic hash table for k-mers while Centrifuge uses an FM-index and within-species compression.
The classification results are also significantly different, as Centrifuge can give multiple assignments per read while Kraken 2 gives each read one taxonomic assignment.
While both rely on exact k-mer matching, Kraken 2 analyzes all k-mers of the same length in a read (35 bp k-mers by default), while Centrifuge starts with an exact match of at least 16 bp and extends it as far as possible. If Centrifuge encounters a mismatch, the program skips that base and tries to find the next exact match in the database.
Due to the complexities in the Centrifuge classification process, Kraken 2 has a far better classification speed than Centrifuge. Kraken 2 also requires less time for building a database.
Finally, Kraken 2 provides additional advantages, with more accurate abundance estimation with Bracken, and support for more databases (such as 16S databases and protein databases with translated searches).
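
The difference between the two matching strategies described above can be sketched on a toy example. The first function lists every fixed-length k-mer in a read; the second greedily extends exact matches of at least a minimum length and skips one base when extension fails. Plain substring search stands in for the FM-index, the short lengths are chosen for readability (this page cites 35 bp and 16 bp), and all names are hypothetical. Neither function reproduces the tools' actual implementations.

```python
def fixed_kmers(read, k):
    """All overlapping k-mers of length k in the read."""
    return [read[i:i + k] for i in range(len(read) - k + 1)]


def greedy_exact_matches(read, reference, min_len):
    """Greedily extend exact matches of at least min_len; skip a base at a mismatch."""
    matches = []
    i = 0
    while i + min_len <= len(read):
        if read[i:i + min_len] not in reference:
            i += 1  # no seed of the minimum length here; try the next position
            continue
        end = i + min_len
        # Extend while the longer substring still occurs somewhere in the reference.
        while end < len(read) and read[i:end + 1] in reference:
            end += 1
        matches.append(read[i:end])
        i = end + 1  # skip the base that stopped the extension
    return matches


reference = "ACGTACGTTAGC"
read = "ACGTAGGTTAGC"  # differs from the reference at one base

kmers = fixed_kmers(read, k=4)
segments = greedy_exact_matches(read, reference, min_len=4)
print(len(kmers), segments)
```

With fixed-length k-mers, every k-mer overlapping the mismatch fails to match; with the extension approach, the read is covered by fewer but longer exact matches on either side of the mismatch.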

About the Authors

Jennifer Lu (JL) is a Staff Scientist at Johns Hopkins University in the Center for Computational Biology in Steven Salzberg's and Trish's labs. She maintains the Bracken and KrakenTools software packages and works alongside Derrick Wood and Ben Langmead to maintain Kraken 2. (Jennifer Lu's webpage )

Florian P Breitwieser (FB) is a former post-doctoral researcher at Johns Hopkins University in Steven Salzberg's Lab. He is one of the original authors of Centrifuge and is the author of KrakenUniq and Pavian. (Florian Breitwieser's former Hopkins webpage)

Derrick E Wood (DW) received his PhD in 2014 from his work with Steven Salzberg on Kraken at the University of Maryland. For his post-doctoral work, Derrick worked with Ben Langmead in Johns Hopkins Computer Science to develop Kraken 2. (Derrick Wood's former Hopkins webpage)

Li Song (LS) received his PhD in 2018 working with Liliana Florea at Johns Hopkins University in the Computer Science Department. He is now a post-doctoral researcher at the Dana-Farber Cancer Institute in Shirley Liu’s lab. He is one of the original authors of Centrifuge and continues to maintain and update the software.

Daehwan Kim (DK) received his PhD at the University of Maryland in Steven Salzberg's lab, and then conducted post-doctoral research with Salzberg at Johns Hopkins University, during which he developed the HISAT and HISAT2 spliced alignment programs. He wrote Centrifuge alongside Florian Breitwieser and Li Song. He is now an Assistant Professor at the University of Texas Southwestern Medical Center. (Kim Lab webpage)

Christopher Pockrandt (CP) was a postdoctoral researcher in Steven Salzberg's lab from 2019 through June of 2022. He developed and implemented the memory-chunking algorithm that allows KrakenUniq to run on low-memory computers.

Natalia Rincon (NR) is a current Ph.D. student in Biomedical Engineering in Steven Salzberg's lab. She is the author of the diversity scripts for the KrakenTools suite and is one of the co-first authors for the Kraken metagenome protocol paper.

Martin Steinegger (MS) is an Assistant Professor in the Biology Department at Seoul National University. He is a former postdoctoral researcher in Steven Salzberg's lab. He incorporated the KrakenUniq k-mer counting features into Kraken 2. He also led the effort for the Kraken Nature Protocols paper. (Steinegger Lab webpage)

Ben Langmead (BL) is an Associate Professor at Johns Hopkins University in the Department of Computer Science. He is the primary advisor to the Kraken 2 project. (Langmead Lab webpage)

Steven L Salzberg (SLS) is the Bloomberg Distinguished Professor of Biomedical Engineering, Computer Science, and Biostatistics at Johns Hopkins University. He is/was the primary advisor for the students and postdocs who developed Kraken 1, Centrifuge, KrakenUniq, Bracken, and Pavian. (Salzberg Lab webpage)