Abundance Estimation Overview

Abundance estimation with Bracken relies on the following three preprocessing steps:

  1. Kraken Classification of DNA Reads
  2. Classifying a Full Kraken Database
  3. Generation of Kmer Distribution File

Step #1 must be completed for each dataset prior to abundance estimation. Bracken relies on the Kraken classification of reads and the generation of the corresponding kraken-report file. The kraken-report file produced is the input file used by Bracken.
Steps #2 and #3 are completed once for each Kraken database used in classification. If multiple datasets are classified against the same Kraken database, the kmer distribution file generated in Step #3 can be reused across datasets.

Classifying a Full Kraken Database

Bracken relies on Bayesian probabilities derived from the Kraken classification of every read-length kmer from all genomes in the Kraken database. Therefore, prior to abundance estimation with Bracken, each genome in the Kraken database must be divided into read-length kmers, and each of those kmers must be classified.
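To make "read-length kmers" concrete, the following minimal Python sketch enumerates every window of a given read length along a sequence. It is an illustration only: the function name read_length_kmers and the toy sequence are hypothetical, and in practice this work is done by the scripts described in Steps #2a and #2b, not by this code.

```python
def read_length_kmers(sequence, read_length):
    """Yield every substring of `read_length` bases from `sequence`, in order.

    A sequence of length N contains N - read_length + 1 such windows
    (none if the sequence is shorter than the read length).
    """
    for start in range(len(sequence) - read_length + 1):
        yield sequence[start:start + read_length]


# Toy sequence for illustration only (not real data).
genome = "ACGTACGTAC"
windows = list(read_length_kmers(genome, 4))
print(len(windows))   # 7 windows: 10 - 4 + 1
print(windows[:3])    # ['ACGT', 'CGTA', 'GTAC']
```

Each such window plays the role of a "perfect read" from the genome, which is what Step #2b classifies.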

For this step, it is assumed that ${KRAKEN_DB} is the path to a built Kraken database and that all sequences are stored as '.fna' files in the ${KRAKEN_DB}/library directory.

Step #2a: Search all library input sequences against the database using:

kraken --db=${KRAKEN_DB} --fasta-input --threads=10 <( find -L ${KRAKEN_DB}/library -name "*.fna" -exec cat {} + ) > database.kraken


Step #2b: Compute the classifications for each perfect read of ${READ_LENGTH} base pairs from one of the input sequences.

perl count-kmer-abundances.pl --db=${KRAKEN_DB} --read-length=${READ_LENGTH} --threads=10 database.kraken > database${READ_LENGTH}mers.kraken_cnts

Generation of Kmer Distribution File

A preprocessing step converts the Kraken classifications of the database's read-length kmers into a kmer distribution file. For this step, run the following from the command line:

python generate_kmer_distribution.py -i INPUT_FILE.TXT -o OUTPUT_FILE.TXT

To view the help menu, run the following from the command line:

python generate_kmer_distribution.py --help

Arguments (Required):

  • INPUT_FILE.TXT :: the Kraken counts file for all genomes classified against the Kraken database (the output of Step #2b).
  • OUTPUT_FILE.TXT :: the desired name of the output file to be generated by the code.

See File Formats for the required format of each input/output file.

Abundance Estimation

Run Bracken with the following from the command line:

python estimate_abundance.py -i KRAKEN.REPORT -k KMER_DISTR.TXT -o OUTPUT_FILE.TXT [-l CLASSIFICATION_LEVEL -t THRESHOLD]

To view the help menu, run the following from the command line:

python estimate_abundance.py --help

Arguments (Required):

  • KRAKEN.REPORT:: the Kraken report generated for a given dataset.
  • KMER_DISTR.TXT:: the file generated by generate_kmer_distribution.py.
  • OUTPUT_FILE.TXT:: the desired name of the output file to be generated by the code.

See File Formats for the required format of each input/output file.

Additional Options:

  • CLASSIFICATION_LEVEL [Default = 'S', Options = 'K','P','C','O','F','G','S']:: specifies the taxonomic rank to analyze. Each classification at this specified rank will receive an estimated number of reads belonging to that rank after abundance estimation.
  • THRESHOLD [Default = 10]:: specifies the minimum number of reads required for a classification at the specified rank. Any classification with fewer reads than the threshold will not receive additional reads from higher taxonomy levels when reads are distributed for abundance estimation.
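When Bracken is run over many datasets, it can help to assemble the command line programmatically. The sketch below builds the argument list using only the flags documented above (-i, -k, -o, -l, -t). The helper name build_estimate_command and the file names in the example are hypothetical, and the sketch assumes estimate_abundance.py is in the current directory.

```python
# Rank codes listed under CLASSIFICATION_LEVEL above.
VALID_LEVELS = {"K", "P", "C", "O", "F", "G", "S"}


def build_estimate_command(report, kmer_distr, output, level=None, threshold=None):
    """Return the argument list for one estimate_abundance.py run.

    `level` and `threshold` are optional; when omitted, the script's own
    defaults ('S' and 10, per the documentation above) apply.
    """
    cmd = ["python", "estimate_abundance.py",
           "-i", report, "-k", kmer_distr, "-o", output]
    if level is not None:
        if level not in VALID_LEVELS:
            raise ValueError(f"unknown classification level: {level!r}")
        cmd += ["-l", level]
    if threshold is not None:
        cmd += ["-t", str(threshold)]
    return cmd


# Hypothetical file names, for illustration only.
cmd = build_estimate_command("sample1.report", "kmer_distr.txt",
                             "sample1.bracken", level="G", threshold=20)
print(" ".join(cmd))
```

The resulting list can be passed to subprocess.run(cmd, check=True) from the Python standard library. Because the kmer distribution file depends only on the Kraken database, the same kmer_distr argument can be reused for every dataset classified against that database.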

File Formats

Kraken Report File Format:
(bracken input file; tab-delimited columns)

  • Percentage of reads for the subtree rooted at this taxon
  • Total number of reads for the subtree rooted at this taxon
  • Number of reads assigned directly to this taxon level
  • Taxonomical Rank [U, -, D, P, C, O, F, G, S]
  • NCBI Taxonomy ID
  • Scientific Name - Indented (2 spaces per level)

[Note: The Kraken report format is specified in more detail in the Kraken README]
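As an illustration of the six columns listed above, here is a minimal Python sketch that parses one line of a Kraken report. The names ReportRow and parse_report_line are hypothetical, the sample line uses made-up read counts, and the depth calculation assumes the two-spaces-per-level indentation described above.

```python
from typing import NamedTuple


class ReportRow(NamedTuple):
    percent: float      # percentage of reads in the subtree rooted at this taxon
    clade_reads: int    # total reads in the subtree rooted at this taxon
    direct_reads: int   # reads assigned directly to this taxon
    rank: str           # one of U, -, D, P, C, O, F, G, S
    taxid: int          # NCBI Taxonomy ID
    name: str           # scientific name with indentation removed
    depth: int          # indentation level (2 spaces per level)


def parse_report_line(line):
    """Parse one tab-delimited Kraken report line into a ReportRow."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 6:
        raise ValueError(f"expected 6 tab-delimited columns, got {len(fields)}")
    percent, clade, direct, rank, taxid, raw_name = fields
    name = raw_name.lstrip(" ")
    depth = (len(raw_name) - len(name)) // 2
    return ReportRow(float(percent), int(clade), int(direct),
                     rank, int(taxid), name, depth)


# Made-up example line (the counts are illustrative, not real results).
row = parse_report_line("50.00\t500\t20\tG\t561\t    Escherichia\n")
print(row.rank, row.name, row.depth)
```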

Kraken Counts File Format:
(generate_kmer_distribution.py input file; tab-delimited columns)

  • Identification string for the read
  • NCBI Taxonomy ID of the genome being classified
  • NCBI Taxonomy ID of the classification Kraken assigned to the full read
  • Sequence length of the read
  • NCBI Taxonomy ID of the classification Kraken assigned to each kmer within the read

Kmer Distribution File Format:
(generate_kmer_distribution.py output file; bracken input file; tab-delimited columns)

  • NCBI Taxonomy ID for each Taxon Classification
  • Space-delimited list of genomes with kmers classified at this taxon.
    For each genome, the following information is listed, separated by colons ":"
    • NCBI Taxonomy ID for each genome
    • Number of kmers from the genome that are classified at this taxon
    • Total number of kmers in the genome
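The nested layout above (tab, then spaces, then colons) can be easier to follow in code. The following minimal Python sketch parses one line of a kmer distribution file as described. The function name parse_kmer_distribution_line is hypothetical, the sample values are made up, and the sketch does not handle a header line, should the file contain one.

```python
def parse_kmer_distribution_line(line):
    """Parse one line into (taxon_id, {genome_taxid: (kmers_at_taxon, total_kmers)})."""
    taxid_field, genomes_field = line.rstrip("\n").split("\t", 1)
    genomes = {}
    # Genomes are space-delimited; each entry holds three colon-separated integers.
    for entry in genomes_field.split():
        genome_taxid, kmers_at_taxon, total_kmers = entry.split(":")
        genomes[int(genome_taxid)] = (int(kmers_at_taxon), int(total_kmers))
    return int(taxid_field), genomes


# Made-up example: two genomes with kmers classified at taxon 561.
taxon, genomes = parse_kmer_distribution_line("561\t562:120:1000 564:30:800\n")
for genome_taxid, (at_taxon, total) in genomes.items():
    # Fraction of this genome's kmers that Kraken classified at `taxon`.
    print(taxon, genome_taxid, at_taxon / total)
```

The per-genome fraction computed in the loop (kmers classified at the taxon divided by total kmers in the genome) is the quantity the two counts in each entry make available.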

Bracken Output File Format:
(bracken output file; tab-delimited columns)

  • Name
  • Taxonomy ID
  • Level ID (K=Kingdom, P=Phylum, C=Class, O=Order, F=Family, G=Genus, S=Species)
  • Kraken Assigned Reads
  • Added Reads with Abundance Reestimation
  • Total Reads after Abundance Reestimation
  • Fraction of Total Reads
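For downstream analysis, the seven columns above can be read into a simple record. This is a minimal Python sketch: the function name parse_bracken_line and the dictionary keys are hypothetical, the sample line is made up, and a header line (if the output file has one) would need to be skipped before calling it.

```python
def parse_bracken_line(line):
    """Parse one tab-delimited Bracken output line into a dict."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 7:
        raise ValueError(f"expected 7 tab-delimited columns, got {len(fields)}")
    name, taxid, level, assigned, added, total, fraction = fields
    return {
        "name": name,
        "taxid": int(taxid),
        "level": level,                        # e.g. 'S' for species
        "kraken_assigned_reads": int(assigned),
        "added_reads": int(added),             # reads added by reestimation
        "total_reads": int(total),             # reads after reestimation
        "fraction": float(fraction),           # fraction of total reads
    }


# Made-up example line (illustrative values, not real results).
record = parse_bracken_line("Escherichia coli\t562\tS\t900\t100\t1000\t0.25\n")
print(record["name"], record["total_reads"], record["fraction"])
```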