Installation
The install_bracken.sh script is provided to ensure all scripts can be run as described. To use this script, run:
sh install_bracken.sh
Alternatively, add bracken, bracken-build and the scripts in the src/ folder to your PATH. In that case, you must also compile the source by running:
cd src/ && make
Bracken Overview
Prior to calculating abundance for a given sample, the user must:
- Step 0: Build a Kraken 1.0 or Kraken 2.0 Database
- Step 1a: Classify a Full Kraken Database
- Step 1b/1c: Generate the Kmer Distribution File
Steps 1a/1b/1c can be completed by running the new bracken-build script or be run individually by the user.
The remaining steps required for abundance estimation are:
- Step 2a: Classify a Sample using Kraken 1.0/2.0
- Step 2b: Generate a Kraken Report file
- Step 3: Run Bracken/est_abundance.py
Step 0: Build a Kraken Database
Bracken is compatible with Kraken 1.0 or Kraken 2.0. The basic commands for building a database are:
kraken-build --build --db=${KRAKEN_DB} --threads=10
kraken2-build --build --db=${KRAKEN2_DB} --threads=10
For this step, it is assumed that ${KRAKEN_DB} (or ${KRAKEN2_DB}) contains both a library folder (containing all database sequences) and a taxonomy folder. Additional details for building a Kraken database can be found in the Kraken documentation.
Step 1: Bracken-build
Bracken relies on Bayesian probabilities derived from the Kraken classification of each read-length kmer from all genomes within the Kraken database. Therefore, prior to abundance estimation with Bracken, each genome in the Kraken database must be divided into read-length kmers, and each of those kmers must be classified.
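As a conceptual illustration of the "read-length kmers" described above, the following Python sketch enumerates every window of READ_LEN bases in a sequence. It is not how bracken-build is implemented, and the function name read_length_windows is hypothetical. It only pictures the units that get classified:

```python
from typing import Iterator


def read_length_windows(sequence: str, read_len: int = 100) -> Iterator[str]:
    """Yield every contiguous window of read_len bases in sequence.

    Hypothetical helper for illustration only. A sequence shorter than
    read_len yields nothing.
    """
    if read_len <= 0:
        raise ValueError("read_len must be positive")
    for start in range(len(sequence) - read_len + 1):
        yield sequence[start:start + read_len]


# Toy example with a very short "genome" and read_len=4.
windows = list(read_length_windows("ACGTAC", read_len=4))
print(windows)  # ['ACGT', 'CGTA', 'GTAC']
```

A sequence of length N therefore contributes N - READ_LEN + 1 such windows, which is why the kmer distribution file depends on the read length chosen at build time.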
Run either Step 1 or Steps 1a/1b/1c combined. Step 1 will perform Steps 1a/1b/1c using a single script.
Step 1: bracken-build, a combination of Steps 1a/1b/1c in a single bash script. [Note: the bracken and bracken-build scripts currently accept only short-format options.]
./bracken-build -d ${KRAKEN_DB} -t ${THREADS} -k ${KMER_LEN} -l ${READ_LEN} -x ${KRAKEN_INSTALLATION}
- ${KRAKEN_DB} = built Kraken database [must have library/taxonomy folders]
- Note: If using Kraken 1.0, this script will check for .kdb/.idb files and seqid2taxid.map. If using Kraken 2.0, this script will check for taxo.k2d and seqid2taxid.map files.
- ${KMER_LEN} = length of kmer used to build the database [default: 35]
- Note: Kraken 1 uses k=31 by default while Kraken 2 uses k=35.
- ${READ_LEN} = ideal length of reads in your sample [default: 100]
- ${KRAKEN_INSTALLATION} = path to kraken or kraken2 executables.
- Note: This parameter is NOT required. If not specified, the script will check for kraken2 and then kraken in the user's PATH. If neither is found, the program will return an error. If both kraken and kraken2 are installed, the script will use kraken2 by default.
- ${THREADS} = number of threads to use for parallel processing of the library files.
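For scripted pipelines, the bracken-build call above can be assembled programmatically. The following is a minimal Python sketch. The helper name bracken_build_cmd and the example database path are hypothetical, and only the short options listed above are used:

```python
import shlex
from typing import List, Optional


def bracken_build_cmd(
    kraken_db: str,
    threads: int,
    kmer_len: int = 35,
    read_len: int = 100,
    kraken_installation: Optional[str] = None,
) -> List[str]:
    """Return the argument list for a bracken-build call (hypothetical helper)."""
    cmd = [
        "./bracken-build",
        "-d", kraken_db,
        "-t", str(threads),
        "-k", str(kmer_len),
        "-l", str(read_len),
    ]
    # -x is optional: without it the script looks for kraken2/kraken on PATH.
    if kraken_installation is not None:
        cmd += ["-x", kraken_installation]
    return cmd


# "/data/kraken2_db" is a placeholder path, not a real default.
cmd = bracken_build_cmd("/data/kraken2_db", threads=8)
print(shlex.join(cmd))
# To actually run it: subprocess.run(cmd, check=True)
```

Building the command as a list (rather than one string) avoids shell-quoting problems if the database path contains spaces.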
Step #1a: Search all library input sequences against the database using:
kraken --db=${KRAKEN_DB} --threads=10 <( find -L ${KRAKEN_DB}/library \( -name "*.fna" -o -name "*.fasta" -o -name "*.fa" \) -exec cat {} + ) > database.kraken
kraken2 --db=${KRAKEN2_DB} --threads=10 <( find -L ${KRAKEN2_DB}/library \( -name "*.fna" -o -name "*.fasta" -o -name "*.fa" \) -exec cat {} + ) > database.kraken
Step #1b: Compute the classifications for each perfect read of ${READ_LEN} base pairs from one of the input sequences.
./kmer2read_distr --seqid2taxid ${KRAKEN_DB}/seqid2taxid.map --taxonomy ${KRAKEN_DB}/taxonomy --kraken database.kraken --output database${READ_LEN}mers.kraken -k ${KMER_LEN} -l ${READ_LEN} -t ${THREADS}
Step #1c: Generate the kmer distribution file from the output of Step 1b.
python generate_kmer_distribution.py -i database${READ_LEN}mers.kraken -o database${READ_LEN}mers.kmer_distrib
Step 2: Kraken-Classify a Sample AND Generate a Report File
Bracken is compatible with Kraken 1.0 or Kraken 2.0.
The basic commands for classification and report generation for Kraken 1.0 are:
kraken --db=${KRAKEN_DB} --threads ${THREADS} ${SAMPLE} > ${SAMPLE}.kraken
kraken-report --db=${KRAKEN_DB} ${SAMPLE}.kraken > ${SAMPLE}.kreport
Kraken 2.0 combines report generation into the main classification command with the addition of flag --report:
kraken2 --db=${KRAKEN2_DB} --threads ${THREADS} --report ${SAMPLE}.kreport2 ${SAMPLE} > ${SAMPLE}.kraken2
Step 3: Bracken [Abundance Estimation]
Bracken can be run using either the bracken shell script or the est_abundance.py Python script. The shell script will check that the database contains all necessary files. The Python script requires the exact path to the database files.
Option 1: bracken:
bracken -d ${KRAKEN_DB} -i ${SAMPLE}.kreport -o ${SAMPLE}.bracken -r ${READ_LEN} -l ${CLASSIFICATION_LEVEL} -t ${THRESHOLD}
Option 2: est_abundance.py:
python est_abundance.py -i ${SAMPLE}.kreport -k ${KRAKEN_DB}/database${READ_LEN}mers.kmer_distrib -o ${SAMPLE}.bracken [-l ${CLASSIFICATION_LEVEL} -t ${THRESHOLD}]
Arguments (Required):
- ${SAMPLE}.kreport: the Kraken report generated for a given dataset
- ${KRAKEN_DB}/database${READ_LEN}mers.kmer_distrib: the file generated by generate_kmer_distribution.py (or by bracken-build)
- ${SAMPLE}.bracken: the desired name of the output file to be generated by the code
See File Formats for the required format of each input/output file
Additional Options:
- CLASSIFICATION_LEVEL [Default = 'S', Options = 'D','P','C','O','F','G','S']: specifies the taxonomic rank to analyze. Each classification at this specified rank will receive an estimated number of reads belonging to that rank after abundance estimation.
- THRESHOLD [Default = 10]: specifies the minimum number of reads required for a classification at the specified rank. Any classification with fewer reads than the specified threshold will not receive additional reads from higher taxonomy levels when distributing reads for abundance estimation.
Note: This script is not multi-threaded.
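To illustrate how THRESHOLD interacts with read redistribution, here is a deliberately simplified Python sketch. It assumes that reads held at a parent taxon are shared among child taxa in proportion to (fraction of the child's read-length kmers that classify at the parent) x (reads already assigned to the child), and that children below THRESHOLD are excluded. The function name, the weighting scheme and the example numbers are illustrative assumptions, not Bracken's actual implementation:

```python
from typing import Dict


def redistribute(
    parent_reads: float,
    child_reads: Dict[str, int],
    frac_at_parent: Dict[str, float],
    threshold: int = 10,
) -> Dict[str, float]:
    """Share parent_reads among children (simplified, hypothetical model).

    child_reads:    reads Kraken assigned directly to each child taxon
    frac_at_parent: per child, fraction of its read-length kmers that
                    classify at the parent instead of at the child
    """
    added = {name: 0.0 for name in child_reads}
    # Children below the threshold receive no additional reads.
    eligible = {n: r for n, r in child_reads.items() if r >= threshold}
    # Weight = likelihood-like term times a prior based on assigned reads.
    weights = {n: frac_at_parent.get(n, 0.0) * r for n, r in eligible.items()}
    total_weight = sum(weights.values())
    if total_weight > 0:
        for name, weight in weights.items():
            added[name] = parent_reads * weight / total_weight
    return added


# Made-up numbers: species C has only 5 reads, below the default threshold.
example = redistribute(
    parent_reads=100,
    child_reads={"A": 300, "B": 100, "C": 5},
    frac_at_parent={"A": 0.25, "B": 0.25, "C": 0.5},
)
print(example)
```

In this toy case species C is skipped even though half of its kmers classify at the parent, because it falls below the threshold.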
File Formats
Kraken Report File Format:
(bracken input file; tab-delimited columns)
- Percentage of reads for the subtree rooted at this taxon
- Total number of reads for the subtree rooted at this taxon
- Number of reads assigned directly to this taxon level
- Taxonomical Rank [U, -, D, P, C, O, F, G, S]
- NCBI Taxonomy ID
- Scientific Name - Indented (2 spaces per level)
[Note: The Kraken report format is specified in more detail in the Kraken README]
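The six columns above can be read with a few lines of Python. This is a minimal sketch that assumes exactly six tab-delimited columns as listed. The names ReportRow and parse_report_line are hypothetical, and the example line is made up:

```python
from typing import NamedTuple


class ReportRow(NamedTuple):
    percent: float      # percentage of reads in the subtree rooted here
    clade_reads: int    # total reads in the subtree rooted here
    direct_reads: int   # reads assigned directly to this taxon
    rank: str           # U, -, D, P, C, O, F, G or S
    taxid: int          # NCBI Taxonomy ID
    name: str           # scientific name, indentation removed
    depth: int          # tree depth inferred from 2-space indentation


def parse_report_line(line: str) -> ReportRow:
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 6:
        raise ValueError(f"expected 6 tab-delimited columns, got {len(fields)}")
    pct, clade, direct, rank, taxid, raw_name = fields
    name = raw_name.lstrip(" ")
    depth = (len(raw_name) - len(name)) // 2
    return ReportRow(float(pct), int(clade), int(direct), rank.strip(),
                     int(taxid), name.rstrip(), depth)


# Made-up example line, not real Kraken output.
row = parse_report_line("50.00\t500\t20\tG\t561\t    Escherichia\n")
print(row.name, row.rank, row.depth)  # Escherichia G 2
```

Recovering the depth from the indentation makes it possible to rebuild the parent/child relationships between consecutive rows.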
Kraken Counts File Format:
(generate_kmer_distribution.py input file; tab-delimited columns)
- Identification string for the read
- NCBI Taxonomy ID of the genome being classified
- NCBI Taxonomy ID of the classification Kraken assigned to the full read
- Sequence length of the read
- NCBI Taxonomy ID of the classification Kraken assigned to each kmer within the read
Kmer Distribution File Format:
(generate_kmer_distribution.py output file; bracken input file; tab-delimited columns)
- NCBI Taxonomy ID for each Taxon Classification
- Space-delimited list of genomes with kmers classified at this taxon.
For each genome, the following information is listed, separated by colons (":"):
- NCBI Taxonomy ID for each genome
- Number of kmers from the genome that are classified at this taxon
- Total number of kmers in the genome
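As a sketch of how one line of this format could be read, the following Python function splits the taxon ID from the space-delimited genome entries and then splits each entry on colons. The function name and the example line are hypothetical, and any header line in a real file would need to be skipped first (an assumption, since the description above does not mention one):

```python
from typing import Dict, Tuple


def parse_kmer_distrib_line(line: str) -> Tuple[int, Dict[int, Tuple[int, int]]]:
    """Return (taxon_id, {genome_taxid: (kmers_at_taxon, total_genome_kmers)})."""
    taxid_field, genomes_field = line.rstrip("\n").split("\t", 1)
    genomes: Dict[int, Tuple[int, int]] = {}
    for entry in genomes_field.split():
        genome_taxid, kmers_here, kmers_total = entry.split(":")
        genomes[int(genome_taxid)] = (int(kmers_here), int(kmers_total))
    return int(taxid_field), genomes


# Made-up example line: taxon 561 with kmers from two genomes.
taxid, genomes = parse_kmer_distrib_line("561\t562:1200:4600000 564:300:4100000\n")
for genome_taxid, (here, total) in genomes.items():
    print(genome_taxid, here / total)
```

The ratio of the second to the third field is the fraction of a genome's read-length kmers that Kraken classifies at the listed taxon.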
Bracken Output File Format:
(bracken output file; tab-delimited columns)
- Name
- Taxonomy ID
- Level ID (S=Species, G=Genus, O=Order, F=Family, P=Phylum, K=Kingdom)
- Kraken Assigned Reads
- Added Reads with Abundance Reestimation
- Total Reads after Abundance Reestimation
- Fraction of Total Reads
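The seven output columns can be loaded for downstream analysis with a short parser. This is a minimal Python sketch assuming exactly the columns listed above, with integer read counts. BrackenRow and parse_bracken_line are hypothetical names, the example line is made up, and a header row (if the file has one) must be skipped before parsing:

```python
from typing import NamedTuple


class BrackenRow(NamedTuple):
    name: str
    taxid: int
    level: str          # e.g. S for species, G for genus
    kraken_reads: int   # reads Kraken assigned
    added_reads: int    # reads added by abundance re-estimation
    total_reads: int    # reads after abundance re-estimation
    fraction: float     # fraction of total reads


def parse_bracken_line(line: str) -> BrackenRow:
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 7:
        raise ValueError(f"expected 7 tab-delimited columns, got {len(fields)}")
    name, taxid, level, kraken, added, total, fraction = fields
    return BrackenRow(name, int(taxid), level, int(kraken), int(added),
                      int(total), float(fraction))


# Made-up example line, not real Bracken output.
row = parse_bracken_line("Escherichia coli\t562\tS\t900\t100\t1000\t0.5\n")
print(row.name, row.total_reads, row.fraction)
```

Keeping the Kraken-assigned and added read counts as separate fields makes it easy to see how much of each estimate comes from re-estimation.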