KrakenTools

Extracting Kraken Reads

After running Kraken, Kraken2, or KrakenUniq, users may use the extract_kraken_reads.py program to extract the FASTA or FASTQ reads classified as a specific taxonomy ID. For example, this program can be used to extract all bacterial reads or only reads assigned to Escherichia coli. Users must provide (at minimum) the original sequence file(s), at least one taxonomy ID, and the Kraken output file.

Usage/Options

This section outlines the basic usage of the extract_kraken_reads.py script along with all available parameters. Example usage of these parameters is detailed in the remaining sections, and some parameters are explained further in their own sections below.
python extract_kraken_reads.py

-k, --kraken SAMPLE.KRAKEN......... Kraken output file

-s, -s1, -1, -U SEQUENCE.FILE...... FASTA/FASTQ sequence file (may be gzipped)

-s2, -2 SEQUENCE_2.FILE............ FASTA/FASTQ sequence file (for paired reads, may be gzipped)

-o, --output READS.FASTA........... output FASTA/FASTQ file with extracted sequences

-t, --taxid TID TID2............... list of taxonomy IDs to extract (separated by spaces)

Optional Parameters:

-o2, --output2 READS_2.FASTA....... second output FASTA/FASTQ file (required for paired input)

--fastq-output..................... produces FASTQ files for extracted reads (requires FASTQ input)

--exclude.......................... instead of finding reads matching specified taxids, searches for reads NOT matching specified taxids.

-r, --report SAMPLE.KREPORT........ Kraken report file (required if specifying --include-children or --include-parents)

--include-children................. include reads classified at more specific levels than specified taxid levels.

--include-parents.................. include reads classified at all taxonomy levels between the root and specified taxid levels.

--max #............................ maximum number of reads to save (e.g. if user wants only 10 reads extracted)

--append........................... if output file exists, appends reads (specifying this does not check formatting of output file)

--noappend......................... [default] rewrites existing output file

Input Files

-k, --kraken SAMPLE.KRAKEN

User should specify the standard Kraken output file. This file contains the following tab-delimited columns:

"C"/"U" letter code to indicate classified/unclassified
sequence ID (obtained from the FASTA/FASTQ header)
taxonomy ID assigned to the sequence by Kraken
length of sequence in bp (paired: lengths separated by |)
space-delimited list of LCA mapping of k-mers

The script uses only the 2nd and 3rd columns to extract the correct sequences. The extract_kraken_reads.py script does not accept the Kraken report format or Kraken mpa-style output.
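The column selection described above can be sketched as follows. This is a minimal illustration, not the script's actual implementation: the helper name `read_ids_for_taxids` is hypothetical, the sample lines are made up, and the sketch assumes the third column holds a bare numeric taxonomy ID.

```python
import io


def read_ids_for_taxids(kraken_lines, taxids):
    """Return the set of sequence IDs whose assigned taxonomy ID is in `taxids`.

    Hypothetical helper: only columns 2 (sequence ID) and 3 (taxonomy ID)
    of each tab-delimited line are consulted.
    """
    wanted = {str(t) for t in taxids}
    selected = set()
    for line in kraken_lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 3:
            continue  # skip blank or malformed lines
        seq_id, taxid = fields[1], fields[2].strip()
        if taxid in wanted:
            selected.add(seq_id)
    return selected


# Made-up Kraken output lines, for illustration only.
example = io.StringIO(
    "C\tread1\t562\t150\t562:116\n"
    "C\tread2\t9606\t150\t9606:116\n"
    "U\tread3\t0\t150\t0:116\n"
)
print(sorted(read_ids_for_taxids(example, [562])))
```

The resulting set of sequence IDs is what would then be matched against the headers of the input FASTA/FASTQ file.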

-s, -s1, -1, -U SEQUENCE.FILE/-s2, -2 SEQUENCE_2.FILE

Input sequence files must be either FASTQ or FASTA files. Input files may be gzipped. The program will detect whether the file is gzipped based on the file extension. The program will detect the FASTQ or FASTA format based on the first character in the file (">" for FASTA, "@" for FASTQ).
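The two detection rules above (file extension for gzip, first character for the format) could look roughly like this in Python. The function names `open_sequence_file` and `detect_format` are hypothetical, and treating a `.gz` suffix as the gzip signal is an assumption about which extension is checked.

```python
import gzip
import os
import tempfile


def open_sequence_file(path):
    """Open a sequence file as text, using gzip if the name ends in '.gz'."""
    if path.endswith(".gz"):
        return gzip.open(path, "rt")
    return open(path, "r")


def detect_format(path):
    """Return 'fasta' or 'fastq' based on the first character of the file."""
    with open_sequence_file(path) as handle:
        first = handle.read(1)
    if first == ">":
        return "fasta"
    if first == "@":
        return "fastq"
    raise ValueError(f"{path}: expected '>' or '@' as first character, got {first!r}")


# Demonstration on temporary files with made-up content.
with tempfile.TemporaryDirectory() as tmp:
    fa = os.path.join(tmp, "reads.fa")
    fq_gz = os.path.join(tmp, "reads.fq.gz")
    with open(fa, "w") as out:
        out.write(">read1\nACGT\n")
    with gzip.open(fq_gz, "wt") as out:
        out.write("@read1\nACGT\n+\nIIII\n")
    print(detect_format(fa), detect_format(fq_gz))
```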

Paired Input/Output

-o, --output READS.FASTA/-o2, --output2 READS_2.FASTA

Users who ran Kraken on paired reads should provide both read files and two output file names. For example:
python extract_kraken_reads.py -k sample.kraken -s1 read1.fq -s2 read2.fq -o extracted_1.fa -o2 extracted_2.fa

`--exclude` parameter

By default, reads classified at specified taxonomy IDs (and IDs included using --include-parents/--include-children) will be extracted. However, specifying --exclude will cause reads NOT classified at the specified taxonomy IDs to be extracted.

For example:

extract_kraken_reads.py ... --taxid 9606 --exclude :: extracts all non-human reads
extract_kraken_reads.py ... --taxid 2 --exclude --include-children :: extracts all non-bacterial reads (excludes all reads classified at any classification in the Bacteria subtree)
extract_kraken_reads.py ... --taxid 9606 --exclude --include-parents :: extracts all reads NOT classified as human (taxid 9606) OR any classification in the direct ancestry of human (e.g. will exclude reads classified at the Primates, Chordata, or Eukaryota levels).
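The effect of --exclude amounts to inverting the match test, which the sketch below illustrates. The function `keep_read` and the read-to-taxid assignments are hypothetical; `selected_taxids` stands for the specified taxonomy IDs after any expansion by --include-parents or --include-children.

```python
def keep_read(taxid, selected_taxids, exclude=False):
    """Decide whether a read assigned to `taxid` should be extracted."""
    matches = taxid in selected_taxids
    # With exclude=True the test is inverted: keep only non-matching reads.
    return (not matches) if exclude else matches


# Made-up assignments: read ID -> taxonomy ID.
assignments = {"read1": 9606, "read2": 562, "read3": 2}
selected = {9606}

default_ids = [r for r, t in assignments.items() if keep_read(t, selected)]
excluded_ids = [r for r, t in assignments.items() if keep_read(t, selected, exclude=True)]
print(default_ids)   # reads classified as human
print(excluded_ids)  # reads not classified as human
```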

`--include-parents`/`--include-children` parameters

By default, only reads classified exactly at the specified taxonomy IDs will be extracted. Options --include-parents and --include-children can be used to extract reads classified within the same lineage as a specified taxonomy ID.

If users specify either of these options, a Kraken report file must also be provided. The Kraken report file must have the following tab-delimited columns:

percentage of total reads at this classification
number of reads rooted at this classification
number of reads assigned directly to this classification
rank code signaling type of classification (e.g. S = species, G = genus, etc)
taxonomy ID
indented scientific name (number of spaces reflecting tree structure)

Currently, this script does not accept reports containing additional columns (as produced by KrakenUniq or Kraken 2's --report-minimizer-data), nor does it accept mpa-style reports (as produced when specifying --use-mpa-style).

Example Usage: Given a Kraken report containing the following...

        
            [%]     [reads]     [lreads]    [lvl]   [tid]     [name]
            100     1000        0           R       1         root
            100     1000        0           R1      131567      cellular organisms
            100     1000        50          D       2             Bacteria
            95      950         0           P       1224            Proteobacteria
            95      950         0           C       1236              Gammaproteobacteria
            95      950         0           O       91347               Enterobacterales
            95      950         0           F       543                   Enterobacteriaceae
            95      950         0           G       561                     Escherichia
            95      950         850         S       562                       Escherichia coli
            5       50          50          S1      498388                      Escherichia coli C
            5       50          50          S1      316401                      Escherichia coli ETEC

extract_kraken_reads.py ... -t 562: The 850 E. coli reads will be extracted
extract_kraken_reads.py ... -t 562 --include-parents: The 900 E. coli and Bacteria reads will be extracted.
extract_kraken_reads.py ... -t 562 --include-children: The 950 E. coli, E. coli C, and E. coli ETEC reads will be extracted.
extract_kraken_reads.py ... -t 498388: The 50 E. coli C reads will be extracted.
extract_kraken_reads.py ... -t 498388 --include-parents: The 50 E. coli C, 850 E. coli and 50 Bacteria reads will be extracted.
extract_kraken_reads.py ... -t 1 --include-children: All classified reads will be extracted.
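The read counts in these examples can be reproduced with a short sketch that rebuilds the taxonomy tree from the indentation of the name column. This is an illustration under stated assumptions, not the script's own code: the report is re-typed here with tab-separated columns and two spaces of indentation per level, the percentage column is not used, and the helpers `parse_report`, `with_parents`, and `with_children` are hypothetical.

```python
REPORT = """\
100\t1000\t0\tR\t1\troot
100\t1000\t0\tR1\t131567\t  cellular organisms
100\t1000\t50\tD\t2\t    Bacteria
95\t950\t0\tP\t1224\t      Proteobacteria
95\t950\t0\tC\t1236\t        Gammaproteobacteria
95\t950\t0\tO\t91347\t          Enterobacterales
95\t950\t0\tF\t543\t            Enterobacteriaceae
95\t950\t0\tG\t561\t              Escherichia
95\t950\t850\tS\t562\t                Escherichia coli
5\t50\t50\tS1\t498388\t                  Escherichia coli C
5\t50\t50\tS1\t316401\t                  Escherichia coli ETEC
"""


def parse_report(text):
    """Return (parent, direct_reads) dicts keyed by taxonomy ID."""
    parent, direct_reads = {}, {}
    stack = []  # (depth, taxid) entries for the current root-to-node path
    for line in text.splitlines():
        if not line.strip():
            continue
        _pct, _clade, direct, _rank, taxid, name = line.split("\t")
        taxid = int(taxid)
        depth = len(name) - len(name.lstrip(" "))
        # Pop entries that are not ancestors of the current node.
        while stack and stack[-1][0] >= depth:
            stack.pop()
        parent[taxid] = stack[-1][1] if stack else None
        direct_reads[taxid] = int(direct)
        stack.append((depth, taxid))
    return parent, direct_reads


def with_parents(taxid, parent):
    """The taxid plus every ancestor up to the root."""
    ids = set()
    while taxid is not None:
        ids.add(taxid)
        taxid = parent[taxid]
    return ids


def with_children(taxid, parent):
    """The taxid plus every node in its subtree."""
    return {t for t in parent if taxid in with_parents(t, parent)}


parent, direct = parse_report(REPORT)


def count(ids):
    """Sum the reads assigned directly to each taxonomy ID in `ids`."""
    return sum(direct[t] for t in ids)


print(count({562}))                          # -t 562
print(count(with_parents(562, parent)))      # -t 562 --include-parents
print(count(with_children(562, parent)))     # -t 562 --include-children
print(count(with_parents(498388, parent)))   # -t 498388 --include-parents
print(count(with_children(1, parent)))       # -t 1 --include-children
```

Only the reads assigned directly to each selected taxonomy ID (the third column) are summed, which is why --include-parents for E. coli adds the 50 reads assigned directly to Bacteria but none from the intermediate ranks.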

Author

Jennifer Lu, Ph.D.

For technical issues, bug reports, and code contributions, please use KrakenTools's GitHub repository.

Page Updated: 2020/12/09
