Extracting Kraken Reads
After running Kraken, Kraken2, or KrakenUniq, users may use the extract_kraken_reads.py program to extract the FASTA or FASTQ reads classified as a specific taxonomy ID. For example, this program can be used to extract all bacterial reads or only reads assigned to Escherichia coli. Users must provide (at minimum) the original sequence file(s), at least one taxonomy ID, and the Kraken output file.
Usage/Options
This section outlines the basic usage of the extract_kraken_reads.py script along with all possible parameters. Example usage of these parameters is detailed in the remaining sections.
python extract_kraken_reads.py
-k, --kraken SAMPLE.KRAKEN......... Kraken output file
-s, -s1, -1, -U SEQUENCE.FILE...... FASTA/FASTQ sequence file (may be gzipped)
-s2, -2 SEQUENCE_2.FILE............ FASTA/FASTQ sequence file (for paired reads, may be gzipped)
-o, --output READS.FASTA........... output FASTA/FASTQ file with extracted sequences
-t, --taxid TID TID2............... list of taxonomy IDs to extract (separated by spaces)
Optional Parameters:
-o2, --output2 READS_2.FASTA....... second output FASTA/FASTQ file (required for paired input)
--fastq-output..................... produces FASTQ files for extracted reads (requires FASTQ input)
--exclude.......................... instead of finding reads matching specified taxids, searches for reads NOT matching specified taxids.
-r, --report SAMPLE.KREPORT........ Kraken report file (required if specifying --include-children or --include-parents)
--include-children................. include reads classified at more specific levels than specified taxid levels.
--include-parents.................. include reads classified at all taxonomy levels between the root and specified taxid levels.
--max #............................ maximum number of reads to save (e.g. if user wants only 10 reads extracted)
--append........................... if output file exists, appends reads (specifying this does not check formatting of output file)
--noappend......................... [default] rewrites existing output file
Input Files
-k, --kraken SAMPLE.KRAKEN
Users should specify the standard Kraken output file. This file contains the following tab-delimited columns:
- "C"/"U" letter code to indicate classified/unclassified
- sequence ID (obtained from the FASTA/FASTQ header)
- taxonomy ID assigned to the sequence by Kraken
- length of sequence in bp (paired: lengths separated by |)
- space-delimited list of LCA mapping of k-mers
The script uses only the 2nd and 3rd columns to extract the correct sequences. The extract_kraken_reads.py script does not accept the Kraken report format or Kraken mpa-style output.
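As a rough illustration of how the 2nd and 3rd columns can drive read selection, the sketch below collects the sequence IDs assigned to a set of taxonomy IDs. It is not the script's actual code: the function name `collect_read_ids` and the sample lines are hypothetical, and it assumes the 3rd column holds a bare numeric taxonomy ID.

```python
import io


def collect_read_ids(kraken_lines, taxids):
    """Return the set of sequence IDs whose assigned taxonomy ID is in `taxids`.

    Hypothetical helper for illustration; assumes column 3 is a bare numeric ID.
    """
    wanted = {str(t) for t in taxids}
    read_ids = set()
    for line in kraken_lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 3:
            continue  # skip blank or malformed lines
        seq_id, taxid = fields[1], fields[2]  # 2nd and 3rd columns only
        if taxid in wanted:
            read_ids.add(seq_id)
    return read_ids


# Made-up three-read Kraken output (tab-delimited, five columns).
sample = io.StringIO(
    "C\tread1\t562\t100\t562:66\n"
    "C\tread2\t9606\t100\t9606:66\n"
    "U\tread3\t0\t100\t0:66\n"
)
ids = collect_read_ids(sample, [562])
print(sorted(ids))
```

The resulting set of IDs can then be matched against the headers of the input sequence file(s) to decide which records to write out.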
-s, -s1, -1, -U SEQUENCE.FILE/-s2, -2 SEQUENCE_2.FILE
Input sequence files must be either FASTQ or FASTA files. Input files may be gzipped. The program will detect whether the file is gzipped based on the file extension. The program will detect the FASTQ or FASTA format based on the first character in the file (">" for FASTA, "@" for FASTQ).
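The two detection rules described above can be sketched as follows. This is an illustrative approximation rather than the script's own implementation: the helper names `open_sequence_file` and `detect_format` are hypothetical, and the sketch assumes that a gzipped file is recognized by a `.gz` extension.

```python
import gzip
import os
import tempfile


def open_sequence_file(path):
    """Open a plain or gzipped file in text mode, chosen by file extension.

    Assumption: gzipped inputs end in '.gz'.
    """
    if path.endswith(".gz"):
        return gzip.open(path, "rt")
    return open(path, "r")


def detect_format(path):
    """Return 'FASTA' or 'FASTQ' based on the first character of the file."""
    with open_sequence_file(path) as handle:
        first = handle.read(1)
    if first == ">":
        return "FASTA"
    if first == "@":
        return "FASTQ"
    raise ValueError(f"{path}: first character {first!r} is neither '>' nor '@'")


# Demonstration with throwaway files.
with tempfile.TemporaryDirectory() as tmp:
    fasta_path = os.path.join(tmp, "reads.fa")
    with open(fasta_path, "w") as out:
        out.write(">read1\nACGT\n")

    fastq_gz_path = os.path.join(tmp, "reads.fq.gz")
    with gzip.open(fastq_gz_path, "wt") as out:
        out.write("@read1\nACGT\n+\nIIII\n")

    fasta_format = detect_format(fasta_path)
    fastq_format = detect_format(fastq_gz_path)

print(fasta_format, fastq_format)
```

Because the gzip check relies on the file name rather than the file contents, a gzipped file without the expected extension would be read as plain text under this scheme.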
Paired Input/Output
-o, --output READS.FASTA/-o2, --output2 READS_2.FASTA
Users who ran Kraken using paired reads should provide both read files AND two output file names. For example:
python extract_kraken_reads.py -k sample.kraken -s1 read1.fq -s2 read2.fq -o extracted_1.fa -o2 extracted_2.fa
--exclude parameter
By default, reads classified at specified taxonomy IDs (and IDs included using --include-parents/--include-children) will be extracted. However, specifying --exclude will cause reads NOT classified at the specified taxonomy IDs to be extracted. For example:
- extract_kraken_reads.py ... --taxid 9606 --exclude :: extracts all non-human reads
- extract_kraken_reads.py ... --taxid 2 --exclude --include-children :: extracts all non-bacterial reads (excludes all reads classified at any classification in the Bacteria subtree)
- extract_kraken_reads.py ... --taxid 9606 --exclude --include-parents :: extracts all reads NOT classified as human (taxid 9606) or at any classification in the direct ancestry of human (e.g. will exclude reads classified at the Primates, Chordata, or Eukaryota levels).
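Conceptually, --exclude inverts a set-membership test on each read's assigned taxonomy ID. A minimal sketch of that idea, using a hypothetical `should_extract` helper and made-up read assignments (not the script's actual code):

```python
def should_extract(read_taxid, selected_taxids, exclude=False):
    """Decide whether a read is kept.

    Without `exclude`, keep reads whose taxid is in the selection;
    with `exclude`, keep reads whose taxid is NOT in the selection.
    """
    in_selection = read_taxid in selected_taxids
    return not in_selection if exclude else in_selection


# Made-up assignments: read ID -> taxonomy ID assigned by Kraken.
assignments = {"read1": 9606, "read2": 562, "read3": 9606, "read4": 2}
selected = {9606}  # human

human_reads = sorted(
    r for r, t in assignments.items() if should_extract(t, selected)
)
non_human_reads = sorted(
    r for r, t in assignments.items() if should_extract(t, selected, exclude=True)
)
print(human_reads)
print(non_human_reads)
```

When --include-children or --include-parents is also given, the selection set is first enlarged with the extra taxonomy IDs, and the same inverted test is then applied to the enlarged set.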
--include-parents/--include-children parameters
By default, only reads classified exactly at the specified taxonomy IDs will be extracted. Options --include-parents and --include-children can be used to extract reads classified within the same lineage as a specified taxonomy ID.
If users specify either of these options, a Kraken report file must also be provided. The Kraken report file must have the following tab-delimited columns:
- percentage of total reads at this classification
- number of reads rooted at this classification
- number of reads assigned directly to this classification
- rank code signaling type of classification (e.g. S = species, G = genus, etc)
- taxonomy ID
- indented scientific name (number of spaces reflecting tree structure)
For example, consider the following Kraken report and the commands listed after it:
[%]   [reads]  [lreads]  [lvl]  [tid]    [name]
100   1000     0         R      1        root
100   1000     0         R1     131567     cellular organisms
100   1000     50        D      2            Bacteria
95    950      0         P      1224           Proteobacteria
95    950      0         C      1236             Gammaproteobacteria
95    950      0         O      91347              Enterobacterales
95    950      0         F      543                  Enterobacteriaceae
95    950      0         G      561                    Escherichia
95    950      850       S      562                      Escherichia coli
5     50       50        S1     498388                     Escherichia coli C
5     50       50        S1     316401                     Escherichia coli ETEC
- extract_kraken_reads.py ... -t 562: The 850 E. coli reads will be extracted.
- extract_kraken_reads.py ... -t 562 --include-parents: The 900 E. coli and Bacteria reads will be extracted.
- extract_kraken_reads.py ... -t 562 --include-children: The 950 E. coli, E. coli C, and E. coli ETEC reads will be extracted.
- extract_kraken_reads.py ... -t 498388: The 50 E. coli C reads will be extracted.
- extract_kraken_reads.py ... -t 498388 --include-parents: The 50 E. coli C, 850 E. coli and 50 Bacteria reads will be extracted.
- extract_kraken_reads.py ... -t 1 --include-children: All classified reads will be extracted.
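The read counts in the examples above can be reproduced with a short sketch that rebuilds the taxonomy tree from the name indentation in the report. This is only an illustration under stated assumptions, not the script's implementation: the helpers `parse_report`, `expand_taxids`, and `count_reads` are hypothetical, each tree level is assumed to be indented by two spaces, and the report lines are generated in code from the read counts shown in the example.

```python
def parse_report(lines):
    """Return {taxid: (parent_taxid_or_None, direct_reads)} from report lines."""
    nodes = {}
    stack = []  # (depth, taxid) entries for the lineage currently being read
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 6:
            continue  # skip blank or malformed lines
        direct_reads = int(fields[2])
        taxid = int(fields[4])
        name = fields[5]
        # Assumption: two leading spaces per level in the name column.
        depth = (len(name) - len(name.lstrip(" "))) // 2
        while stack and stack[-1][0] >= depth:
            stack.pop()
        parent = stack[-1][1] if stack else None
        nodes[taxid] = (parent, direct_reads)
        stack.append((depth, taxid))
    return nodes


def expand_taxids(nodes, taxids, include_children=False, include_parents=False):
    """Return the requested taxids plus any ancestors/descendants asked for."""
    requested = set(taxids)
    selected = set(requested)
    if include_parents:
        for taxid in requested:
            parent = nodes[taxid][0]
            while parent is not None:
                selected.add(parent)
                parent = nodes[parent][0]
    if include_children:
        for taxid, (parent, _) in nodes.items():
            # Walk up from each node; keep it if any ancestor was requested.
            ancestor = parent
            while ancestor is not None:
                if ancestor in requested:
                    selected.add(taxid)
                    break
                ancestor = nodes[ancestor][0]
    return selected


def count_reads(nodes, selected):
    """Sum the reads assigned directly to each selected taxid."""
    return sum(nodes[t][1] for t in selected)


# (rooted reads, direct reads, rank, taxid, depth, name) from the example report.
ROWS = [
    (1000, 0, "R", 1, 0, "root"),
    (1000, 0, "R1", 131567, 1, "cellular organisms"),
    (1000, 50, "D", 2, 2, "Bacteria"),
    (950, 0, "P", 1224, 3, "Proteobacteria"),
    (950, 0, "C", 1236, 4, "Gammaproteobacteria"),
    (950, 0, "O", 91347, 5, "Enterobacterales"),
    (950, 0, "F", 543, 6, "Enterobacteriaceae"),
    (950, 0, "G", 561, 7, "Escherichia"),
    (950, 850, "S", 562, 8, "Escherichia coli"),
    (50, 50, "S1", 498388, 9, "Escherichia coli C"),
    (50, 50, "S1", 316401, 9, "Escherichia coli ETEC"),
]
TOTAL = 1000
report_lines = [
    f"{100 * reads / TOTAL:g}\t{reads}\t{direct}\t{rank}\t{taxid}\t{'  ' * depth}{name}"
    for reads, direct, rank, taxid, depth, name in ROWS
]

nodes = parse_report(report_lines)
exact = count_reads(nodes, expand_taxids(nodes, [562]))
with_parents = count_reads(nodes, expand_taxids(nodes, [562], include_parents=True))
with_children = count_reads(nodes, expand_taxids(nodes, [562], include_children=True))
print(exact, with_parents, with_children)
```

Note that only the reads assigned directly to each selected classification (the third report column) are summed, which is why adding the parents of E. coli contributes just the 50 reads assigned directly to Bacteria.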
Author
Jennifer Lu, Ph.D.
For technical issues, bug reports, and code contributions, please use KrakenTools's GitHub repository.
Page Updated: 2020/12/09