
Overview

fqtrim is a versatile stand-alone utility that can be used to trim adapters, poly-A tails, terminal unknown bases (Ns) and low-quality 3' regions in reads from high-throughput next-generation sequencing machines. The program allows for inexact matching of adapters and poly-A sequences (thus accounting for mismatches and indels due to sequencing errors). This utility can also apply a low-complexity ("dust") filter to the reads, or count and collapse duplicate reads, which can be particularly useful for micro-RNA analysis pipelines.
fqtrim can be used as a pre-processing or filtering step for next-generation sequence analysis pipelines (e.g. mapping, assembly) or as a post-processing utility for the analysis and potential recovery of unmapped reads or singletons resulting from such a pipeline.

Obtaining and installing fqtrim

The source archive can be downloaded here: fqtrim-0.9.7.tar.gz
In order to build the fqtrim program from the source package, just unpack and run the 'make release' command:

    tar xvfz fqtrim-N.NN.tar.gz
    cd fqtrim-N.NN
    make release

A pre-built Linux x86_64 binary package: fqtrim-0.9.7.Linux_x86_64.tar.gz
Simply unpack this archive and copy the fqtrim executable to a directory of your choice.

Licensing and contact information

fqtrim is free, open source software released under an Artistic License. You can contact us about fqtrim at: gpertea jhu edu

DOI
    10.5281/zenodo.593893

Usage

The program takes as input read sequence data in FASTA or FASTQ format (optionally compressed, or streamed at stdin). It can process paired-end reads in a consistent manner, i.e. without breaking the pairs: the paired reads are written to two distinct output files, which can optionally be compressed. The basic usage template is:

 fqtrim [<options>] <input_file(s)>..

Input files can also be compressed FASTA or FASTQ files, but only the basic Linux compression extensions are recognized: gz and bz2. Options and input files can be provided in mixed order (options always start with the dash ('-') character followed by an alphanumeric character). When paired reads are provided as input (two separate files) and should be kept together, the two file names must be separated only by a comma or a colon character (no spaces, so that the two file names appear as one argument to the program).
Unless the -o option is provided (see below), the trimmed/processed reads are printed at stdout. The special input file name '-' (single dash, without quotes) will direct fqtrim to process a stream of FASTA or FASTQ formatted records from stdin. The main options are explained below.

-o <outsuffix>
write the trimmed/filtered reads to file(s) named <input>.<outsuffix>, which will be created in the current (working) directory; this suffix should include the file extension, and if this extension is .gz, .gzip or .bz2 then the output will be compressed accordingly. Note: if the input file is '-' (meaning that reads are streamed from stdin) then this option provides the full name of the output file instead of just the suffix.
--outdir <outdir>
for the -o option, write the output file(s) to the <outdir> path instead of the current directory.
-l <minlen>
minimum read length after trimming; if the read sequence is shorter than this, before or after the requested trimming filters, the read is discarded (trashed). Default: 16.
-5 <DNAseq>
look for and trim the given adapter/primer sequence at the 5' end of each read (e.g.: -5 CGACAGGTTCAGAGTTCTACAGTCCGACGATC). Note that only one 5' adapter sequence can be specified this way (multiple -5 options are not recognized).
-3 <DNAseq>
look for and trim the given adapter/primer sequence at the 3' end of each read (e.g.: -3 TCGTATGCCGTCTTCTGCTTG). Note that only one 3' adapter sequence can be specified this way.
-f <filename>
this is an alternative to the basic -5 and -3 options, allowing for multiple adapter sequences to be given in a text file, with each line having this format:
[<5'-adapter-sequence>][<delimiter><3'-adapter-sequence>]
This file has a loose 2-column format, where columns are delimited by tab, space, comma, colon or semicolon characters ('\t', ' ', ',', ':' or ';'). Adapter sequences to be trimmed from the 5' end should be given in the first column, while the 3' end adapters are in the 2nd column. If only the 3' adapters are to be trimmed, the corresponding line should start with one of the delimiter characters mentioned above.

Example: if we want to trim the adapter sequence CGACAGGTTCAGAGTTCTACAGTCCGACGATC from the left (5') end of the reads and the sequence TCGTATGCCGTCTTCTGCTTG from the 3' end, the file would have a line like this:

CGACAGGTTCAGAGTTCTACAGTCCGACGATC,TCGTATGCCGTCTTCTGCTTG

There is no relationship assumed between 5' and 3' adapter sequences if they are provided on the same line. The line above is equivalent to using 2 lines, one for each adapter sequence:

CGACAGGTTCAGAGTTCTACAGTCCGACGATC,
 TCGTATGCCGTCTTCTGCTTG

Note the space at the beginning of the line providing the 3' end adapter and the comma at the end of the first line. If, on the other hand, there were no delimiter at the end of the line, e.g.:

CGACAGGTTCAGAGTTCTACAGTCCGACGATC
,TCGTATGCCGTCTTCTGCTTG

then the sequence on the first line would be searched for at *both* ends of a read (both 5' and 3'), while the sequence on the 2nd line would still be searched for only at the 3' end, as before.

Example 2: If only a 3' adapter is to be trimmed (e.g. the one from Example 1), the adapter file should have a line like this, starting with a delimiter character:

,TCGTATGCCGTCTTCTGCTTG
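
As a rough sketch, an adapter file combining the lines discussed above could be created and passed to fqtrim as follows. The file names adapters.txt, reads.fq and the output suffix trimmed.fq are placeholders chosen for this illustration; the fqtrim call is simply skipped if the program or the input file is not present.

```shell
# Write a two-line adapter file: a 5'-only adapter (trailing delimiter)
# and a 3'-only adapter (line starting with a delimiter).
printf '%s\n' \
  'CGACAGGTTCAGAGTTCTACAGTCCGACGATC,' \
  ',TCGTATGCCGTCTTCTGCTTG' > adapters.txt

# Count the lines that begin with a delimiter, i.e. define a 3'-only adapter.
three_only=$(grep -c '^[,;:[:blank:]]' adapters.txt)
echo "3'-only adapter lines: $three_only"

# Run fqtrim only if it is installed and the placeholder input exists.
if command -v fqtrim >/dev/null 2>&1 && [ -f reads.fq ]; then
  fqtrim -f adapters.txt -o trimmed.fq reads.fq
fi
```

Checking the file for leading and trailing delimiters before a long run is a cheap way to catch a line that would otherwise be searched for at both ends of every read.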

-a <minmatch>
minimum length of the suffix-prefix overlap between read and adapter sequence that can be trimmed at read end (default: 6). The default is very permissive, allowing a perfect match of a hexamer at the very end of the read to be trimmed if that hexamer is at the appropriate end of the adapter. This may lead to false positives and therefore over-trimming of the reads, but it can be useful for post-processing of reads that were otherwise rejected by the analysis pipeline (e.g. unmapped or singleton reads).
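
To illustrate what a suffix-prefix overlap means here, the sketch below finds the longest suffix of a read that exactly matches a prefix of a 3' adapter. It is only an illustration under simplifying assumptions: fqtrim itself also tolerates mismatches and indels, the read sequence is made up, and the variable names are hypothetical.

```shell
adapter=TCGTATGCCGTCTTCTGCTTG          # 3' adapter from the -3 example
read_seq=ACGGATTACAGGCTTAGCTCGTATGC    # made-up read ending with the first 8 adapter bases
minmatch=6                             # same value as the -a default

# Try the longest possible overlap first and stop at the first exact match.
overlap=$(awk -v r="$read_seq" -v a="$adapter" -v m="$minmatch" 'BEGIN {
  max = (length(a) < length(r)) ? length(a) : length(r)
  for (k = max; k >= m; k--)
    if (substr(r, length(r) - k + 1) == substr(a, 1, k)) { print k; exit }
  print 0
}')

# Keep everything before the overlapping suffix.
keep=$(( ${#read_seq} - overlap ))
trimmed=$(printf '%s' "$read_seq" | cut -c "1-$keep")
echo "overlap=$overlap trimmed=$trimmed"
```

With a smaller minmatch, shorter chance matches at the read end would also be accepted, which is the over-trimming risk described above.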
-A
disable automatic polyA/T trimming at read ends. Note: by default fqtrim looks for and trims poly-A stretches at the 3'-end and poly-T at the 5'-end of each read, so the -A option should be used when such automatic poly-A/T trimming is not desired (e.g. for genomic reads). This default behavior is a legacy of the fact that fqtrim was originally written for cleaning up transcriptome reads (especially ESTs) with poly-A tails. In the case of RNA-Seq reads, disabling this behavior (i.e. using fqtrim with the -A option) may be recommended in order to avoid any read data loss due to false positives.
-y <minpolyLen>
minimum length of poly-A/T run to remove (default: 6); by default, a perfect stretch of 6 As (or more) at the very end of the sequence (or 6 Ts at the beginning of the sequence) will be trimmed. This value can be increased to avoid false positives.
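
The length threshold can be pictured with a small sketch that measures a perfect run of As at the 3' end of a read and removes it only when it reaches the minimum length. This covers only the perfect-stretch case mentioned above (fqtrim also allows inexact poly-A matches), and the read sequence and variable names are made up for the illustration.

```shell
read_seq=GATTACAGGCTTAGCAAAAAAAA   # made-up read ending in a run of 8 As
minpoly=6                          # same value as the -y default

# Length of the uninterrupted run of As at the very end of the read.
tail_len=$(printf '%s\n' "$read_seq" |
  awk '{ match($0, /A+$/); print (RSTART ? RLENGTH : 0) }')

# Trim the run only if it is at least minpoly bases long.
if [ "$tail_len" -ge "$minpoly" ]; then
  trimmed=$(printf '%s\n' "$read_seq" | sed 's/A*$//')
else
  trimmed=$read_seq
fi
echo "tail_len=$tail_len trimmed=$trimmed"
```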
-q <minqv>
[-w <winsize>]
[-t <maxtrim>]
this option activates "quality trimming" at the 3' end of reads (which is disabled by default); a sliding window scans the quality values from the 5' to the 3' end and trims the 3' end of the read when the average quality value drops below <minqv>. The <minqv> value is a numeric quality value, from 2 up to the maximum quality value, so it does not depend on whether the input represents quality values in Phred-33 or Phred-64 format.
The sliding window size can be controlled by the -w option (default: 6), while the -t option can limit the extent of the trimming triggered by this option (that is, no more than <maxtrim> bases will be trimmed off the 3' end, even though the quality values may go below <minqv> beyond that position in the read).
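
The interplay of the three values can be sketched as below. This is one plausible reading of the description above, not fqtrim's actual code: in particular, cutting the read at the start of the first failing window is an assumption, and the Phred-33 quality string and variable names are made up.

```shell
qual='IIIIIIIIIIIIII5555##'   # made-up Phred-33 string (I=40, 5=20, #=2)
minqv=20                      # -q
winsize=6                     # -w
maxtrim=10                    # -t

keep=$(awk -v q="$qual" -v minqv="$minqv" -v w="$winsize" -v t="$maxtrim" 'BEGIN {
  for (i = 33; i <= 126; i++) ord[sprintf("%c", i)] = i   # ASCII lookup table
  n = length(q)
  keep = n
  # Slide the window from the 5prime end; stop at the first window whose
  # average quality falls below minqv (assumed cut point: window start).
  for (s = 1; s + w - 1 <= n; s++) {
    sum = 0
    for (j = s; j < s + w; j++) sum += ord[substr(q, j, 1)] - 33
    if (sum / w < minqv) { keep = s - 1; break }
  }
  if (n - keep > t) keep = n - t    # never trim more than maxtrim bases
  print keep
}')
echo "bases kept: $keep of ${#qual}"
```

In this example the first window with an average below 20 starts at base 15, so 14 bases are kept and 6 are trimmed, which is within the limit of 10.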
-m <maxpercN>
maximum percentage of Ns (undetermined bases) allowed in a read after trimming (default: 5); by default fqtrim trims the ends of the reads if they have Ns at those ends, and if after this automatic N-based trimming the percentage of Ns in the read is above this value, the read is discarded (trashed).
-n <prefix>
rename the reads using the <prefix> followed by a read counter; if the -C option was also provided, the suffix "_x<N>" is appended (where <N> is the read duplication count)
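
As a worked example of the percentage check, the sketch below counts the Ns left inside a made-up 20-base read and compares the result with the default limit. The read, the variable names and the use of integer division are assumptions of this illustration, not details taken from fqtrim.

```shell
read_seq=ACGTNNACGTACGTACGTAC   # made-up 20-base read with 2 internal Ns
maxpercN=5                      # same value as the -m default

len=${#read_seq}
n_count=$(printf '%s' "$read_seq" | tr -cd 'Nn' | wc -c)
n_count=$((n_count))            # normalize any whitespace padding from wc
perc=$(( n_count * 100 / len )) # integer percentage of Ns

if [ "$perc" -gt "$maxpercN" ]; then verdict=discard; else verdict=keep; fi
echo "Ns=$n_count of $len (${perc}%) -> $verdict"
```

Here 2 of 20 bases are Ns, i.e. 10%, which exceeds the default of 5, so such a read would be discarded.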
-r <report.txt>
write a "trimming report" file listing the affected reads with a list of trimming operations and a "trash code" if the read was discarded.
This report has 3 columns: 1st column is the read name, 2nd is a comma delimited list of trimming operations and the 3rd one contains a one letter "trash code" if the read did not pass the fqtrim processing (e.g. 's' means too short, other letter codes match the last trim operation which caused the read to be rejected, i.e. to become shorter than minimum required length).
The trim operations are encoded as follows:
  • 1 digit number (5 or 3) representing the 5' or 3' end where the trimming was performed
  • a one letter code for the trimming operation type:
    • Q : trimming due to low quality values
    • N : trimming due to high density of undetermined bases (Ns) in the DNA sequence
    • A : poly-A trimming
    • T : poly-T trimming
    • V (or lowercase letters a, b, c etc.) : vector/adapter trimming. Lower case letters are used instead of 'V' when multiple vector/adapter sequences were provided with option -f and the --aidx option was provided, in which case the alphabetical order of the code matches the order of the vector/adapter sequences.
  • the number of bases that were trimmed
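
A report in this layout can be summarized with a few lines of awk. The sketch below works on made-up report lines and assumes that the three columns are tab-delimited and that each operation is written without separators (for example 3Q12 for 12 bases trimmed at the 3' end due to low quality); neither detail is confirmed by the description above.

```shell
# Made-up report lines: read name, trim operations, optional trash code.
printf 'read1\t3Q12\t\nread2\t5V20,3A9\t\nread3\t3N40\ts\n' > report.txt

# Add up trimmed bases per operation code and count discarded reads.
summary=$(awk -F'\t' '
  {
    n = split($2, ops, ",")
    for (i = 1; i <= n; i++) {
      code = substr(ops[i], 2, 1)        # letter after the 5/3 end digit
      bases[code] += substr(ops[i], 3)   # remaining characters: base count
    }
    if ($3 != "") trashed++              # non-empty 3rd column: read discarded
  }
  END {
    printf "Q=%d A=%d N=%d V=%d trashed=%d\n",
      bases["Q"], bases["A"], bases["N"], bases["V"], trashed
  }' report.txt)
echo "$summary"
```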
-s1 or -s2
for paired reads, either -s1 or -s2 can be used to disable processing of a specific read in each pair (read1 or read2), while still discarding the whole pair if the other read does not pass the trimming process.
This option is meant for single-cell data where one read in a pair is just a barcode read that should not be trimmed.
-T
write the number of bases trimmed at 5' and 3' ends after the read names in the header of each FASTA/FASTQ output record
-D
apply a low-complexity (dust) filter and discard any read that has over 50% of its length detected as low complexity
-C
collapse duplicate reads and append a _x<N> count suffix to the read name (where <N> is the multiplicity count for the read). This option keeps the read sequences in memory, so it should only be used for smaller data sets, such as micro-RNA experiments.
-p <numcpus>
use <numcpus> CPUs (threads) on the local machine to speed up the read processing for large datasets. This is especially useful when (multiple) adapters are provided. Note that this option is currently incompatible with the -C option, which does not support multi-threading.
-Q
convert quality values to the other Phred quality value representation; fqtrim usually autodetects the range of quality values (Phred-33 or Phred-64), and this option converts the output from one range to the other.
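
The effect of collapsing can be pictured with a small awk sketch over a made-up FASTA file with one sequence line per record. It is not fqtrim's implementation: keeping the name of the first read seen and writing an _x1 suffix for unique reads are assumptions of this illustration. It does show why every distinct sequence has to be held in memory until the input ends.

```shell
# Made-up FASTA input: r1 and r2 are duplicates, r3 is unique.
printf '>r1\nACGT\n>r2\nACGT\n>r3\nTTGA\n' > reads.fa

collapsed=$(awk '
  /^>/ { name = substr($0, 2); next }   # remember the current read name
  {
    if (!($0 in count)) {               # first time this sequence is seen
      first[$0] = name
      order[++n] = $0
    }
    count[$0]++
  }
  END {
    for (i = 1; i <= n; i++)
      printf ">%s_x%d\n%s\n", first[order[i]], count[order[i]], order[i]
  }' reads.fa)
echo "$collapsed"
```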
-M
disable name consistency checking for paired reads; normally fqtrim checks the insert names for paired-end reads, but some data sets may not follow the expected naming convention for the reads.

Common usage example

Cleaning up noisy exome data (paired reads) with Ns in the read sequence, allowing a minimum length of 25 bases for trimmed reads and maintaining the pairing of the reads:

 fqtrim -A -l25 -o trimmed.fq.gz exome_reads_1.fastq.gz,exome_reads_2.fastq.gz 

Note that for non-transcriptomic reads the -A option is advised. In this example, the output of fqtrim will be written to two compressed files with the suffix ".trimmed.fq.gz".