Overview
fqtrim is a versatile stand-alone
utility that can be used to
trim adapters, poly-A tails, terminal unknown bases (Ns) and low
quality 3' regions in reads from high-throughput next-generation
sequencing machines. The program allows for inexact matching of
adapters and poly-A sequences (thus accounting for mismatches and indels due to sequencing
errors). This utility can also apply a low-complexity ("dust") filter to the reads, or count and
collapse duplicate reads, which can be particularly useful for micro-RNA analysis pipelines.
fqtrim can be used as a pre-processing or filtering step for next-generation
sequence analysis pipelines (e.g. mapping, assembly) or as a
post-processing utility for the analysis and potential recovery of
unmapped reads or singletons resulting from such a pipeline.
Obtaining and installing fqtrim
The source archive can be downloaded here:
fqtrim-0.9.7.tar.gz
In order to build the fqtrim program from the source package, just unpack and run the 'make release' command:
tar xvfz fqtrim-N.NN.tar.gz
cd fqtrim-N.NN
make release
A pre-built Linux x86_64 binary package is also available: fqtrim-0.9.7.Linux_x86_64.tar.gz
Simply unpack this archive and copy the fqtrim executable to a directory of your choice.
Licensing and contact Information
fqtrim is free, open source software released under an Artistic License. You can contact us about fqtrim at: gpertea jhu edu
Usage
The program can take as input read sequence data in FASTA or FASTQ format
(optionally compressed, or streamed from stdin) and can process paired-end reads in a
consistent manner, i.e. without breaking the pairs, producing two
distinct output files with the paired reads (optionally compressed). The basic usage template is:
fqtrim [<options>] <input_file(s)>..
Input files can also be compressed FASTA or FASTQ files - but only
the basic Linux compression extensions are recognized: gz and bz2.
Options and input files can be provided in mixed order (options always
start with the dash ('-') character followed by an
alphanumeric character). When paired reads are provided as input
(two separate files) and should be kept together, the two file names should be
separated only by a comma or a colon character (no spaces, so that the two
file names appear as one argument to the program).
Unless the -o option is provided (see below), the trimmed/processed reads are printed to stdout.
The special input file name '-' (single dash, without quotes) will direct fqtrim to process a stream of FASTA
or FASTQ formatted records from stdin. The main options are explained below.
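When fqtrim is launched from a script, the pairing rule described above means the two mate files must reach the program as a single argument. The following is a minimal Python sketch of that idea; the helper name `fqtrim_args` is hypothetical and not part of fqtrim, and the options used are the ones from the usage example in this document.

```python
def fqtrim_args(options, read1, read2=None):
    """Build an argument list for fqtrim (hypothetical helper).

    For paired reads the two file names are joined with a comma and no
    spaces, so that they reach fqtrim as one argument.
    """
    inputs = read1 if read2 is None else f"{read1},{read2}"
    return ["fqtrim", *options, inputs]


args = fqtrim_args(["-A", "-l25", "-o", "trimmed.fq.gz"],
                   "exome_reads_1.fastq.gz", "exome_reads_2.fastq.gz")
print(" ".join(args))
```

The resulting list can be handed to `subprocess.run(args)` as is; passing a list rather than a shell string avoids accidental splitting of the paired-file argument.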
-o <outsuffix> |
write the trimmed/filtered reads to file(s)
named <input>.<outsuffix> which will be created in the
current (working) directory; this suffix should include the file
extension and if this extension is .gz, .gzip or .bz2 then the output
will be compressed accordingly. Note: if the input file is '-' (meaning, reads are streamed from stdin) then this option provides the full name of the output file instead of just the suffix. |
--outdir <outdir> |
for -o option, write the output file(s) to <outdir> path instead of the current directory.
|
-l <minlen> |
minimum read length after trimming; if the read
sequence is shorter than this, before or after the requested trimming
filters, the read is discarded (trashed). Default: 16. |
-5 <DNAseq> | look for and trim the given adapter/primer sequence at the 5' end of each read (e.g.: -5 CGACAGGTTCAGAGTTCTACAGTCCGACGATC). Note that only one 5' adapter sequence can be specified this way (multiple -5 options are not recognized). |
-3 <DNAseq> |
look for and trim the given adapter/primer sequence at the 3' end of each read (e.g.: -3 TCGTATGCCGTCTTCTGCTTG). Note that only one 3' adapter sequence can be specified this way. |
-f <filename> | this is an alternative to the basic -5 and -3
options, allowing for multiple adapter sequences to be given in a text
file, with each line having this format:
[<5'-adapter-sequence>][<delimiter><3'-adapter-sequence>]
This file has a loose 2-column format, where columns are delimited by tab, space, comma, colon or semicolon characters ('\t', ' ', ',', ':' or ';'). Adapter sequences to be trimmed from the 5' end should be given in the first column, while the 3' end adapters are in the 2nd column. If only the 3' adapters are to be trimmed, the corresponding line should start with one of the delimiter characters mentioned above.
Example 1: if we want to trim the adapter sequence CGACAGGTTCAGAGTTCTACAGTCCGACGATC from the left (5') end of the reads and the sequence TCGTATGCCGTCTTCTGCTTG from the 3' end, the file would have a line like this:
CGACAGGTTCAGAGTTCTACAGTCCGACGATC,TCGTATGCCGTCTTCTGCTTG
There is no relationship assumed between 5' and 3' adapter sequences if they are provided on the same line. The line above is equivalent to using 2 lines, one for each adapter sequence:
CGACAGGTTCAGAGTTCTACAGTCCGACGATC,
 TCGTATGCCGTCTTCTGCTTG
Note the space at the beginning of the line providing the 3' end adapter and the comma at the end of the first line. If, on the other hand, there were no delimiter at the end of the first line, e.g.:
CGACAGGTTCAGAGTTCTACAGTCCGACGATC
,TCGTATGCCGTCTTCTGCTTG
then the sequence on the first line would be searched for at *both* ends of a read (both 5' and 3'), while the sequence on the 2nd line in this case would only be searched at the 3' end, like before.
Example 2: if only a 3' adapter should be trimmed (e.g. the one from Example 1), the adapter file should have a line like this, starting with a delimiter character:
,TCGTATGCCGTCTTCTGCTTG
|
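The column rules for the adapter file can be easier to check in code. The sketch below is an illustrative Python reading of the rules as stated in this section, not fqtrim's actual parser; the function names are hypothetical. It treats any run of the listed delimiters as one column break, which is an assumption.

```python
import re

# Column delimiters named in the text: tab, space, comma, colon, semicolon.
DELIMS = "\t ,:;"
DELIM_RUN = re.compile(r"[\t ,:;]+")


def parse_adapter_line(line):
    """Return (five_prime, three_prime) lists of adapters for one line."""
    line = line.rstrip("\r\n")
    parts = DELIM_RUN.split(line, maxsplit=1)
    if len(parts) == 1:
        # No delimiter at all: the sequence is searched at both ends.
        seq = parts[0]
        return ([seq], [seq]) if seq else ([], [])
    # Text before the first delimiter is a 5' adapter, text after it a 3' one.
    five, three = parts[0], parts[1].strip(DELIMS)
    return ([five] if five else []), ([three] if three else [])


def parse_adapter_lines(lines):
    """Collect the 5' and 3' adapters from all lines of an adapter file."""
    five_all, three_all = [], []
    for line in lines:
        five, three = parse_adapter_line(line)
        five_all += five
        three_all += three
    return five_all, three_all
```

Under this reading, a line that begins with a delimiter yields only a 3' adapter, a line that ends with one yields only a 5' adapter, and a bare sequence is returned in both lists.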
-a <minmatch> | minimum length of the suffix-prefix overlap
between read and adapter sequence that can be trimmed at read end
(default: 6). The default is very permissive, allowing a perfect match
of a hexamer at the very end of the read to be trimmed if that hexamer
is at the appropriate end of
the adapter. This may lead to false positives and therefore
over-trimming of the reads but it can be useful for post-processing of
reads that were otherwise rejected by the analysis pipeline (e.g.
unmapped or singleton reads). |
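To make the suffix-prefix overlap concrete, here is a small Python sketch for the 3' case. It only handles exact matches, whereas fqtrim also allows mismatches and indels, so it should be read as an illustration of the -a threshold rather than the program's matching algorithm. The function name and the example read are made up for this illustration; the adapter is the 3' example sequence used in this document.

```python
def three_prime_overlap(read, adapter, minmatch=6):
    """Length of the longest read suffix that equals an adapter prefix.

    Returns 0 when no exact overlap of at least `minmatch` bases exists.
    `minmatch` is expected to be at least 1.
    """
    longest = min(len(read), len(adapter))
    for k in range(longest, minmatch - 1, -1):
        if read[-k:] == adapter[:k]:
            return k
    return 0


adapter = "TCGTATGCCGTCTTCTGCTTG"
read = "ACGGTTCAGGCTAATCGTAT"   # ends with the first 6 bases of the adapter
k = three_prime_overlap(read, adapter)
trimmed = read[:len(read) - k]
print(k, trimmed)
```

With the default threshold, a read that merely happens to end in the same hexamer is trimmed as well, which is the false-positive risk described above; raising `minmatch` makes such chance matches less likely.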
-A | disable automatic poly-A/T trimming at read ends. Note: by default fqtrim
looks for and trims poly-A stretches at the 3'-end and poly-T at 5'-end of each read, so the -A option should be used when such automatic poly-A/T trimming is not desired
(e.g. for genomic reads). This default behavior is a legacy of the fact that fqtrim was originally written for cleaning up transcriptome reads (especially ESTs) with poly-A tails.
In the case of RNA-Seq reads, disabling this behavior (i.e. using fqtrim with the -A option) may be recommended in order to avoid any read data loss due to false positives. |
-y <minpolyLen> | minimum length of poly-A/T run to remove
(default: 6); by default, a perfect stretch of 6 As (or more) at the
very end of the sequence (or 6 Ts at the beginning of the sequence)
will be trimmed. This value can be increased to avoid false positives. |
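The default poly-A/T behavior for perfect runs can be sketched in a few lines of Python. This is an assumption-laden illustration: it only handles exact, uppercase runs at the very ends of the sequence and ignores the inexact matching that fqtrim supports; the function name is hypothetical.

```python
def trim_poly_at(seq, minpoly=6):
    """Trim an exact poly-A run at the 3' end and a poly-T run at the 5' end."""
    # Length of the run of As at the very end of the sequence.
    tail = len(seq) - len(seq.rstrip("A"))
    if tail >= minpoly:
        seq = seq[:len(seq) - tail]
    # Length of the run of Ts at the very beginning of the sequence.
    head = len(seq) - len(seq.lstrip("T"))
    if head >= minpoly:
        seq = seq[head:]
    return seq


print(trim_poly_at("GATTACAGGCAAAAAAA"))  # run of 7 As is removed
print(trim_poly_at("GATTACAAAAA"))        # run of 5 As is too short, kept
```

Raising `minpoly` corresponds to raising the -y value: shorter terminal runs, which are more likely to be genuine sequence, are then left in place.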
-q <minqv> [-w <winsize>] [-t <maxtrim>] |
this option activates "quality trimming" at the
3' end of reads (which is disabled by default); a sliding window scans
the quality values from the 5' to the 3' end and trims the 3' end of the
read when the average quality value in the window drops below <minqv> (which is
a numeric value between 2 and the maximum quality value, so this
does not depend on whether the input represents quality values in
Phred-33 or Phred-64 format). The sliding window size can be controlled by the -w option (default: 6), while the -t option can limit the extent of the trimming triggered by this option (that is, no more than <maxtrim> bases will be trimmed off the 3' end even though the quality values may go below <minqv> beyond that position in the read). |
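The sliding-window description can be illustrated with a short Python sketch. The exact cut position inside the first failing window is not specified in this document, so the code below assumes the read is cut at the first base in that window whose quality is below the threshold; the function name is hypothetical and this is not fqtrim's implementation.

```python
def quality_trim_3p(quals, minqv, winsize=6, maxtrim=None):
    """Return how many bases to keep, counted from the 5' end.

    `quals` is a list of numeric quality values (already decoded from
    the FASTQ quality string).
    """
    n = len(quals)
    keep = n
    window_sum = sum(quals[:winsize])
    for start in range(0, n - winsize + 1):
        if start > 0:
            # Slide the window one base towards the 3' end.
            window_sum += quals[start + winsize - 1] - quals[start - 1]
        if window_sum < minqv * winsize:
            # Assumption: cut at the first low-quality base in this window.
            keep = start
            while keep < n and quals[keep] >= minqv:
                keep += 1
            break
    if maxtrim is not None:
        # Never trim more than `maxtrim` bases off the 3' end.
        keep = max(keep, n - maxtrim)
    return keep


quals = [30] * 10 + [2] * 6
print(quality_trim_3p(quals, minqv=20))
print(quality_trim_3p(quals, minqv=20, maxtrim=4))
```

In this example the first window whose mean falls below 20 starts at position 7, the cut lands at position 10 where the low values begin, and a `maxtrim` of 4 limits the trimming so that 12 bases are kept.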
-m <maxpercN> |
maximum percentage of Ns (undetermined bases)
allowed in a read after trimming (default 5); by default fqtrim trims
the end of the reads if they have Ns at that end, and if after this
automatic N-based trimming the percent of Ns in the read is above this
value, the read is discarded (trashed) |
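The two-step N handling (trim terminal Ns first, then test the percentage of remaining Ns) can be sketched as follows. This is a simplified Python illustration with a hypothetical function name; it ignores the minimum-length check and any other filters fqtrim applies.

```python
def passes_n_filter(seq, max_perc_n=5.0):
    """Trim terminal Ns, then test the percentage of Ns that remain.

    Returns (trimmed_sequence, keep), where keep is False if the read
    would be discarded.
    """
    trimmed = seq.strip("N")
    if not trimmed:
        return trimmed, False
    perc_n = 100.0 * trimmed.count("N") / len(trimmed)
    # The read is discarded only if the percentage is above the limit.
    return trimmed, perc_n <= max_perc_n


print(passes_n_filter("NNACGTACGTACGTACGTACGTN"))
print(passes_n_filter("ACGTNACGTNACGTACGTAC"))
```

The first read loses its terminal Ns and is kept; the second has 2 internal Ns in 20 bases (10%), which is above the default limit of 5, so it would be discarded.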
-n <prefix> | rename the reads using the <prefix> followed by a read counter; if -C option was also provided, the suffix "_x<N>" is appended (where <N> is the read duplication count) |
-r <report.txt> |
write a "trimming report" file listing the affected reads with a list
of trimming operations and a "trash code" if the read was discarded. This report has 3 columns: the 1st column is the read name, the 2nd is a comma-delimited list of trimming operations and the 3rd contains a one-letter "trash code" if the read did not pass the fqtrim processing (e.g. 's' means too short; other letter codes
match the last trim operation which caused the read to be rejected, i.e. to become shorter
than the minimum required length). The trim operations are encoded as such:
|
-s1 or -s2
|
for paired reads, either -s1 or -s2 can be used to disable processing of a specific
read in each pair (read1 or read2), while still discarding the whole pair if the other read does not pass the trimming process. This option is meant for single-cell data where one read in a pair is just a barcode read which shouldn't be trimmed. |
-T |
write the number of bases trimmed at 5' and 3' ends after the read names
in the header of each FASTA/FASTQ output record |
-D |
apply a low-complexity (dust) filter and discard any read that has over 50% of its length detected as low complexity |
-C |
collapse duplicate reads and append a _x<N>
count suffix to the read name (where <N> is the
multiplicity count for the read). This option keeps the read sequences in
memory, so it should only be used for smaller data sets, like micro-RNA
experiments |
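The idea behind duplicate collapsing, and the reason it needs the sequences in memory, can be shown with a brief Python sketch. The naming scheme used here (prefix, running counter, then `_x<N>`) follows the description of the -n and -C options, but the exact counter format is an assumption and the function name is hypothetical.

```python
from collections import Counter


def collapse_reads(seqs, prefix="read"):
    """Collapse identical sequences and name them <prefix><counter>_x<N>."""
    # Every distinct sequence is held in this table until the end.
    counts = Counter(seqs)
    collapsed = []
    for i, (seq, n) in enumerate(counts.items(), start=1):
        collapsed.append((f"{prefix}{i}_x{n}", seq))
    return collapsed


reads = ["ACGT", "TTGA", "ACGT", "ACGT"]
for name, seq in collapse_reads(reads, prefix="mir"):
    print(name, seq)
```

Because the count for a sequence is only final once all reads have been seen, the table of distinct sequences must stay in memory for the whole run, which is manageable for small-RNA data but can become large for other data sets.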
-p <numcpus> | use <numcpus> CPUs (threads) on the local
machine to speed up the read processing for large datasets. This is
especially useful when (multiple) adapters are provided. Note that this
option is currently incompatible with the -C option, which does not
support multi-threading. |
-Q | convert quality values to the other Phred
quality value representation; fqtrim usually autodetects the range of
quality values (Phred-33 or Phred-64) and this option converts the output from one range to the other. |
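Converting between the two representations amounts to shifting each quality character by the difference between the offsets (64 - 33 = 31). The Python sketch below shows that shift; it does not attempt the autodetection that fqtrim performs, so the caller must state the target encoding, and the function name is hypothetical.

```python
def convert_phred(qual, to_offset):
    """Re-encode a FASTQ quality string between Phred-33 and Phred-64."""
    if to_offset not in (33, 64):
        raise ValueError("to_offset must be 33 or 64")
    from_offset = 64 if to_offset == 33 else 33
    shift = to_offset - from_offset  # +31 or -31
    return "".join(chr(ord(c) + shift) for c in qual)


q33 = "II5!"                      # quality values 40, 40, 20, 0 in Phred-33
q64 = convert_phred(q33, 64)
print(q64, convert_phred(q64, 33))
```

The numeric quality values are unchanged by the conversion; only the characters used to encode them differ.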
-M |
disable name consistency checking for paired
reads; normally fqtrim checks the insert names for paired-end reads, but
some data sets may not follow the expected naming convention for the
reads. |
Common usage example
Cleaning up noisy exome data (paired reads) with Ns in the read sequence, allowing a minimum length of 25 bases for trimmed reads and maintaining the pairing of the reads:
fqtrim -A -l25 -o trimmed.fq.gz exome_reads_1.fastq.gz,exome_reads_2.fastq.gz
Note that for non-transcriptomic reads the -A
option is advised. In this example, the output of fqtrim will be
written to two compressed files with the suffix ".trimmed.fq.gz".