Overview
fqtrim is a versatile stand-alone utility that can be used to trim
adapters, poly-A tails, terminal unknown bases (Ns) and low quality
3' regions in reads from high-throughput next-generation sequencing
machines. The program allows for inexact matching of adapters and
poly-A sequences (thus accounting for mismatches and indels due to
sequencing errors). This utility can also apply a low-complexity
("dust") filter to the reads, or count and collapse duplicate reads,
which can be particularly useful for micro-RNA analysis pipelines.
fqtrim can be used as a pre-processing or filtering step for
next-generation sequence analysis pipelines (e.g. mapping, assembly)
or as a post-processing utility for the analysis and potential
recovery of unmapped reads or singletons resulting from such a pipeline.
Obtaining and installing fqtrim
The source archive can be downloaded here: 
		fqtrim-0.9.7.tar.gz
To build the fqtrim program from the source package, unpack the archive and run the 'make release' command (N.NN stands for the version number of the downloaded archive):
    tar xvfz fqtrim-N.NN.tar.gz
    cd fqtrim-N.NN
    make release
A pre-built Linux x86_64 binary package is also available: fqtrim-0.9.7.Linux_x86_64.tar.gz
Simply unpack this archive and copy the fqtrim executable to a directory of your choice.
Licensing and contact Information
fqtrim is free, open source software released under an Artistic License. You can contact us about fqtrim at: gpertea jhu edu
Usage
 The program can take as input read sequence data in FASTA or FASTQ format 
(compressed or streamed at stdin) and can process paired-end reads in a 
consistent manner (i.e. not breaking the pairs and producing two 
distinct output files with the paired reads, optionally compressed). The basic usage template is:
fqtrim [<options>] <input_file(s)>..
Input files can also be compressed FASTA or FASTQ files, but only 
the basic Linux compression extensions are recognized: gz and bz2. 
Options and input files can be provided in mixed order (options always 
start with the dash ('-') character followed by an 
alphanumeric character). When paired reads are provided as input 
(two separate files) and should be kept together, the two file names must be 
separated only by a comma or a colon character (no spaces, so that the two 
file names appear as one argument to the program).
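As a rough illustration of these input conventions, the following Python sketch splits one non-option argument into its input sources and guesses the compression from the file extension. The helper names split_input_arg and compression_of are hypothetical and are not part of fqtrim; the sketch only mirrors the rules stated in this section (a comma or colon between paired files, the special name '-' for stdin, and the gz and bz2 extensions), and it does not attempt to handle options or their values.

```python
import os

# Extensions the text says are recognized for compressed input.
COMPRESSION_BY_EXT = {".gz": "gzip", ".bz2": "bzip2"}


def split_input_arg(arg):
    """Split one non-option argument into its input sources.

    Hypothetical helper: returns ["-"] for stdin, two names for a
    comma- or colon-separated pair, or a single name otherwise.
    """
    if arg == "-":
        return ["-"]
    for pos, ch in enumerate(arg):
        if ch in ",:":
            # Split at the first comma or colon only.
            return [arg[:pos], arg[pos + 1:]]
    return [arg]


def compression_of(path):
    """Return "gzip", "bzip2" or None, based only on the file extension."""
    ext = os.path.splitext(path)[1]
    return COMPRESSION_BY_EXT.get(ext)


if __name__ == "__main__":
    for name in split_input_arg("exome_reads_1.fastq.gz,exome_reads_2.fastq.gz"):
        print(name, compression_of(name))
```

Because the pair is split on the first comma or colon, a file name that itself contains one of these characters would be misread as a pair; this is a limitation of the sketch.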
Unless the -o option is provided (see below), the trimmed/processed reads are printed at stdout.
The special input file name '-' (single dash, without quotes) will direct fqtrim to process a stream of FASTA 
or FASTQ formatted records from stdin. The main options are explained below.
| -o <outsuffix> | write the trimmed/filtered reads to file(s) 
named <input>.<outsuffix>  which will be created in the
 current (working) directory; this suffix  should include the file 
extension and if this extension is .gz, .gzip or .bz2 then the output 
will be compressed accordingly. Note: if the input file is '-' (meaning, reads are streamed from stdin) then this option provides the full name of the  output file instead of just the suffix. | 
| --outdir <outdir> | for the -o option, write the output file(s) to the <outdir> path instead of the current directory. | 
| -l <minlen> | minimum read length after trimming; if the read
 sequence is shorter than this, before or after the requested trimming 
filters, the read is discarded (trashed). Default: 16. | 
| -5 <DNAseq> | look for and trim the given adapter/primer sequence at the 5' end of each read (e.g.: -5 CGACAGGTTCAGAGTTCTACAGTCCGACGATC). Note that only one 5' adapter sequence can be specified this way (multiple -5 options are not recognized). | 
| -3 <DNAseq> | look for and trim the given adapter/primer sequence at the 3' end of each read (e.g.: -3 TCGTATGCCGTCTTCTGCTTG). Note that only one 3' adapter sequence can be specified this way. | 
| -f <filename> | this is an alternative to the basic -5 and -3 
options, allowing for multiple adapter sequences to be given in a text 
file, with each line having this format:
    [<5'-adapter-sequence>][<delimiter><3'-adapter-sequence>]
This file has a loose 2-column format, where columns are delimited by tab, space, comma, colon or semicolon characters ('\t', ' ', ';', ':' or ','). Adapter sequences to be trimmed from the 5' end should be given in the first column, while the 3' end adapters are in the 2nd column. If only the 3' adapters are to be trimmed, the corresponding line should start with one of the delimiter characters mentioned above.
Example 1: if we want to trim the adapter sequence CGACAGGTTCAGAGTTCTACAGTCCGACGATC from the left (5') end of the reads and the sequence TCGTATGCCGTCTTCTGCTTG from the 3' end, the file would have a line like this:
    CGACAGGTTCAGAGTTCTACAGTCCGACGATC,TCGTATGCCGTCTTCTGCTTG
There is no relationship assumed between 5' and 3' adapter sequences if they are provided on the same line. The line above is equivalent to using 2 lines, one for each adapter sequence:
    CGACAGGTTCAGAGTTCTACAGTCCGACGATC,
     TCGTATGCCGTCTTCTGCTTG
Note the space at the beginning of the line providing the 3' end adapter and the comma at the end of the first line. If, on the other hand, there were no delimiter at the end of the first line, e.g.:
    CGACAGGTTCAGAGTTCTACAGTCCGACGATC
    ,TCGTATGCCGTCTTCTGCTTG
then the sequence on the first line would be searched for at *both* ends of a read (both 5' and 3'), while the sequence on the 2nd line in this case would only be searched at the 3' end, like before.
Example 2: if only a 3' adapter should be trimmed (e.g. the one from Example 1), the adapter file should have a line like this, starting with a delimiter character:
    ,TCGTATGCCGTCTTCTGCTTG | 
| -a <minmatch> | minimum length of the suffix-prefix overlap 
between read and adapter sequence that can be trimmed at read end 
(default: 6). The default is very permissive, allowing a perfect match 
of a hexamer at the very end of the read to be trimmed if that hexamer 
is at the appropriate end of
 the adapter. This may lead to false positives and therefore 
over-trimming of the reads but it can be useful for post-processing of 
reads that were otherwise rejected by the analysis pipeline (e.g. 
unmapped or singleton reads). | 
| -A | disable automatic polyA/T trimming at read ends. Note: by default fqtrim 
looks for and trims poly-A stretches at the 3'-end and poly-T at 5'-end of each read, so the -A option should be used when such automatic poly-A/T trimming is not desired
(e.g. for genomic reads). This default behavior is a legacy of the fact that fqtrim was originally written for cleaning up transcriptome reads (especially ESTs) with poly-A tails. 
In the case of RNA-Seq reads, disabling this behavior (i.e. using fqtrim with the -A option) may be recommended in order to avoid any read data loss due to false positives. | 
| -y <minpolyLen> | minimum length of poly-A/T run to remove 
(default: 6); by default, a perfect stretch of 6 As (or more) at the 
very end of the sequence  (or 6Ts at the beginning of the sequence)
 will be trimmed. This value can be increased to avoid false positives. | 
| -q <minqv> [-w <winsize>] [-t <maxtrim>] | this option activates "quality trimming" at the
 3' end of reads (which is disabled by default); a sliding window scans 
the quality values from the 5' to the 3' end and trims the 3' end of the
 read when the average quality value drops below <minqv> (which is
 a numeric value between 2 and some max quality value, so this 
does not depend on whether the input represents quality values in 
Phred-33 or Phred-64 format). The sliding window size can be controlled by the -w option (default: 6), while the -t option can limit the extent of the trimming triggered by this option (that is, no more than <maxtrim> bases will be trimmed off the 3' end even though the quality values may go below <minqv> beyond that position in the read). | 
| -m <maxpercN> | maximum percentage of Ns (undetermined bases) 
allowed in a read after trimming (default 5); by default fqtrim trims 
the end of the reads if they have Ns at that end, and if after this 
automatic N-based trimming the percent of Ns in the read is above this 
value, the read is discarded (trashed) | 
| -n <prefix> | rename the reads using the <prefix> followed by a read counter; if -C option was also provided, the suffix "_x<N>" is appended (where <N> is the read duplication count) | 
| -r <report.txt> | write a "trimming report" file listing the affected reads with a list
    of trimming operations and a "trash code" if the read was discarded. This report has 3 columns: the 1st column is the read name, the 2nd is a comma-delimited list of trimming operations and the 3rd contains a one-letter "trash code" if the read did not pass the fqtrim processing (e.g. 's' means too short; other letter codes
    match the last trim operation which caused the read to be rejected, i.e. to become shorter 
    than the minimum required length). The trim operations are encoded as such: 
 | 
| -s1 or -s2 | for paired reads, either -s1 or -s2 can be used to disable processing of a specific
    read in each pair (read1 or read2), while still discarding the whole pair if the other read does not pass the trimming process. This option is meant for single cell data when one read in a pair is just a barcode read which shouldn't be trimmed. | 
| -T | write the number of bases trimmed at 5' and 3' ends after the read names
    in the header of each FASTA/FASTQ output record | 
| -D | apply a low-complexity (dust) filter and discard any read that has over  50% of its length detected as low complexity | 
| -C | collapse duplicate reads and append a _x<N> count suffix to the read name (where <N> is the 
multiplicity count for the read). This option keeps the read sequences in
 memory so it should only be used for smaller data sets, like micro-RNA 
experiments | 
| -p <numcpus> | use <numcpus> CPUs (threads) on the local
 machine to speed up the read processing for large datasets. This is 
especially useful when (multiple) adapters are provided. Note that this 
option is currently incompatible with the -C option, which does not 
support multi-threading. | 
| -Q | convert quality values to the other Phred
 quality value representation; fqtrim usually autodetects the range of 
quality values (Phred-33 or Phred-64) and this option converts the output from one range to the other. | 
| -M | disable name consistency checking for paired 
reads; normally fqtrim checks the insert names for paired-end reads, but
 some data sets may not follow the expected naming convention for the 
reads. | 
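The adapter file format described for the -f option can be easier to follow as code. The Python sketch below is only an illustration of the rules stated in that table row, not fqtrim's actual parser; the function names parse_adapter_line and parse_adapter_lines are hypothetical. It treats the first delimiter on a line as the column separator, and a line with no delimiter at all as an adapter to be searched at both ends.

```python
import re

# Column delimiters listed for the -f adapter file: tab, space, comma, colon, semicolon.
DELIM_CHARS = "\t ,;:"
_DELIM = re.compile(r"[\t ,;:]")


def parse_adapter_line(line):
    """Return (five_prime, three_prime) adapter lists for one file line."""
    line = line.rstrip("\r\n")
    match = _DELIM.search(line)
    if match is None:
        # No delimiter on the line: the sequence is searched at both ends.
        return ([line], [line]) if line else ([], [])
    five = line[:match.start()]
    # Anything after the first delimiter is the 3' column; drop stray delimiters.
    three = line[match.end():].strip(DELIM_CHARS)
    return ([five] if five else [], [three] if three else [])


def parse_adapter_lines(lines):
    """Accumulate 5' and 3' adapters over all lines of an adapter file."""
    five_all, three_all = [], []
    for line in lines:
        five, three = parse_adapter_line(line)
        five_all.extend(five)
        three_all.extend(three)
    return five_all, three_all


if __name__ == "__main__":
    a5 = "CGACAGGTTCAGAGTTCTACAGTCCGACGATC"
    a3 = "TCGTATGCCGTCTTCTGCTTG"
    print(parse_adapter_lines([a5 + "," + a3]))
    print(parse_adapter_lines([a5 + ",", " " + a3]))
    print(parse_adapter_lines([a5, "," + a3]))
```

Under these assumptions the first two calls give the same result (one 5' adapter and one 3' adapter), while in the third the first sequence appears in both lists because its line carries no delimiter.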
Common usage example
Cleaning up noisy exome data (paired reads) with Ns in the read sequence, allowing a minimum length of 25 bases for trimmed reads and maintaining the pairing of the reads:
fqtrim -A -l25 -o trimmed.fq.gz exome_reads_1.fastq.gz,exome_reads_2.fastq.gz
Note that for non-transcriptomic reads the -A option is advised. In this example, the output of fqtrim will be 
 written to two compressed files with the suffix ".trimmed.fq.gz".
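To make the effect of the N handling in this example more concrete, here is a minimal Python sketch of how terminal N trimming, the -m percentage threshold and the -l minimum length could combine for a single read. It is an illustration under stated assumptions, not fqtrim's implementation: the function name filter_read_for_n is hypothetical, Ns are assumed to be stripped from both ends, and the checks are applied in the order shown, which may differ from what fqtrim does internally.

```python
def filter_read_for_n(seq, max_perc_n=5.0, min_len=16):
    """Trim terminal Ns, then return the trimmed read or None if it is discarded.

    Defaults mirror the documented defaults of -m (5) and -l (16).
    """
    # Assumption: Ns are removed from both the 5' and the 3' end.
    trimmed = seq.strip("Nn")
    if len(trimmed) < min_len:
        return None  # too short after trimming
    n_count = sum(1 for base in trimmed if base in "Nn")
    perc_n = 100.0 * n_count / len(trimmed)
    if perc_n > max_perc_n:
        return None  # too many internal Ns remain
    return trimmed


if __name__ == "__main__":
    clean = "NN" + "ACGT" * 7 + "N"            # terminal Ns only
    noisy = "ACGT" * 3 + "NN" + "ACGT" * 3     # 2 internal Ns in 26 bases
    print(filter_read_for_n(clean, min_len=25))
    print(filter_read_for_n(noisy, min_len=25))
```

With -l25 as in the command above, the first read survives as 28 bases once its terminal Ns are removed, while the second is dropped because about 7.7% of its bases are Ns, which exceeds the default limit of 5.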
 

