CCB » Software » Gffcompare




The GffCompare utility

The program gffcompare can be used to compare, merge, annotate and estimate accuracy of one or more GFF files (the “query” files), when compared with a reference annotation (also provided as GFF).

This program is based on the CuffCompare utility which is part of the Cufflinks/Tuxedo suite, so the various usage options and output files as documented in the CuffCompare manual apply to the gffcompare program as well.

From the command line, run gffcompare like so:

gffcompare [options]* {-i <input_gtf_list> | <input1.gtf> [<input2.gtf> .. <inputN.gtf>]}

Options
-iprovide a text file with a list of (query) GTF files to process instead of expecting them as command-line arguments (useful when a large number of GTF files should be processed).
-h/-helpPrints the help message and exits
-v/--versionPrints the version number and exits
-o <outprefix>All output files created by gffcompare will have this prefix (e.g. .loci, .tracking, etc.). If this option is not provided the default output prefix being used is: “gffcmp”
-rAn optional “reference” annotation GFF file. Each sample is matched against this file, and sample isoforms are tagged as overlapping, matching, or novel where appropriate. See the refmap and tmap output file descriptions below.
-RIf -r was specified, this option causes gffcompare to ignore reference transcripts that are not overlapped by any transcript in one of input1.gtf,…,inputN.gtf. Useful for ignoring annotated transcripts that are not present in your RNA-Seq samples and thus adjusting the “sensitivity” calculation in the accuracy report written in the file.
-QIf -r was specified, this option causes gffcompare to ignore input transcripts that are not overlapped by any transcript in the reference. Useful for adjusting the “precision” calculation in the accuracy report written in the file.
-Mgffcompare will discard (ignore) single-exon transfrags and reference transcripts. This option encompasses the -N option below.
-Ngffcompare will discard (ignore) discard (ignore) single-exon reference transcripts.
-s <seq_dir>Causes gffcompare to look into for fasta files with the underlying genomic sequences (one file per contig) against which your reads were aligned for some optional classification functions. For example, transcripts consisting mostly of lower-case bases are classified as repeats. Note that must contain one fasta file per reference chromosome, and each file must be named after the chromosome, and have a .fa or .fasta extension.
-e <dist>Maximum distance (range) allowed from free ends of terminal exons of reference transcripts when assessing exon accuracy. By default, this is 100.
-d <dist>Maximum distance (range) for grouping transcript start sites, by default 100.
-p <name_prefix>The name prefix to use for consensus transcripts in the <outprefix>.combined.gtf file (default: 'TCONS')
-CDiscard the “contained” transfrags in the .combined.gtf. By default, without this option, gffcompare writes in that file isoforms that were found to be fully contained/covered (with the same compatible intron structure) by other transfrags in the same locus, with the attribute “contained_in” showing the first container transfrag found. Note that this behavior is flipped versus Cuffcompare.
-ALike -C but will not discard intron-redundant transfrags if they start on a different 5' exon (keep alternate transcript start sites)
-XLike -C but also discard contained transfrags if transfrag ends stick out within the container's introns
-KFor -C/-A/-X, do NOT discard any redundant transfrag matching a reference
-TDo not generate .tmap and .refmap files for each input file
-VGffcompare is a little more verbose about what it's doing, printing messages to stderr, and it will also show warning messages about any inconsistencies or potential issues found while reading the given GFF file(s).
-DEnables debug mode, which enables -V and generates additional files: <outprefix>.Qdiscarded.lst and <outprefix>.missed_introns.gtf

Gffcompare output files

Gffcompare produces the following output files:

Overall summary statistics: <outprefix>.stats

In this output file Gffcompare reports various statistics related to the “accuracy” (or a measure of agreement) of the input transcripts when compared to reference annotation data (provided with the -r option). These accuracy measures are calculated under the assumption that the input GFF/GTF file(s) (the "query" transcripts, or transfrags, from one or multiple "samples") are coming from some transcript discovery/assembly pipeline (e.g. Cufflinks or StringTie), or from any other gene/transcript prediction pipeline. GffCompare can be used to assess the accuracy of such pipelines, when comparing their results to a known reference annotation (provided with the -r option). The measures of “sensitivity” and “precision” are calculated at various levels (nucleotide, exon, intron, transcript, gene) for each input file and reported in this .stats file.
These generic accuracy metrics are calculated as described in Burset, M., Guigó, R. : Evaluation of gene structure prediction programs (1996) Genomics, 34 (3), pp. 353-367. doi: 10.1006/geno.1996.0298 (please note that in that paper the authors used the term “Specificity” for what we currently call “Precision”). The following general formulae are used for calculation of Sensitivity and Precision at various levels (base, exon, intron chain, transcript, gene):

Sensitivity = TP / (TP+FN)

Precision = TP / (TP+FP)

Here TP means "true positives", which in this case are the query features (bases, exons, introns, transcripts, etc.) which agree with the corresponding reference annotation features. FN means "false negatives", i.e. features which are found in the reference annotation but missed (not present) in the "query" data. Accordingly, FP are features present in the input "query" data but not confirmed by any reference annnotation data. Notice that FP+TP amounts to the whole input set of "query" features (per sample/input file). If multiple "query" GTF/GFF files are given as input, these metrics are computed separately for each sample.

The “union” of all transfrags in all assemblies: <outprefix>.combined.gtf

Gffcompare reports a GTF file containing the “union” of all transfrags in each sample. If a transfrag is present in both samples, it is thus reported once in the combined gtf.

An annotation of a single file: <outprefix>.annotated.gtf

If a single GTF/GFF file is provided as input, gffcompare outputs this file instead. Its format is similar to the above, but preserves transcript IDs.

Transfrags matching to each reference transcript: <gff_in>.refmap

This tab-delimited file lists, for each reference transcript, which query transcripts either fully or partially match it. There is one row per reference transcript, and the columns are as follows:

Column number Column name Example Description
1 Reference gene name Myog The gene_name attribute of the reference GTF record for this transcript, if present. Otherwise gene_id is used.
2 Reference transcript id uc007crl.1 The transcript_id attribute of the reference GTF record for this transcript
3 Class code c The type of match between the query transcripts in column 4 and the reference transcript. One of either ‘c’ for partial match, or ‘=’ for full match.
4 Matches STRG.223|STRG.223.1,STRG.224|STRG.224.1 A comma separated list of transcripts matching the reference transcript

Best reference transcript for each transfrag: <gff_in>.tmap

This tab delimited file lists the most closely matching reference transcript for each query transcript. There is one row per query transcript, and the columns are as follows:

Column number Column name Example Description
1 Reference gene name Myog The gene_name attribute of the reference GTF record for this transcript, if present. Otherwise gene_id is used.
2 Reference transcript id uc007crl.1 The transcript_id attribute of the reference GTF record for this transcript
3 Class code c The type of relationship between the query transcripts in column 4 and the reference transcript (as described in the Class Codes section below)
4 Query gene id STRG.23567 The query (e.g., Stringtie) internal gene id
5 Query transcript id STRG.23567.0 The query internal transcript id
6 Number of exons 7 The number of exons in the query transcript
7 FPKM 1.4567 The expression of this transcript expressed in FPKM
8 TPM 0.000000 the estimated TPM for the transcript, if found in the query input file
9 Coverage 3.2687 The estimated average depth of read coverage across the transcript.
10 Length 1426 The length of the transcript
11 Major isoform ID STRG.23567.0 The query ID of the gene's major isoform
12 Reference match length 4370 The length of the longest overlap with a reference, ‘-’ if there is no such exonic overlap

Tracking transfrags through multiple samples: <outprefix>.tracking

This file matches transcripts up between samples. Each row contains a transcript structure that is present in one or more input GTF files. Because the transcripts will generally have different IDs (unless you assembled your RNA-Seq reads against a reference transcriptome), gffcompare examines the structure of each the transcripts, matching transcripts that agree on the coordinates and order of all of their introns, as well as strand. Matching transcripts are allowed to differ on the length of the first and last exons, since these lengths will naturally vary from sample to sample due to the random nature of sequencing.

If gffcompare is run with the -r option, the first and second columns contain the closest matching reference transcript to the one described by each row.

Here's an example of a line from the tracking file:

TCONS_00000045 XLOC_000023 Tcea|uc007afj.1	j	\
     q1:exp.115|exp.115.0|100|3.061355|0.350242|0.350207 \
     q2:60hr.292|60hr.292.0|100|4.094084|0.000000|0.000000

In this example, a transcript present in the two input files, called exp.115.0 in the first and 60hr.292.0 in the second, doesn't match any reference transcript exactly, but shares exons with uc007afj.1, an isoform of the gene Tcea, as indicated by the class code j. The first three columns are as follows:

Column number Column name Example Description
1 Query transfrag id TCONS_00000045 A unique internal id for the transfrag
2 Query locus id XLOC_000023 A unique internal id for the locus
3 Reference gene id Tcea The gene_name attribute of the reference GTF record for this transcript, or ‘-’ if no reference transcript overlaps this query transcript
4 Reference transcript id uc007afj.1 The transcript_id attribute of the reference GTF record for this transcript, or ‘-’ if no reference transcript overlaps this query transcript
5 Class code c The type of match between the query transcripts in column 6 and the reference transcript. See class codes

Each of the columns after the fifth have the following format:

qJ:<gene_id>|<transcript_id>|<num_exons>|<FPKM>|<TPM>|<cov>|<len>

A transcript need not be present in all samples to be reported in the tracking file. A sample not containing a transcript will have a “-” in its entry in the row for that transcript.

(The following output files are created for each of the <gff_in> file given, in the same directories where the <gff_in> files reside)

Transcript classification codes

If gffcompare was run with the -r option (i.e. comparing with a reference annotation), tracking rows will contain a "class code" value showing the relationship between a transfrag and the closest reference transcript (where applicable). If the -r option was not used the rows will all contain “-” in their class code column. The same codes are also shown as the value of the attribute "class_code" in the output GTF file. The class codes are shown below in decreasing order of their priority.



Example: evaluating transcript discovery accuracy

Gffcompare can be used to evaluate and compare the accuracy of transcript assemblers - in terms of their structural correctness (exon/intron coordinates). This assessment can even be performed in case of more generic "transcript discovery" programs like gene finders. The best way to do this would be to use a simulated data set (where the "reference annotation" is also the set of the expressed transcripts being simulated), but for well annotated reference genomes (human, mouse etc.), gffcompare can be used to evaluate and compare the general accuracy of isoform discovery programs on a real data set, using just the known (reference) annotation of that genome.

As a practical example, let's assume we ran both Cufflinks and StringTie on a mouse RNA-Seq sample and we want to compare the overall accuracy of the two programs. In order to compare the baseline de novo transcript assembly accuracy, both Cufflinks and StringTie should be run without using any reference annotation data (i.e. no -G or -g options were used).

Assuming that Cufflinks' transcript assembly output file name is cufflinks_asm.gtf and StringTie's output is in stringtie_asm.gtf, while the reference annotation would be in a file called mm10.gff, the gffcompare commands would be:

 gffcompare -R -r mm10.gff -o cuffcmp cufflinks_asm.gtf
 gffcompare -R -r mm10.gff -o strtcmp stringtie_asm.gtf
 

The -R option is used here in order to adjust the sensitivity calculation as to only consider the "expressed" genes, which are those reference genes for which gffcompare found at least one overlapping transfrag in the given assembly data (*_asm.gtf file). (Of course this option would not be needed in the case of simulated RNA-Seq experiments where the reference transcripts would be all "expressed"). Multiple output files will be generated by gffcompare, with the given prefix - in the example above, for the stringtie assemblies, the output files will be:

   strtcmp.combined.gtf
   strtcmp.loci
   strtcmp.stats
   strtcmp.tracking
   strtcmp.stringtie_asm.gtf.refmap
   strtcmp.stringtie_asm.gtf.tmap
Comparing the transcript assembly accuracy of the two programs is done by looking at the Sensitivity and Precision values in the *.stats output files for each program (strtcmp.stats vs cuffcmp.stats in the example above). That output looks like this (partially):
#= Summary for dataset: stringtie_asm.gtf
#     Query mRNAs :   23555 in   17628 loci  (17231 multi-exon transcripts)
#            (3731 multi-transcript loci, ~1.3 transcripts per locus)
# Reference mRNAs :   16628 in   12062 loci  (15850 multi-exon)
# Super-loci w/ reference transcripts:    11552
#-----------------| Sensitivity | Precision  |
        Base level:    82.4     |    76.5    |
        Exon level:    81.2     |    82.9    |
      Intron level:    86.1     |    94.8    |
Intron chain level:    56.9     |    52.4    |
  Transcript level:    55.2     |    38.9    |
       Locus level:    70.1     |    48.0    |
Obtaining gffcompare