CCB » Software » GffCompare

The GffCompare utility

The program gffcompare can be used to compare, merge, annotate and estimate accuracy of one or more GFF files (the “query” files), when compared with a reference annotation (also provided as GFF).


gffcompare [options]* {-i <input_gtf_list> | <input1.gtf> [<input2.gtf> .. <inputN.gtf>]}

-iprovide a text file with a list of (query) GTF files to process instead of expecting them as command-line arguments (useful when a large number of GTF files should be processed).
-h/-helpPrints the help message and exits
-v/--versionPrints the version number and exits
-o <outprefix>All output files created by gffcompare will have this prefix (e.g. .loci, .tracking, etc.). If this option is not provided the default output prefix being used is: “gffcmp”
-rAn optional “reference” annotation GFF file. Each sample is matched against this file, and sample isoforms are tagged as overlapping, matching, or novel where appropriate. See the refmap and tmap output file descriptions below.
-RIf -r was specified, this option causes gffcompare to ignore reference transcripts that are not overlapped by any transcript in one of input1.gtf,…,inputN.gtf. Useful for ignoring annotated transcripts that are not present in your RNA-Seq samples and thus adjusting the “sensitivity” calculation in the accuracy report written in the file.
-QIf -r was specified, this option causes gffcompare to ignore input transcripts that are not overlapped by any transcript in the reference. Useful for adjusting the “precision” calculation in the accuracy report written in the file.
-Mdiscard (ignore) single-exon transfrags and reference transcripts (i.e. consider only multi-exon transcripts)
-Ndiscard (ignore) single-exon reference transcripts; single-exon transfrags are still considered, but they will never find an exact match
-Ddiscard "duplicate" (redundant) query transfrags (i.e. those with the same intron chain) within a single sample (and thus disable "annotation" mode)
-s <genome_path>path to genome sequences (optional); this will enable the "repeat" ('r') classcode assessment; <genome_path> should be a full path to a multi-FASTA file, preferrably indexed with samtools faidx; repeats must be soft-masked (lower case) in the genomic sequence
-e <dist>Maximum distance (range) allowed from free ends of terminal exons of reference transcripts when assessing exon accuracy. By default, this is 100.
-d <dist>Maximum distance (range) for grouping transcript start sites, by default 100.
-p <tprefix>The name prefix to use for consensus/combined transcripts in the <outprefix>.combined.gtf file (default: 'TCONS')
-CDiscard the “contained” transfrags from the .combined.gtf output. By default, without this option, gffcompare writes in that file isoforms that were found to be fully contained/covered (with the same compatible intron structure) by other transfrags in the same locus, with the attribute “contained_in” showing the first container transfrag found. (Note: this behavior is the opposite of Cuffcompare's -C option).
-ALike -C but will not discard intron-redundant transfrags if they start on a different 5' exon (keep alternate transcript start sites)
-XLike -C but also discard contained transfrags if transfrag ends stick out within the container's introns
-KFor -C/-A/-X, do NOT discard any redundant transfrag matching a reference
-TDo not generate .tmap and .refmap files for each input file
-VGffcompare is a little more verbose about what it's doing, printing messages to stderr, and it will also show warning messages about any inconsistencies or potential issues found while reading the given GFF file(s).
--debugEnables debug mode, which enables -V and generates additional files: <outprefix>.Q_discarded.lst, <outprefix>.missed_introns.gtf and <outprefix>.R_missed.lst

Gffcompare output files

Gffcompare produces the following output files:

Data summary and accuracy estimation: <outprefix>.stats

In this output file Gffcompare reports various statistics related to the “accuracy” (or a measure of agreement) of the input transcripts when compared to reference annotation data (provided with the -r option). These accuracy measures are calculated under the assumption that the input GFF/GTF file(s) (the "query" transcripts, or transfrags, from one or multiple "samples") are coming from some transcript discovery/assembly pipeline (e.g. Cufflinks or StringTie), or from any other gene/transcript prediction pipeline. GffCompare can be used to assess the accuracy of such pipelines, when comparing their results to a known reference annotation (provided with the -r option). The measures of “sensitivity” and “precision” are calculated at various levels (nucleotide, exon, intron, transcript, gene) for each input file and reported in this .stats file.
These generic accuracy metrics are calculated as described in Burset, M., Guigó, R. : Evaluation of gene structure prediction programs (1996) Genomics, 34 (3), pp. 353-367. doi: 10.1006/geno.1996.0298 (please note that in that paper the authors used the term “Specificity” for what we currently call “Precision”). The following general formulae are used for calculation of Sensitivity and Precision at various levels (base, exon, intron chain, transcript, gene):

Sensitivity = TP / (TP+FN)

Precision = TP / (TP+FP)

Here TP means "true positives", which in this case are the query features (bases, exons, introns, transcripts, etc.) which agree with the corresponding reference annotation features. FN means "false negatives", i.e. features which are found in the reference annotation but missed (not present) in the "query" data. Accordingly, FP are features present in the input "query" data but not confirmed by any reference annnotation data. Notice that FP+TP amounts to the whole input set of "query" features (per sample/input file). If multiple "query" GTF/GFF files are given as input, these metrics are computed separately for each sample.

Sensitivity and Precision values are estimated at various levels, which are largely an increasingly stringent way of evaluating the accuracy/correctness of a set of predicted (assembled) transcripts (transfrags), when compared to the reference annotation provided with the -r option. The levels are as follows:

Base level
Exon level
This works similarly to the Base level TP/FN/FP concept described above, but now the unit of comparison ("matching") is the exon interval on the genome, i.e. if an exon of the predicted transcript overlaps and matches the boundaries of a reference transcript exon, then it is counted as TP.
Intron Level
Similar to the Exon level, only now the units being "matched" and counted are the intron intervals instead of exon intervals: each intron of the predicted transcript is checked against any introns of the reference transcripts in the area and is there is one with the same exact start-end coordinates, it is counted as a TP.
Intron chain level
Here we count as a TP any predicted (assembled) transcript for which all of its introns can be found, with the same exact intron coordinates in a reference transcript (with the same number of introns). Matching all the introns at this level is the same as matching all the internal exons, but allowing for the possibility that the external boundaries of the terminal exons might not match at all. However it is quite realistic to consider transcripts "matching" when their intron chains are identical (and GffCompare does annotate "matching" transcripts with the class code '=' when intron chains are the same), considering that: It's important to note that Sensitivity and Precision values at this "Intron chain level" are calculated only by looking at multi-exon transcripts (for obvious reasons), so it completely ignores the single-exon transcripts, which are quite numerous in a RNA-Seq experiment (possibly due to a lot of transcriptional and alignment noise).
Transcript level
This is a more stringent version of the "Intron chain level" analysis above, as it's actually implemented as a "full exon chain" matching between the predicted transcript and any reference transcript in order to count it as TP, while still allowing a limited variation of the outer coordinates of the terminal exons, as the outer boundaries are still allowed to differ but in this case by at most 100 bases (this is the default value, but can be changed with the -e option). Another very important difference from the "Intron chain level" accuracy values is that at this level GffCompare also looks at single-exon transcripts and considers them "matching" an overlapping single-exon reference transcript when there is a very significant overlap (while still allowing their boundaries to not exactly match).
Locus level
At this level gffcompare considers that an observed locus (cluster of exon-overlapping transcripts) "matches" a similarly built reference locus (i.e. by clustering transcripts by exon overlaps), and thus counts as a TP at this level, if at least one predicted transcript has a "Transcript level match" with a reference transcript in the corresponding (overlapping) reference locus.

The "super-locus" concept

The XLOC_ prefix is assigned to super-locus identifiers in many of the GffCompare output files. A super-locus is a region of the genome where predicted transcripts and reference transcripts get clustered (grouped or linked) together by exon overlaps. When multiple samples (multiple GTF files with assembled transfrags) are provided as input to GffCompare, this clustering is performed across all the samples. Due to the transitive nature of this clustering by exon overlaps, these super-loci can occasionally get very large, sometimes merging a few distinct reference gene regions together -- especially if there is a lot of transcription or alignment noise around the individual gene regions.
Not all super-loci have reference transcripts assigned, sometimes they are just a bunch of transfrags clumped together, possibly from multiple samples, all linked together by exon overlaps, with no reference transcripts present in that region.

The “union” of all transfrags in all assemblies: <outprefix>.combined.gtf

When multiple input (query) GTF/GFF files are provided, Gffcompare reports a GTF file containing the “union” of all transfrags in each sample. If a transfrag with the same exact intron chain is present in both samples, it is thus reported only once in the combined.gtf output

An annotation of a single file: <outprefix>.annotated.gtf

If a single GTF/GFF query file is provided as input and no specific options to remove "duplicated"/redundant transfrags were given (-D, -S, -C, -A, -X), GffCompare outputs a file called <outprefix>.annotated.gtf instead of <outprefix>.combined.gtf Its format is similar to the above, but preserves transcript IDs (so the -p option is ignored)

Transfrags matching to each reference transcript: <gff_in>.refmap

This tab-delimited file lists, for each reference transcript, which query transcripts either fully or partially match it. There is one row per reference transcript, and the columns are as follows:

Column number Column name Example Description
1 Reference gene name Myog The gene_name attribute of the reference GTF record for this transcript, if present. Otherwise gene_id is used.
2 Reference transcript id uc007crl.1 The transcript_id attribute of the reference GTF record for this transcript
3 Class code c The type of match between the query transcripts in column 4 and the reference transcript. One of either ‘c’ for partial match, or ‘=’ for full match.
4 Matches STRG.223|STRG.223.1,STRG.224|STRG.224.1 A comma separated list of transcripts matching the reference transcript

Best reference transcript for each transfrag: <gff_in>.tmap

This tab delimited file lists the most closely matching reference transcript for each query transcript. There is one row per query transcript, and the columns are as follows:

Column number Column name Example Description
1 Reference gene name Myog The gene_name attribute of the reference GTF record for this transcript, if present. Otherwise gene_id is used.
2 Reference transcript id uc007crl.1 The transcript_id attribute of the reference GTF record for this transcript
3 Class code c The type of relationship between the query transcripts in column 4 and the reference transcript (as described in the Class Codes section below)
4 Query gene id STRG.23567 The query (e.g., Stringtie) internal gene id
5 Query transcript id STRG.23567.0 The query internal transcript id
6 Number of exons 7 The number of exons in the query transcript
7 FPKM 1.4567 The expression of this transcript expressed in FPKM
8 TPM 0.000000 the estimated TPM for the transcript, if found in the query input file
9 Coverage 3.2687 The estimated average depth of read coverage across the transcript.
10 Length 1426 The length of the transcript
11 Major isoform ID STRG.23567.0 The query ID of the gene's major isoform
12 Reference match length 4370 The length of the longest overlap with a reference, ‘-’ if there is no such exonic overlap

Tracking transfrags through multiple samples: <outprefix>.tracking

This file matches transcripts up between samples. Each row contains a transcript structure that is present in one or more input GTF files. Because the transcripts will generally have different IDs (unless you assembled your RNA-Seq reads against a reference transcriptome), gffcompare examines the structure of each the transcripts, matching transcripts that agree on the coordinates and order of all of their introns, as well as strand. Matching transcripts are allowed to differ on the length of the first and last exons, since these lengths will naturally vary from sample to sample due to the random nature of sequencing.

If gffcompare is run with the -r option, the first and second columns contain the closest matching reference transcript to the one described by each row.

Here's an example of a line from the tracking file:

TCONS_00000045 XLOC_000023 Tcea|uc007afj.1	j	\
     q1:exp.115|exp.115.0|100|3.061355|0.350242|0.350207 \

In this example, a transcript present in the two input files, called exp.115.0 in the first and 60hr.292.0 in the second, doesn't match any reference transcript exactly, but shares exons with uc007afj.1, an isoform of the gene Tcea, as indicated by the class code j. The first three columns are as follows:

Column number Column name Example Description
1 Query transfrag id TCONS_00000045 A unique internal id for the transfrag
2 Query locus id XLOC_000023 A unique internal id for the locus
3 Reference gene id Tcea The gene_name attribute of the reference GTF record for this transcript, or ‘-’ if no reference transcript overlaps this query transcript
4 Reference transcript id uc007afj.1 The transcript_id attribute of the reference GTF record for this transcript, or ‘-’ if no reference transcript overlaps this query transcript
5 Class code c The type of match between the query transcripts in column 6 and the reference transcript. See class codes

Each of the columns after the fifth have the following format:


A transcript need not be present in all samples to be reported in the tracking file. A sample not containing a transcript will have a “-” in its entry in the row for that transcript.

(The following output files are created for each of the <gff_in> file given, in the same directories where the <gff_in> files reside)

Transcript classification codes

If GffCompare was run with the -r option (i.e. comparing with a reference annotation), tracking rows will contain a "class code" value showing the relationship between a transfrag and the closest reference transcript (where applicable). If the -r option was not used the rows will all contain “-” in their class code column. The same codes are also shown as the value of the attribute "class_code" in the output GTF file. The class codes are shown below in decreasing order of their priority.

Example: evaluating transcript discovery accuracy

Gffcompare can be used to evaluate and compare the accuracy of transcript assemblers - in terms of their structural correctness (exon/intron coordinates). This assessment can even be performed in case of more generic "transcript discovery" programs like gene finders. The best way to do this would be to use a simulated data set (where the "reference annotation" is also the set of the expressed transcripts being simulated), but for well annotated reference genomes (human, mouse etc.), gffcompare can be used to evaluate and compare the general accuracy of isoform discovery programs on a real data set, using just the known (reference) annotation of that genome.

As a practical example, let's assume we ran both Cufflinks and StringTie on a mouse RNA-Seq sample and we want to compare the overall accuracy of the two programs. In order to compare the baseline de novo transcript assembly accuracy, both Cufflinks and StringTie should be run without using any reference annotation data (i.e. no -G or -g options were used).

Assuming that Cufflinks' transcript assembly output file name is cufflinks_asm.gtf and StringTie's output is in stringtie_asm.gtf, while the reference annotation would be in a file called mm10.gff, the gffcompare commands would be:

 gffcompare -R -r mm10.gff -o cuffcmp cufflinks_asm.gtf
 gffcompare -R -r mm10.gff -o strtcmp stringtie_asm.gtf

The -R option is used here in order to adjust the sensitivity calculation as to only consider the "expressed" genes, which are those reference genes for which gffcompare found at least one overlapping transfrag in the given assembly data (*_asm.gtf file). (Of course this option would not be needed in the case of simulated RNA-Seq experiments where the reference transcripts would be all "expressed"). Multiple output files will be generated by gffcompare, with the given prefix - in the example above, for the stringtie assemblies, the output files will be:

Comparing the transcript assembly accuracy of the two programs is done by looking at the Sensitivity and Precision values in the *.stats output files for each program (strtcmp.stats vs cuffcmp.stats in the example above). That output looks like this (partially):
#= Summary for dataset: stringtie_asm.gtf
#     Query mRNAs :   23555 in   17628 loci  (17231 multi-exon transcripts)
#            (3731 multi-transcript loci, ~1.3 transcripts per locus)
# Reference mRNAs :   16628 in   12062 loci  (15850 multi-exon)
# Super-loci w/ reference transcripts:    11552
#-----------------| Sensitivity | Precision  |
        Base level:    82.4     |    76.5    |
        Exon level:    81.2     |    82.9    |
      Intron level:    86.1     |    94.8    |
Intron chain level:    56.9     |    52.4    |
  Transcript level:    55.2     |    38.9    |
       Locus level:    70.1     |    48.0    |

Overlap classification for a large set of transcripts

Some pipelines can produce a very large number of potential or partial transcripts ("transfrags"), for example when merging the transcript assemblies from tens or hundreds of RNA-Seq experiments assemblies with stringtie --merge. Running GffCompare on such large GTF/GFF files could be slow and memory intensive (because GffCompare always loads the whole transcript data in memory for clustering etc.). One may only be interested to know if and how these many transcripts overlap the reference annotation, and then further analyze only those which have specific types of overlaps with the reference annotation transcripts (or none at all, i.e. if they do not overlap any of it, which may be the case for putative novel transcripts).

That's where the trmap program comes in. trmap reports for each query transcript all the reference that were found to overlap it, along with their overlap classification codes as described in the GffCompare documentation above. The main feature of trmap is that it allows streaming of a very large file with query transcripts (in GFF or BED format) to be checked and classified against a reference annotation (again, in GFF or BED format).

trmap ("transcript vs. refererence mapping") first loads the reference annotation file in memory as an interval tree and then streams the query file (which can also be provided at stdin) while checking and reporting any overlaps found, and classifies the relationship with reference transcripts using a (subset of) the "class codes" like those assigned by GffCompare.
trmap is distributed with GffCompare and it shares the same overlap classification code.


  trmap [-S] [-o ] <ref_gff> <query_gff>
Positional arguments:
  <ref_gff>    reference annotation file name (GFF/BED format)
  <query_gff>  query file name (GFF/BED format) or "-" for stdin
  -o <outfile> write output to <outfile> instead of stdout
  -S           report only simple reference overlap percentages, without
               classification (one line per query)
The default output is a pseudo-FASTA format showing a record for each query transcript that had at least one reference overlap/relationship. Example:

The query transcript (MSTRG.36.1 in this example) is shown in the header of the record, with space delimited fields showing the genomic location and strand. Each reference overlap follows, as a line with tab delimited fields, starting with the "classification code" for the overlap and then providing the genomic location of the transcript (chromosome, strand, transcript-start, transcript-end, reference_transcriptID, exons).
The exons for both query and reference transcripts are shown as comma delimited lists of intervals.
These are all 1-based coordinates like in the GTF/GFF format (even when input is BED).

Obtaining GffCompare