CCB » Software » GFF utilities




GFF files

Many bioinformatics programs represent genes and transcripts in GFF format (General Feature Format) which simply describes the locations and the attributes of gene and transcript features on the genome (chromosome or scaffolds/contigs).
GFF has many versions, but the two most popular that are GTF2 (Gene Transfer Format, described here) and GFF3 (defined here). Here are a few details about the way these formats are interpreted by StringTie and other bioinformatics programs.

GTF files

As seen in the GTF2 specification, the transcript_id attribute is required, so our GFF parser also expects it, while a gene_id attribute, though not strictly required by our GFF parser, is very useful for grouping alternative transcripts under a gene/locus identifier. An optional gene_name attribute, if found, will be taken and shown as  a symbolic gene name or short-form abbreviation (e.g. gene symbols from HGNC or Entrez Gene). Some annotation sources (e.g. Ensembl) place a "human readable" gene name/symbol in the gene_name attribute, like a HUGO symbol (while gene_id might be just an automatically generated numeric identifier for the gene).

Most bioinformatics programs generally expect at least the exon features which are enough to define the overall transcript structure, with optional CDS features to specify the coding segments. Our GFF reader will ignore redundant features like start_codon, stop_codon when whole CDS features were provided, and similarly UTR features are ignored when whole exon features were given. However, it is still possible to provide only CDS and *UTR features and our GFF parser will reassemble the exonic structure accordingly (internally converting these segments to exon segments).

Example of a GTF2 transcript record with minimal attributes:


The GTF output of programs like StringTie and Cufflinks also have an additional transcript feature line acting as a parent feature for the exon and CDS features which define the transcript structure and have the same transcript_id attribute. This is not required by the GTF2 specification (and makes such files more similar to GFF3, which always have a "parent" feature, see below) but may be useful for holding the global attributes of each transcript (instead of having them duplicated for each exon feature).

GFF3 files
As defined by the GFF3 specification, the parent features (usually transcripts, i.e. "mRNA" features) are required to have an ID attribute, but here again an optional gene_name attribute can be used to specify a common gene name abbreviation. If gene_name is not given, it can be also inferred from the Name or ID attributes of the parent gene feature of the current parent mRNA feature (if given in the input file). CDS and exon features are required to have a Parent attribute whose value must match the value of the ID attribute of a parent transcript feature (usually a "mRNA" feature).

Example of a GFF3 transcript record with minimal attributes:

Feature restrictions
For various reasons our GFF parser currently assumes the following limits (maximum values) for the genomic length (span) of gene and transcript features:

Also, transcript IDs are expected to be unique per GFF input file (though we relaxed this restriction by limiting it to a chromosome/contig scope).

The gffread utility

The program gffread can be used to validate, filter, convert and perform various other operations on GFF files (use gffread -h to see the various usage options). Because the program shares the same GFF parser code with Cufflinks, Stringtie, and gffcompare, it could be used to verify that a GFF file from a certain annotation source is correctly "understood" by these programs. Thus the gffread utility can be used to simply read the transcripts from the file, and optionally print these transcripts back, in either GFF3 (default) or GTF2 format (with the -T option), while discarding any non-essential attributes, optionally fixing some potential issues with the input file(s). The command line for such a quick cleanup and a quick visual inspection of a given GFF file could be:

gffread -E annotation.gtf -o- | less 

This will show the minimalist GFF3 re-formatting of the transcript records found in the input file (annotation.gtf in this example) which could be given in either GFF3 or GTF2 format. The -E option directs gffread to "expose" (display warnings about) any potential issues encountered while parsing the input file.

In order to see the GTF2 version of the same transcripts, the -T option should be added:

gffread -E annotation.gff -T -o- | more

The examples above also show that gffread can be used to convert a file between GTF2 and GFF3 file formats.

Extracting transcript sequences

gffread can also be used to generate a FASTA file with the DNA sequences for all transcripts in a GFF file. For this operation a fasta file with the genomic sequences have to be provided as well. For example, one might want to extract the sequence of all transfrags assembled from a Cufflinks assembly session. This can be accomplished with a command line like this:

gffread -w transcripts.fa -g /path/to/genome.fa transcripts.gtf

The file genome.fa in this example would be a multi-fasta file with the genomic sequences of the target genome. This also requires that every contig or chromosome name found in the 1st column of the input GFF file (transcript.gtf in this example) must have a corresponding sequence entry in chromosomes.fa. This should be the case in our example if genome.fa is the file corresponding to the same genome (index) that was used for mapping the reads with Tophat.

Note that the retrieval of the transcript sequences this way is going to be much faster if a fasta index file (genome.fa.fai in this example) is found in the same directory with the genomic fasta file. Such an index file can be created with the samtools utility prior to running gffread, like this:

samtools faidx genome.fa

Then in subsequent runs using the -g option gffread will find this FASTA index and use it to speed up the extraction of transcript sequences.

Obtaining gffread

The gffcompare utility

The program gffcompare can be used to compare, merge, annotate and estimate accuracy of one or more GFF files (the "query" files), when compared with a reference annotation (also provided as GFF).

This program is based on the CuffCompare utility which is part of the Cufflinks/Tuxedo suite, so the various usage options and output files as documented in the CuffCompare manual apply to the gffcompare program as well.

Example: evaluating transcript discovery accuracy

Gffcompare can be used to evaluate and compare the accuracy of transcript assemblers - in terms of their structural correctness (exon/intron coordinates). This assessment can even be performed in case of more generic "transcript discovery" programs like gene finders. The best way to do this would be to use a simulated data set (where the "reference annotation" is also the set of the expressed transcripts being simulated), but for well annotated reference genomes (human, mouse etc.), gffcompare can be used to evaluate and compare the general accuracy of isoform discovery programs on a real data set, using just the known (reference) annotation of that genome.

As a practical example, let's assume we ran both Cufflinks and StringTie on a mouse RNA-Seq sample and we want to compare the overall accuracy of the two programs. In order to compare the baseline de novo transcript assembly accuracy, both Cufflinks and StringTie should be run without using any reference annotation data (i.e. no -G or -g options were used).

Assuming that Cufflinks' transcript assembly output file name is cufflinks_asm.gtf and StringTie's output is in stringtie_asm.gtf, while the reference annotation would be in a file called mm10.gff, the gffcompare commands would be:

 gffcompare -R -r mm10.gff -o cuffcmp cufflinks_asm.gtf
 gffcompare -R -r mm10.gff -o strtcmp stringtie_asm.gtf
 

The -R option is used here in order to adjust the sensitivity calculation as to only consider the "expressed" genes, which are those reference genes for which gffcompare found at least one overlapping transfrag in the given assembly data (*_asm.gtf file). (Of course this option would not be needed in the case of simulated RNA-Seq experiments where the reference transcripts would be all "expressed"). Multiple output files will be generated by gffcompare, with the given prefix - in the example above, for the stringtie assemblies, the output files will be:

   strtcmp.combined.gtf
   strtcmp.loci
   strtcmp.stats
   strtcmp.tracking
   strtcmp.stringtie_asm.gtf.refmap
   strtcmp.stringtie_asm.gtf.tmap
Comparing the transcript assembly accuracy of the two programs is done by looking at the Sensitivity and Precision values in the *.stats output files for each program (strtcmp.stats vs cuffcmp.stats in the example above). That output looks like this (partially):
#= Summary for dataset: stringtie_asm.gtf
#     Query mRNAs :   23555 in   17628 loci  (17231 multi-exon transcripts)
#            (3731 multi-transcript loci, ~1.3 transcripts per locus)
# Reference mRNAs :   16628 in   12062 loci  (15850 multi-exon)
# Super-loci w/ reference transcripts:    11552
#-----------------| Sensitivity | Precision  |
        Base level:    82.4     |    76.5    |
        Exon level:    81.2     |    82.9    |
      Intron level:    86.1     |    94.8    |
Intron chain level:    56.9     |    52.4    |
  Transcript level:    55.2     |    38.9    |
       Locus level:    70.1     |    48.0    |
Obtaining gffcompare