Many bioinformatics programs represent genes and transcripts in GFF format (General Feature
Format) which simply describes the locations and the attributes of gene and transcript features on the genome
(chromosome or scaffolds/contigs).
GFF has many versions, but the two most popular that are GTF2 (Gene Transfer Format, described here) and GFF3 (defined here). Here are a few details about the way these formats are interpreted by StringTie and other bioinformatics programs.
As seen in the GTF2
specification, the transcript_id attribute is required, so our GFF parser also expects it, while a gene_id
attribute, though not strictly required by our GFF parser, is very useful for grouping alternative transcripts under
a gene/locus identifier. An optional gene_name attribute, if found, will be taken and shown as
a symbolic gene name or short-form abbreviation (e.g. gene symbols from HGNC or Entrez Gene). Some annotation
sources (e.g. Ensembl) place a "human readable" gene name/symbol in the gene_name
attribute, like a HUGO symbol (while gene_id might be just an automatically generated numeric identifier for the
Most bioinformatics programs generally expect at least the exon features which are enough to define the overall transcript structure, with optional CDS features to specify the coding segments. Our GFF reader will ignore redundant features like start_codon, stop_codon when whole CDS features were provided, and similarly UTR features are ignored when whole exon features were given. However, it is still possible to provide only CDS and *UTR features and our GFF parser will reassemble the exonic structure accordingly (internally converting these segments to exon segments).
Example of a GTF2 transcript record with minimal attributes:
The GTF output of programs like StringTie and Cufflinks also have an additional transcript feature line acting as a parent feature for the exon and CDS features which define the transcript structure and have the same transcript_id attribute. This is not required by the GTF2 specification (and makes such files more similar to GFF3, which always have a "parent" feature, see below) but may be useful for holding the global attributes of each transcript (instead of having them duplicated for each exon feature).
GFF3 filesAs defined by the GFF3 specification, the parent features (usually transcripts, i.e. "mRNA" features) are required to have an ID attribute, but here again an optional gene_name attribute can be used to specify a common gene name abbreviation. If gene_name is not given, it can be also inferred from the Name or ID attributes of the parent gene feature of the current parent mRNA feature (if given in the input file). CDS and exon features are required to have a Parent attribute whose value must match the value of the ID attribute of a parent transcript feature (usually a "mRNA" feature).
Example of a GFF3 transcript record with minimal attributes:
Feature restrictionsFor various reasons our GFF parser currently assumes the following limits (maximum values) for the genomic length (span) of gene and transcript features:
- genes and transcripts should not span more than 7 Megabases on the genomic sequence
- exons should not be longer than 30 Kilobases
- introns should not be larger than 6 Megabases
Also, transcript IDs are expected to be unique per GFF input file (though we relaxed this restriction by limiting it to a chromosome/contig scope).
The gffread utility
The program gffread can be used to validate, filter, convert and perform various other operations on GFF files (use gffread -h to see the various usage options). Because the program shares the same GFF parser code with Cufflinks, Stringtie, and gffcompare, it could be used to verify that a GFF file from a certain annotation source is correctly "understood" by these programs. Thus the gffread utility can be used to simply read the transcripts from the file, and optionally print these transcripts back, in either GFF3 (default) or GTF2 format (with the -T option), while discarding any non-essential attributes, optionally fixing some potential issues with the input file(s). The command line for such a quick cleanup and a quick visual inspection of a given GFF file could be:
gffread -E annotation.gtf -o- | less
This will show the minimalist GFF3 re-formatting of the transcript records found in the input file (annotation.gtf in this example) which could be given in either GFF3 or GTF2 format. The -E option directs gffread to "expose" (display warnings about) any potential issues encountered while parsing the input file.
In order to see the GTF2 version of the same transcripts, the -T option should be added:
gffread -E annotation.gff -T -o- | more
The examples above also show that gffread can be used to convert a file between GTF2 and GFF3 file formats.
Extracting transcript sequences
gffread can also be used to generate a FASTA file with the DNA sequences for all transcripts in a GFF file. For this operation a fasta file with the genomic sequences have to be provided as well. For example, one might want to extract the sequence of all transfrags assembled from a Cufflinks assembly session. This can be accomplished with a command line like this:
gffread -w transcripts.fa -g /path/to/genome.fa transcripts.gtf
The file genome.fa in this example would be a multi-fasta file with the genomic sequences of the target genome. This also requires that every contig or chromosome name found in the 1st column of the input GFF file (transcript.gtf in this example) must have a corresponding sequence entry in chromosomes.fa. This should be the case in our example if genome.fa is the file corresponding to the same genome (index) that was used for mapping the reads with Tophat.
Note that the retrieval of the transcript sequences this way is going to be much faster if a fasta index file (genome.fa.fai in this example) is found in the same directory with the genomic fasta file. Such an index file can be created with the samtools utility prior to running gffread, like this:
samtools faidx genome.fa
Then in subsequent runs using the -g option gffread will find this FASTA index and use it to speed up the extraction of transcript sequences.
- project on Github: https://github.com/gpertea/gffread
- source package: gffread-0.9.8c.tar.gz
- precompiled Linux x86_64 binary (statically built on CentOS 5) : gffread-0.9.8c.Linux_x86_64.tar.gz
- precompiled Apple OS X binary (for OS X 10.7 and higher): gffread-0.9.8c.OSX_x86_64.tar.gz
The gffcompare utility
The program gffcompare can be used to compare, merge, annotate and estimate accuracy of one or more GFF files (the "query" files), when compared with a reference annotation (also provided as GFF).
This program is based on the CuffCompare utility which is part of the Cufflinks/Tuxedo suite, so the various usage options and output files as documented in the CuffCompare manual apply to the gffcompare program as well.
Example: evaluating transcript discovery accuracy
Gffcompare can be used to evaluate and compare the accuracy of transcript assemblers - in terms of their structural correctness (exon/intron coordinates). This assessment can even be performed in case of more generic "transcript discovery" programs like gene finders. The best way to do this would be to use a simulated data set (where the "reference annotation" is also the set of the expressed transcripts being simulated), but for well annotated reference genomes (human, mouse etc.), gffcompare can be used to evaluate and compare the general accuracy of isoform discovery programs on a real data set, using just the known (reference) annotation of that genome.
As a practical example, let's assume we ran both Cufflinks and StringTie on a mouse RNA-Seq sample and we want to compare the overall accuracy of the two programs. In order to compare the baseline de novo transcript assembly accuracy, both Cufflinks and StringTie should be run without using any reference annotation data (i.e. no -G or -g options were used).
Assuming that Cufflinks' transcript assembly output file name is cufflinks_asm.gtf and StringTie's output is in stringtie_asm.gtf, while the reference annotation would be in a file called mm10.gff, the gffcompare commands would be:
gffcompare -R -r mm10.gff -o cuffcmp cufflinks_asm.gtf gffcompare -R -r mm10.gff -o strtcmp stringtie_asm.gtf
The -R option is used here in order to adjust the sensitivity calculation as to only consider the "expressed" genes, which are those reference genes for which gffcompare found at least one overlapping transfrag in the given assembly data (*_asm.gtf file). (Of course this option would not be needed in the case of simulated RNA-Seq experiments where the reference transcripts would be all "expressed"). Multiple output files will be generated by gffcompare, with the given prefix - in the example above, for the stringtie assemblies, the output files will be:
strtcmp.combined.gtf strtcmp.loci strtcmp.stats strtcmp.tracking strtcmp.stringtie_asm.gtf.refmap strtcmp.stringtie_asm.gtf.tmapComparing the transcript assembly accuracy of the two programs is done by looking at the Sensitivity and Precision values in the *.stats output files for each program (strtcmp.stats vs cuffcmp.stats in the example above). That output looks like this (partially):
#= Summary for dataset: stringtie_asm.gtf # Query mRNAs : 23555 in 17628 loci (17231 multi-exon transcripts) # (3731 multi-transcript loci, ~1.3 transcripts per locus) # Reference mRNAs : 16628 in 12062 loci (15850 multi-exon) # Super-loci w/ reference transcripts: 11552 #-----------------| Sensitivity | Precision | Base level: 82.4 | 76.5 | Exon level: 81.2 | 82.9 | Intron level: 86.1 | 94.8 | Intron chain level: 56.9 | 52.4 | Transcript level: 55.2 | 38.9 | Locus level: 70.1 | 48.0 |
- project on Github: https://github.com/gpertea/gffcompare
- source package: gffcompare-0.9.9c.tar.gz
- precompiled Linux x86_64 binary (statically built on CentOS 5) : gffcompare-0.9.9c.Linux_x86_64.tar.gz
- precompiled Apple OS X binary (for OS X 10.7 and higher): gffcompare-0.9.9c.OSX_x86_64.tar.gz