JIGSAW: gene prediction using multiple sources of evidence

 

Overview

JIGSAW is a program designed to use the output from gene finders, splice site prediction programs and sequence alignments to predict gene models. The program provides an automated way to take advantage of the many successful methods for computational gene prediction, and it can provide substantial improvements in accuracy over an individual gene prediction program.

JIGSAW can be applied to any species. We have tested JIGSAW on Human, Rice (Oryza sativa), Arabidopsis thaliana, C. elegans, Brugia malayi, Cryptococcus neoformans, Entamoeba histolytica, Theileria parva, Aspergillus fumigatus, Plasmodium falciparum and Plasmodium yoelii.

UPDATE!
The linear combiner option is now available in the current JIGSAW software distribution. This allows JIGSAW to be run without the use of training data. A weight is assigned to each evidence source, and gene predictions are based on a weighted voting scheme, yielding the best 'consensus' predictions.

Predictions are now available for the ENCODE regions in Human and are viewable as custom tracks in the UCSC Human Genome Browser.
Predictions are available for the Human genome and are viewable as custom tracks in the UCSC Human Genome Browser.
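
The weighted voting idea behind the linear combiner option can be illustrated with a small sketch. This is not JIGSAW's code: the function name, the source names, the weights and the 0.5 threshold are all assumptions made for illustration, and the sketch votes on individual sequence positions rather than on full gene structures.

```python
from collections import defaultdict

def weighted_vote(predictions, weights, threshold=0.5):
    """Return positions whose weighted share of supporting sources exceeds threshold.

    predictions: dict mapping a (hypothetical) source name to the set of
                 positions that source labels as coding.
    weights:     dict mapping each source name to a non-negative weight.
    """
    total = sum(weights[source] for source in predictions)
    score = defaultdict(float)
    for source, positions in predictions.items():
        for pos in positions:
            score[pos] += weights[source]
    return {pos for pos, s in score.items() if s / total > threshold}

# Hypothetical evidence sources and weights, chosen only for this example.
predictions = {
    "finder_a": set(range(100, 200)),
    "finder_b": set(range(120, 220)),
    "alignments": set(range(150, 200)),
}
weights = {"finder_a": 1.0, "finder_b": 1.0, "alignments": 2.0}

consensus = weighted_vote(predictions, weights)
print(min(consensus), max(consensus))
```

With these made-up weights, only the positions supported by all three sources pass the threshold, so the consensus region is the overlap of the three predictions.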

Accuracy

Prediction program      Gene sensitivity  Gene precision  Exon sensitivity  Exon precision  Nucleotide sensitivity  Nucleotide precision
JIGSAW                  59%               66%             87%               89%             90%                     98%
Ensembl                 62%               50%             85%               80%             85%                     95%
UCSC's KnownGene track  65%               38%             84%               77%             82%                     93%

Table 1


The results in Table 1 measure the accuracy of JIGSAW, Ensembl and cDNA alignments from the UCSC genome browser in Human. The test set comprises 1,563 genes. JIGSAW uses the output from Ensembl and the cDNA alignments along with many other evidence sources available in the UCSC genome database, including other gene finders and expression evidence. Sensitivity is the percentage of true genes (or exons, or nucleotides) that the program finds. Precision is the percentage of the program's predicted genes (or exons, or nucleotides) that are correct.
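
The two measures defined above can be computed as in the following minimal sketch. It assumes that a predicted exon counts as correct only when both of its coordinates match a true exon exactly. The function name and the example coordinates are hypothetical and are not taken from the JIGSAW evaluation.

```python
def sensitivity_precision(true_items, predicted_items):
    """Return (sensitivity, precision) for two collections of features.

    A predicted feature counts as correct only if it is identical to a
    true feature (an assumption made for this sketch).
    """
    true_set, pred_set = set(true_items), set(predicted_items)
    correct = len(true_set & pred_set)
    sensitivity = correct / len(true_set) if true_set else 0.0
    precision = correct / len(pred_set) if pred_set else 0.0
    return sensitivity, precision

# Made-up exon coordinates as (start, end) pairs.
true_exons = {(100, 200), (300, 400), (500, 600), (700, 800)}
predicted_exons = {(100, 200), (300, 400), (500, 650)}

sn, pr = sensitivity_precision(true_exons, predicted_exons)
print(f"sensitivity={sn:.2f} precision={pr:.2f}")
```

Here two of the four true exons are found (sensitivity 0.50) and two of the three predicted exons are correct (precision 0.67).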

Prediction program  Correct genes  Missed genes  Correct exons  Missed exons  Nucleotide sensitivity
JIGSAW              54%            3%            86%            4%            97%
FgenesH             42%            4%            80%            8%            95%
GeneMark.hmm        26%            5%            51%            28%           76%

Table 2


The results in Table 2 measure the accuracy of JIGSAW, FgenesH and GeneMark.hmm in Oryza sativa. The test set includes 5,595 genes containing 26,827 exons. JIGSAW uses the output from FgenesH, GlimmerR, GeneMark.hmm and Genscan, splice site predictions from GeneSplicer, sequence alignments from a protein database and sequence alignments from the TIGR gene indices.


Prediction program  Correct genes  Missed genes  Correct exons  Missed exons  Nucleotide sensitivity
JIGSAW              78%            1%            93%            3%            98%
TwinScan            67%            1%            87%            4%            96%
GeneMark.hmm        45%            2%            79%            5%            96%
Genscan             37%            2%            75%            10%           92%
GlimmerM            32%            1%            71%            9%            93%

Table 3


The results in Table 3 measure the accuracy of gene prediction programs in Arabidopsis thaliana. The test set includes 1,783 genes containing 7,510 exons. JIGSAW uses output from the other gene prediction programs listed in the table, an earlier version of GlimmerM, splice site predictions from GeneSplicer, sequence alignments from a protein database and sequence alignments from the TIGR gene indices.

Using JIGSAW

JIGSAW is given a training set consisting of example output from an automated gene structure annotation pipeline, along with the sequence coordinates of known genes. JIGSAW compares the pipeline's predicted genes to the known genes and records the prediction accuracy of each combination of evidence. A non-linear model is then built to estimate the accuracy of the different combinations of evidence found in new data. JIGSAW pieces together the gene structure models most likely to be accurate, based on the statistics collected from the training set.
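
The bookkeeping step described above, recording how often each combination of evidence turns out to be correct, might look roughly like the sketch below. It is a deliberate simplification: it builds a plain frequency table and does not implement JIGSAW's non-linear model. The function name, the source names and the training examples are hypothetical.

```python
from collections import defaultdict

def combination_accuracy(training_examples):
    """Estimate accuracy for each combination of agreeing evidence sources.

    training_examples: iterable of (sources, is_correct) pairs, where
    sources names the evidence sources that agree on a predicted feature
    and is_correct says whether that feature matches a known gene.
    """
    counts = defaultdict(lambda: [0, 0])  # combination -> [correct, total]
    for sources, is_correct in training_examples:
        combo = frozenset(sources)  # order of sources does not matter
        counts[combo][1] += 1
        if is_correct:
            counts[combo][0] += 1
    return {combo: correct / total for combo, (correct, total) in counts.items()}

# Made-up training observations.
examples = [
    (("finder_a", "finder_b"), True),
    (("finder_a", "finder_b"), True),
    (("finder_b", "finder_a"), False),
    (("finder_a",), True),
    (("finder_a",), False),
]

accuracy = combination_accuracy(examples)
for combo, value in accuracy.items():
    print(sorted(combo), round(value, 2))
```

In this toy data, features supported by both finders are correct two times out of three, while features supported by finder_a alone are correct half of the time.
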
JIGSAW predicts gene models for a user-supplied genomic sequence. The main interface is a simple "evidence list" file, which gives, for each prediction program's output, the file name, the file format and the type of evidence. JIGSAW reads several coordinate-based file formats, including GFF.
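
As an illustration of what a coordinate-based format such as GFF contains, the sketch below reads the standard tab-separated GFF columns (sequence name, source, feature type, start, end, score, strand, frame, attributes) into simple records. It is independent of JIGSAW's own C++ reader, and the Feature type, the parse_gff function and the sample lines are hypothetical.

```python
from typing import Iterable, Iterator, NamedTuple

class Feature(NamedTuple):
    seqid: str
    source: str
    feature_type: str
    start: int
    end: int
    strand: str

def parse_gff(lines: Iterable[str]) -> Iterator[Feature]:
    """Yield one Feature per GFF data line, skipping comments and blank lines."""
    for line in lines:
        line = line.rstrip("\n")
        if not line or line.startswith("#"):
            continue
        cols = line.split("\t")
        if len(cols) < 8:
            continue  # not a well-formed GFF data line
        yield Feature(cols[0], cols[1], cols[2], int(cols[3]), int(cols[4]), cols[6])

# Made-up GFF content for demonstration.
sample = [
    "##gff-version 3",
    "chr1\tfinder_a\texon\t100\t200\t.\t+\t.\tID=exon1",
    "chr1\tfinder_a\texon\t300\t400\t.\t+\t.\tID=exon2",
]

features = list(parse_gff(sample))
for feature in features:
    print(feature.seqid, feature.feature_type, feature.start, feature.end, feature.strand)
```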

System requirements

JIGSAW is developed in C++ and compiles using GNU gcc 3.2 or newer.

Download

To download the most recent JIGSAW system, just click HERE.

This software is OSI Certified Open Source Software.


Documentation

The distribution includes documentation on how to get started, along with a tutorial that demonstrates, step by step, how to train and run JIGSAW. The tutorial is also available online HERE.

Software development documentation

Library API
Application API

References

J. E. Allen, W. H. Majoros, M. Pertea and S. L. Salzberg. JIGSAW, GeneZilla, and GlimmerHMM: puzzling out the features of human genes in the ENCODE regions. Genome Biology 2007, 7(Suppl):S9.

J. E. Allen and S. L. Salzberg. JIGSAW: integration of multiple sources of evidence for gene prediction. Bioinformatics 21(18): 3596-3603, 2005.

J. E. Allen, M. Pertea and S. L. Salzberg. Computational gene prediction using multiple sources of evidence. Genome Research, 14(1), 2004.

Acknowledgements

Development of JIGSAW was supported in part by NIH grant R01-LM06845 to SLS.

Contact Information

jeallen - umiacs umd edu