GlimmerHMM | User Manual

GlimmerHMM User Manual

Overview
Training an organism specific GlimmerHMM
Running GlimmerHMM
Obtaining GlimmerHMM

I. Overview

GlimmerHMM is a new gene finder based on a Generalized Hidden Markov Model (GHMM). Although the gene finder conforms to the overall mathematical framework of a GHMM, additionally it incorporates splice site models adapted from the GeneSplicer program. The variable-length feature states (e.g., exons, introns, intergenic regions) are implemented using N^th-order interpolated Markov models (IMM) as described in Delcher et al. 1999 for N=8. Currently, GlimmerHMM's GHMM structure includes introns of each phase, intergenic regions, and four types of exons (initial, internal, final, and single), as depicted in the figure below.

There are a number of assumptions used by GlimmerHMM when predicting genes in the DNA sequence.
The main assumptions are

(1) the coding region of every gene begins with a start codon ATG (but partial genes can be predicted),

(2) a gene has no in-frame stop codons except the very last codon,

(3) each exon is in a consistent reading frame with the previous exon.

These constraints significantly enhance the efficiency of computing the optimal gene models, by restricting the search space of the GHMM algorithm. On the other hand, genuine frame shifts cannot be detected by the system.

III.Training an organism specific GlimmerHMM

III.1. Need of a specific organism training

Our earlier implementation of a gene finder - GlimmerM - trained for rice produced clearly superior gene models than GlimmerM trained for Arabidopsis (Pertea and Salzberg 2002). This is also valid for GlimmerHMM. Although the current trainings of GlimmerHMM should work well on closely related organisms, the user should use our training procedure to re-train the system for an other eukaryotic organism in order to improve the accuracy of the gene prediction. For all species that GlimmerHMM is already trained, we expect the performance to improve even further by re-training the system as the amount of DNA sequences from these species continues to grow.

III.2 Data collection

First of all it should be noted that a thorough collection of a good training data set is a step that should precede the training of any gene finder. This step is very important since the quality of the data presented for training is directly proportional with the accuracy of the resulted gene finder. As any species-specific gene finder, GlimmerHMM needs a training data set containing as many as possible complete coding sequences from the organism genome for which the gene prediction is intended. By surveying the public databases, one should be able to find all previously known genes of the target organism that are backed by laboratory evidence. These genes would form a sound training data set if their number would be large enough. Unfortunately, this is rarely the case, so one should use other methods in order to construct a reliable data set. From our experience, most often the training data set was formed by searching long ORFs (with more than 500 bases) against non redundant protein sequence databases by using programs like BLAST in order to map known proteins into the genome.

III.3 Training GlimmerHMM

To train GlimmerM you should run trainGlimmerM with the following command:

trainGlimmerHMM <mfasta_file> <exon_file> [optional_parameters]

<mfasta_file> and <exon_file> are the multi-FASTA file and the file containing the exon coordinates of the known genes, respectively.

<mfasta_file> is a multifasta file containing the sequences for training with the usual format:

>seq1
AGTCGTCGCTAGCTAGCTAGCATCGAGTCTTTTCGATCGAGGACTAGACTT
CTAGCTAGCTAGCATAGCATACGAGCATATCGGTCATGAGACTGATTGGGC
>seq2
TTTAGCTAGCTAGCATAGCATACGAGCATATCGGTAGACTGATTGGGTTTA
TGCGTTA

<exon_file> is a file with the exon coordinates relative to the sequences contained in the <mfasta_file>; different genes are separated by a blank line; I am assuming a format like below:

seq1 5 15
seq1 20 34

seq1 50 48
seq1 45 36

seq2 17 20

In this example seq1 has two genes: one on the direct strand and another one on the complementary strand. Here you can find a real example of fasta and exon files.

If not enough data is available for training the splice sites the training procedure will be unsuccessful and exit with a warning message. In this case the user should collect more known genes with introns and then try the training procedure again. If not enough data is available for training the translational start and stop sites of the genes, the user will need to collect more genes. A minimum number of 50 genes with standard start codons (ATG) and stop codons (TAA/TAG/TGA) is assumed by default.

The optional parameters that can be given to the training program are:


-i i₁,i₂,...,i_n	isochores to be considered for training (e.g. if two isochores are desired with 0-40% GC content and 40-100% then the option should be: -i 0,40,100; default is set to -i 0,100 )
-f val	val = average value of upstream UTR region if known
-l val	val = average value of downstream UTR region if known
-n val	val = average value of intergenic region if known
-v val	val = value of flanking region to be considered around genes (default=200)
-b val	val = build 1st or 2nd order markov model for the splice sites (default=1)

III.4 Changing the parameters obtained during training

The effects of specific feature states on the modeling of human This step is optional to the user but many times a manual tuning of the system could improve the accuracy of the predictions. This is due partially to the fact that some of the parameters required for the training of the system might not be known for a certain organism (e.g. average size of the intergenic regions) so one might want to try different values for these parameters. We noticed that a gradient-ascent procedure which search through the parameter space similar to the one described in Majoros and Salzberg 2004 will often optimize the accuracy of the training.
Choosing adequate thresholds that determine the false negative-false positive ratio for specific site detection may be a challenging problem given the small number of true sites and the overwhelming number of false positives, and requires a suite of decisions that maximizes the accuracy of the recognition task without a big loss in the sensitivity. In the training procedure of GlimmerHMM the threshold for calling a sequence a real splice site is chosen by examining the trade-off in the false positive rate.
The system creates a sorted list of thresholds, adjusting the scoring function so that it will miss 1,2,3, etc. true sites. The default threshold is chosen to be the score at which the false positive rate drops by less than 1%. To allow a greater flexibility in setting the signals' thresholds, our training procedure allows the user to consult the false positive and false negative rates specific for the training data and set his own threshold.

After running trainGlimmerHMM, a log file can be consulted to find the default values set for some of the parameters of GlimmerHMM. The file config.file in the training directory specifies for each isochore which configuration file to use. For the default case, when no isochores are considered, there will be only one line with the foillowing info:
C+G <= 100 train_0_100.cfg
All changes in parameters can be done in the train_0_100.cfg file. Each parameter appears on a single line of the .cfg file. The parameters that can be changed are described in the following table:


line in the .cfg file	Description
acceptor_threshold value	acceptor site threshold value; the false negative/false positive rates for different acceptor thresholds can be consulted from the false.acc file.
donor_threshold value	donor site threshold value; the false negative/false positive rates for different donor thresholds can be consulted from the false.don file.
ATG_threshold value	start codon threshold value; the false negative/false positive rates for different start codon thresholds can be consulted from the false.atg file.
Stop_threshold value	stop codon threshold value; the false negative/false positive rates for different stop codon thresholds can be consulted from the false.stop file.
split_penalty value	a factor penalizing the gene finder's tendency to split genes.
intergenic_val value	the minimum intergenic distance espected between genes.
intergenic_penalty value	a penalty factor for genes situated closer than the intergenic_val value.
MeanIntergen value	the average size of intergenic regions.
BoostExon value	a factor to increase the sensitivity of exon prediction.
BoostSplice value	a factor to increase the score of good scoring splice sites.
BoostSgl value	a factor to increase the predicted number of single exon genes.
onlytga value	set this value to 1 if TGA is the only stop codon in the genome.
onlytaa value	set this value to 1 if TAA is the only stop codons in the genome.
onlytag value	set this value to 1 if TAG is the only stop codons in the genome.

IV. Running GlimmerHMM

The program GlimmerHMM takes two inputs: a DNA sequence file in FASTA format and a directory containing the training files for the program. If not specified, the training directory is by default the current working directory. No options are implemented at this time for the program.

VI. Obtaining GlimmerHMM

GlimmerHMM is available free of charge under the open-source Artistic License.

To download GlimmerHMM please click here.

References

Delcher, A.L., Harmon, D., Kasif, S., White,O. and Salzberg, S.L. Improved microbial gene identification with GLIMMER Nucleic Acids Research, 27:23 (1999), 4636-4641.

Majoros, W.H., Pertea, M., and Salzberg, S.L. TigrScan and GlimmerHMM: two open-source ab initio eukaryotic gene-finders Bioinformatics 20 2878-2879.

Majoros,W.M. and Salzberg, S.L. (2004) An empirical analysis of training protocols for probabilistic gene finders. BMC Bioinformatics 5:206.

Pertea, M., X. Lin, et al. (2001). "GeneSplicer: a new computational method for splice site prediction." Nucleic Acids Res 29(5): 1185-90.

Pertea, M. and S. L. Salzberg (2002). "Computational gene finding in plants." Plant Molecular Biology 48(1-2): 39-48.

Pertea, M., S. L. Salzberg, et al. (2000). "Finding genes in Plasmodium falciparum." Nature 404(6773): 34; discussion 34-5.

Salzberg, S. L., M. Pertea, et al. (1999). "Interpolated Markov models for eukaryotic gene finding." Genomics 59(1): 24-31.

Yuan, Q., J. Quackenbush, et al. (2001). "Rice bioinformatics. analysis of rice sequence data and leveraging the data to other plant species." Plant Physiol 125(3): 1166-74.