
Glimmer3

Table 4:  Train on long-orfs Output & Test on All Annotated Genes

This table shows the accuracy of Glimmer3 predictions on 30 microbial genomes from RefSeq at GenBank. The training data for the Glimmer3 runs was obtained by:
  1. running the Glimmer3 version of the long-orfs program (with the -t 1.15 option) to find an initial set of ORFs;
  2. building an interpolated context model (ICM) from those ORFs and running Glimmer3 with it;
  3. using these initial Glimmer3 predictions to determine start-codon frequencies and to find a ribosome-binding-site model.
The Glimmer2 training data was created by running the Glimmer2 version of the long-orfs program.
Both Glimmer3 and Glimmer2 were run with the same options (listed in the Options column), and the results were compared to a test set consisting of all annotated genes that are at least 90 bp long and contain no frameshifts or in-frame stops. Note that many genes in this test set are hypothetical, which often means that the only evidence for them is a prediction by a gene-finding program (in many cases Glimmer2 itself). It is therefore no surprise that Glimmer3 has fewer matches than Glimmer2 for many genomes.
Glimmer2 predictions were filtered to remove those:
  1. marked as "NearReject", or
  2. contained within another prediction and having a lower score than the containing prediction.
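The two filtering rules can be sketched in a few lines of Python. This is only an illustration: the Prediction record and its field names are hypothetical and do not correspond to Glimmer2's actual output format, and the sketch assumes that "contained within" means plain coordinate containment and that any prediction in the input may act as the containing one.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Prediction:
    """Hypothetical record for one Glimmer2 prediction."""
    lo: int                    # leftmost coordinate (1-based, inclusive)
    hi: int                    # rightmost coordinate (1-based, inclusive)
    score: float
    near_reject: bool = False  # True if the prediction is marked "NearReject"


def filter_glimmer2(preds):
    """Drop NearReject predictions and weaker predictions nested in another one."""
    kept = []
    for p in preds:
        if p.near_reject:
            continue  # rule 1
        # Rule 2: p lies inside some other prediction q that has a higher score.
        nested_and_weaker = any(
            q is not p and q.lo <= p.lo and p.hi <= q.hi and p.score < q.score
            for q in preds
        )
        if not nested_and_weaker:
            kept.append(p)
    return kept
```

A nested prediction with a higher score than its container is kept under this reading of rule 2.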
The -g option is the minimum gene length and the -o option is the maximum overlap allowed. The -g value was obtained by finding the shortest gene in the test set. The -o value was chosen as either the maximum overlap between genes in the test set or 110, whichever was smaller.
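As a rough sketch of the rule just described, the following Python function derives both values from a list of gene coordinates. The (lo, hi) tuple representation and the function name are assumptions made for illustration; in particular, whether the gene length includes the stop codon depends on how the annotation coordinates are defined.

```python
def choose_options(genes, overlap_cap=110):
    """Return (-g value, -o value) for a test set.

    genes: list of (lo, hi) tuples with 1-based inclusive coordinates
    (a hypothetical representation of the annotated test set).
    """
    # -g: length of the shortest gene in the test set.
    min_len = min(hi - lo + 1 for lo, hi in genes)

    # -o: largest overlap between any two genes, capped at overlap_cap.
    ordered = sorted(genes)
    max_overlap = 0
    for i, (lo_a, hi_a) in enumerate(ordered):
        for lo_b, hi_b in ordered[i + 1:]:
            if lo_b > hi_a:
                break  # genes are sorted by lo, so no later gene overlaps gene a
            max_overlap = max(max_overlap, min(hi_a, hi_b) - lo_b + 1)
    return min_len, min(max_overlap, overlap_cap)
```

For example, choose_options([(1, 300), (250, 400), (390, 1000)]) gives a minimum length of 151 and a maximum overlap of 51, which would correspond to "-g 151 -o 51".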
"Matches" are predictions that had the same reading frame and stop codon as an annotated gene in the test set. "Correct Starts" are predictions that are matches and also have the same start codon as the matched gene. "Extra" are predictions that are not matches.
The columns labelled "vs. Glimmer2.13" are the Glimmer3 value minus the corresponding Glimmer2.13 value. For example, an entry of "+2" in the "Matches" column means that Glimmer3 had 2 more matches than Glimmer2.13 on the genome for that row.
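The definitions of "Matches", "Correct Starts", "Extra", and the "vs. Glimmer2.13" entries can be sketched as follows. The (strand, start, stop) tuples are a hypothetical representation, where stop is the coordinate of the stop codon; the sketch relies on the fact that two genes on the same strand with the same stop coordinate are necessarily in the same reading frame.

```python
def score_predictions(annotated, predicted):
    """Count (matches, correct_starts, extra) for a list of predictions.

    Both arguments are lists of (strand, start, stop) tuples.
    """
    start_by_stop = {(strand, stop): start for strand, start, stop in annotated}
    matches = correct_starts = extra = 0
    for strand, start, stop in predicted:
        key = (strand, stop)
        if key in start_by_stop:
            matches += 1                     # same frame and stop codon
            if start_by_stop[key] == start:
                correct_starts += 1          # same start codon as well
        else:
            extra += 1                       # matches no gene in the test set
    return matches, correct_starts, extra


def versus_glimmer2(glimmer3_value, glimmer2_value):
    """Format a "vs. Glimmer2.13" entry: Glimmer3 value minus Glimmer2.13 value."""
    diff = glimmer3_value - glimmer2_value
    return "0" if diff == 0 else f"{diff:+d}"
```

Reading the table in the other direction, the Archaeoglobus fulgidus row (2337 matches, "+3") implies that Glimmer2.13 had 2334 matches on that genome. The percentages in the table are consistent with being taken relative to the "# Genes" column; for the same row, 2337 / 2398 is about 97.5%.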
Organism | Length | GC% | # Genes | Glimmer3 Matches | Matches % | Glimmer3 Correct Starts | Correct Starts % | Glimmer3 Extra | Matches vs. Glimmer2.13 | Correct Starts vs. Glimmer2.13 | Extra vs. Glimmer2.13 | Options
--- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | ---
Archaeoglobus fulgidus | 2.18Mb | 48.6 | 2398 | 2337 | 97.5% | 1715 | 71.5% | 156 | +3 | -85 | -69 | -g 141 -o 67
Bacillus anthracis | 5.23Mb | 35.4 | 5308 | 5110 | 96.3% | 4379 | 82.5% | 434 | +7 | +1097 | -152 | -g 111 -o 30
Bacillus subtilis | 4.21Mb | 43.5 | 4095 | 4023 | 98.2% | 3438 | 84.0% | 559 | +11 | +1030 | -732 | -g 102 -o 93
Campylobacter jejuni | 1.78Mb | 30.3 | 1836 | 1790 | 97.5% | 1649 | 89.8% | 122 | -1 | +176 | -68 | -g 111 -o 104
Carboxydothermus hydrogenoformans | 2.40Mb | 42.0 | 2606 | 2466 | 94.6% | 2179 | 83.6% | 165 | -20 | +478 | -191 | -g 111 -o 70
Caulobacter crescentus | 4.02Mb | 67.2 | 3737 | 3514 | 94.0% | 2299 | 61.5% | 255 | -62 | -321 | -800 | -g 123 -o 110
Chlorobium tepidum | 2.15Mb | 56.5 | 2252 | 1964 | 87.2% | 1357 | 60.3% | 160 | -76 | -174 | -321 | -g 114 -o 30
Clostridium perfringens | 3.03Mb | 28.6 | 2660 | 2638 | 99.2% | 2408 | 90.5% | 55 | +5 | +493 | -26 | -g 111 -o 30
Colwellia psychrerythraea | 5.37Mb | 38.0 | 4902 | 4572 | 93.3% | 3818 | 77.9% | 241 | -96 | +334 | -198 | -g 93 -o 30
Dehalococcoides ethenogenes | 1.47Mb | 48.9 | 1579 | 1484 | 94.0% | 1237 | 78.3% | 86 | -4 | +137 | -50 | -g 111 -o 30
Escherichia coli | 4.64Mb | 50.8 | 4231 | 4124 | 97.5% | 3616 | 85.5% | 412 | +16 | +881 | -848 | -g 99 -o 110
Geobacter sulfurreducens | 3.81Mb | 60.9 | 3438 | 3284 | 95.5% | 2673 | 77.7% | 218 | -22 | +750 | -705 | -g 126 -o 110
Haemophilus influenzae | 1.83Mb | 38.1 | 1649 | 1639 | 99.4% | 1439 | 87.3% | 187 | 0 | +148 | -126 | -g 111 -o 101
Helicobacter pylori | 1.67Mb | 38.9 | 1556 | 1520 | 97.7% | 1284 | 82.5% | 178 | 0 | +87 | -101 | -g 111 -o 68
Listeria monocytogenes | 2.91Mb | 38.0 | 2819 | 2752 | 97.6% | 2480 | 88.0% | 80 | -7 | +597 | -46 | -g 111 -o 30
Methylococcus capsulatus | 3.30Mb | 63.6 | 2958 | 2845 | 96.2% | 2090 | 70.7% | 464 | -24 | +448 | -894 | -g 93 -o 110
Mycobacterium tuberculosis | 4.40Mb | 65.6 | 4189 | 3901 | 93.1% | 2435 | 58.1% | 476 | -81 | -267 | -640 | -g 111 -o 110
Neisseria meningitidis | 2.27Mb | 51.5 | 2055 | 1891 | 92.0% | 1493 | 72.7% | 901 | +10 | +253 | -860 | -g 93 -o 89
Porphyromonas gingivalis | 2.34Mb | 48.3 | 1909 | 1792 | 93.9% | 1278 | 66.9% | 399 | -23 | -79 | -336 | -g 114 -o 62
Pseudomonas fluorescens | 7.07Mb | 63.3 | 6134 | 6044 | 98.5% | 4677 | 76.2% | 419 | +11 | +987 | -2335 | -g 108 -o 110
Pseudomonas putida | 6.18Mb | 61.5 | 5349 | 5165 | 96.6% | 3881 | 72.6% | 461 | -71 | +297 | -1532 | -g 114 -o 101
Ralstonia solanacearum | 3.72Mb | 67.0 | 3435 | 3346 | 97.4% | 2700 | 78.6% | 322 | +452 | +1378 | -2295 | -g 99 -o 110
Staphylococcus epidermidis | 2.62Mb | 32.1 | 2487 | 2343 | 94.2% | 2123 | 85.4% | 94 | +4 | +474 | -28 | -g 111 -o 75
Streptococcus agalactiae | 2.16Mb | 35.6 | 2122 | 2052 | 96.7% | 1872 | 88.2% | 90 | +5 | +337 | -23 | -g 114 -o 30
Streptococcus pneumoniae | 2.16Mb | 39.7 | 2093 | 1929 | 92.2% | 1654 | 79.0% | 267 | -18 | +98 | -34 | -g 114 -o 47
Thermotoga maritima | 1.86Mb | 46.2 | 1854 | 1816 | 98.0% | 1427 | 77.0% | 90 | -3 | +137 | -119 | -g 114 -o 56
Treponema denticola | 2.84Mb | 37.9 | 2761 | 2594 | 94.0% | 2249 | 81.5% | 96 | -41 | +380 | -165 | -g 111 -o 68
Treponema pallidum | 1.14Mb | 52.8 | 1034 | 990 | 95.7% | 619 | 59.9% | 144 | -11 | -21 | -272 | -g 111 -o 110
Ureaplasma parvum | 0.75Mb | 25.5 | 614 | 605 | 98.5% | 526 | 85.7% | 16 | -5 | +13 | -7 | -g 111 -o 30
Wolbachia endosymbiont | 1.08Mb | 34.2 | 805 | 790 | 98.1% | 642 | 79.8% | 391 | -4 | +28 | -82 | -g 126 -o 30
Averages: | | | | | 95.8% | | 77.8% | | -1.5 | +336 | -468 |
Notes:
  • Start sites are not necessarily annotated carefully or accurately in every genome, so the Correct Starts figures should be treated with caution.
  • Since Ureaplasma uses NCBI translation table 4 (TGA is not a stop codon), Glimmer3 must be run with an appropriate option ("-z 4" or "-Z taa,tag") and Glimmer2 must be modified and recompiled as described in its readme files in order to obtain these results.
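To illustrate why the stop-codon set matters for Ureaplasma, the short Python sketch below scans a reading frame for its first stop codon under two stop-codon sets: TAA, TAG, and TGA (the standard bacterial set) and TAA and TAG only (as in translation table 4). The function name and the example sequence are made up for illustration and are not part of Glimmer.

```python
STOPS_STANDARD = {"TAA", "TAG", "TGA"}  # standard bacterial stop codons
STOPS_TABLE_4 = {"TAA", "TAG"}          # translation table 4: TGA is not a stop


def first_in_frame_stop(seq, stops):
    """Return the 0-based offset of the first in-frame stop codon, or None."""
    seq = seq.upper()
    for i in range(0, len(seq) - 2, 3):
        if seq[i:i + 3] in stops:
            return i
    return None


orf = "ATGTGAAAATAA"  # codons: ATG TGA AAA TAA (made-up sequence)
print(first_in_frame_stop(orf, STOPS_STANDARD))  # 3: the ORF ends at TGA
print(first_in_frame_stop(orf, STOPS_TABLE_4))   # 9: TGA is read through
```

With the standard set, the gene finder would truncate this reading frame at the TGA codon, which is why the stop codons have to be overridden for genomes that use table 4.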