Glimmer3

Table 3:  Train on long-orfs Output & Test on Non-Hypothetical Genes

This table shows the accuracy of Glimmer3 predictions on 30 microbial genomes from RefSeq at GenBank. The interpolated context model (ICM) training data for the Glimmer3 runs was obtained by:
  1. running the Glimmer3 version of the long-orfs program (with the -t 1.15 option) to find an initial set of orfs
  2. building an ICM from those orfs and running Glimmer3
  3. using these initial Glimmer3 predictions to determine start codon frequencies and to build a ribosome-binding site model.
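Step 3 relies on start codon frequencies. A minimal Python sketch of that tally follows, assuming the start codons of the initial predictions have already been extracted as strings. The function name and the sample list are hypothetical, and this is not Glimmer3's own code.

```python
from collections import Counter


def start_codon_frequencies(start_codons):
    """Return the relative frequency of each start codon in a list of codons."""
    counts = Counter(codon.upper() for codon in start_codons)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {codon: n / total for codon, n in counts.items()}


# Hypothetical start codons taken from an initial set of predictions.
example = ["ATG", "ATG", "GTG", "ATG", "TTG", "atg", "GTG", "ATG"]
freqs = start_codon_frequencies(example)
```

With this sample, ATG accounts for five of the eight starts, GTG for two and TTG for one.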
The Glimmer2 training data was created by running the Glimmer2 version of the long-orfs program.
Both Glimmer3 and Glimmer2 were run with the same indicated options, and the results were compared to a test set consisting of all annotated genes that lack the word "hypothetical" in their function description, have no frameshifts or in-frame stops, and are at least 90 bp long. Glimmer2 predictions were filtered to remove those:
  1. marked as "NearReject", or
  2. contained within another prediction and having a lower score than the containing prediction.
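The two filtering rules for Glimmer2 predictions can be sketched in Python as below. The `Prediction` record and its field names are hypothetical and do not reflect Glimmer2's output format. The sketch also assumes a linear coordinate system and checks containment against every prediction in the input, which the text does not specify.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Prediction:
    lo: int          # leftmost coordinate (inclusive)
    hi: int          # rightmost coordinate (inclusive)
    score: float
    note: str = ""   # free-text annotation, e.g. "NearReject"


def filter_glimmer2(preds):
    """Drop NearReject predictions and lower-scoring contained predictions."""
    kept = []
    for p in preds:
        if "NearReject" in p.note:
            continue
        # Assumption: any other prediction that spans p and outscores it removes p.
        dominated = any(
            q is not p and q.lo <= p.lo and p.hi <= q.hi and p.score < q.score
            for q in preds
        )
        if not dominated:
            kept.append(p)
    return kept


preds = [
    Prediction(100, 900, 12.0),
    Prediction(300, 600, 5.0),                  # inside the first, lower score
    Prediction(1000, 1500, 8.0, "NearReject"),  # marked NearReject
    Prediction(2000, 2600, 3.0),
    Prediction(2100, 2400, 9.0),                # inside the previous, higher score
]
kept = filter_glimmer2(preds)
```

Here the second and third predictions are removed. The last one is kept because it scores higher than the prediction that contains it.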
The -g option is the minimum gene length and the -o option is the maximum overlap allowed. The -g value was obtained by finding the shortest gene in the test set. The -o value was chosen as either the maximum overlap between genes in the test set or 110, whichever was smaller.
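As an illustration of how the -g and -o values could be derived from a test set, here is a small Python sketch. It assumes genes are given as inclusive (lo, hi) coordinate pairs, that length is simply hi - lo + 1, and that overlap is measured regardless of strand. None of these details are stated above, and the function name is hypothetical.

```python
def choose_options(genes, overlap_cap=110):
    """Build a '-g N -o M' option string from (lo, hi) gene coordinates."""
    # -g: length of the shortest gene in the test set.
    min_len = min(hi - lo + 1 for lo, hi in genes)

    # -o: largest overlap between any two genes, capped at overlap_cap.
    max_overlap = 0
    furthest_hi = None
    for lo, hi in sorted(genes):
        if furthest_hi is not None:
            overlap = min(furthest_hi, hi) - lo + 1
            max_overlap = max(max_overlap, overlap)
        furthest_hi = hi if furthest_hi is None else max(furthest_hi, hi)

    return f"-g {min_len} -o {min(max_overlap, overlap_cap)}"


# Hypothetical coordinates: shortest gene is 91 bp, largest overlap is 51 bp.
options = choose_options([(1, 300), (280, 700), (650, 740), (2000, 2500)])
# A 201 bp overlap is capped at 110.
capped = choose_options([(1, 500), (300, 900)])
```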
"Matches" are predictions that had the same reading frame and stop codon as an annotated gene in the test set. "Correct Starts" are predictions that are matches and also have the same start codon as the matched gene. "Extra" are predictions that are not matches.
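These three definitions translate into a simple tally. The Python sketch below assumes each gene is a (strand, start, stop) tuple, a hypothetical representation. It identifies a match by strand and stop coordinate, since a stop codon at a given position on a given strand fixes the reading frame.

```python
def score_predictions(annotated, predicted):
    """Count matches, correct starts and extra predictions.

    Each gene is a (strand, start, stop) tuple; 'stop' is the stop codon
    coordinate on the given strand.
    """
    start_by_stop = {(strand, stop): start for strand, start, stop in annotated}
    matches = correct_starts = extra = 0
    for strand, start, stop in predicted:
        key = (strand, stop)
        if key in start_by_stop:
            matches += 1
            if start_by_stop[key] == start:
                correct_starts += 1
        else:
            extra += 1
    return matches, correct_starts, extra


# Hypothetical test set and predictions.
annotated = [("+", 100, 400), ("-", 900, 500), ("+", 1000, 1600)]
predicted = [("+", 100, 400), ("-", 870, 500), ("+", 2000, 2300)]
result = score_predictions(annotated, predicted)
```

In this example the first prediction is a match with a correct start, the second is a match with a different start, and the third is extra.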
The columns labelled "vs. Glimmer2.13" are the Glimmer3 value minus the corresponding Glimmer2.13 value. For example, an entry of "+2" in the "Matches" column means that Glimmer3 had 2 more matches than Glimmer2.13 on the genome for that row.
It is important to note that the test set here excludes genes labelled as "hypothetical", whose validity may be suspect. However, the annotation for some of these genomes was produced using Glimmer2, which will artificially inflate the number of matches for that program.
| Organism | Length | GC% | # Genes | Glimmer3 Matches | Glimmer3 Correct Starts | Glimmer3 Extra | vs. Glimmer2.13 Matches | vs. Glimmer2.13 Correct Starts | vs. Glimmer2.13 Extra | Options |
|---|---|---|---|---|---|---|---|---|---|---|
| Archaeoglobus fulgidus | 2.18Mb | 48.6 | 1165 | 1161 (99.7%) | 873 (74.9%) | 1332 | -2 | -34 | -64 | -g 141 -o 67 |
| Bacillus anthracis | 5.23Mb | 35.4 | 3132 | 3125 (99.8%) | 2751 (87.8%) | 2419 | -1 | +752 | -144 | -g 111 -o 30 |
| Bacillus subtilis | 4.21Mb | 43.5 | 1576 | 1562 (99.1%) | 1391 (88.3%) | 3020 | +3 | +421 | -724 | -g 102 -o 93 |
| Campylobacter jejuni | 1.78Mb | 30.3 | 1233 | 1233 (100.0%) | 1150 (93.3%) | 679 | +1 | +128 | -70 | -g 111 -o 104 |
| Carboxydothermus hydrogenoformans | 2.40Mb | 42.0 | 1753 | 1750 (99.8%) | 1580 (90.1%) | 881 | +5 | +427 | -216 | -g 111 -o 70 |
| Caulobacter crescentus | 4.02Mb | 67.2 | 2192 | 2187 (99.8%) | 1546 (70.5%) | 1582 | +4 | +75 | -866 | -g 123 -o 110 |
| Chlorobium tepidum | 2.15Mb | 56.5 | 1292 | 1289 (99.8%) | 934 (72.3%) | 835 | +3 | +26 | -400 | -g 114 -o 30 |
| Clostridium perfringens | 3.03Mb | 28.6 | 1504 | 1501 (99.8%) | 1383 (92.0%) | 1192 | -1 | +267 | -20 | -g 111 -o 30 |
| Colwellia psychrerythraea | 5.37Mb | 38.0 | 3063 | 3057 (99.8%) | 2625 (85.7%) | 1756 | -2 | +395 | -292 | -g 93 -o 30 |
| Dehalococcoides ethenogenes | 1.47Mb | 48.9 | 1069 | 1049 (98.1%) | 903 (84.5%) | 521 | +5 | +134 | -59 | -g 111 -o 30 |
| Escherichia coli | 4.64Mb | 50.8 | 3603 | 3534 (98.1%) | 3112 (86.4%) | 1002 | +11 | +784 | -843 | -g 99 -o 110 |
| Geobacter sulfurreducens | 3.81Mb | 60.9 | 2351 | 2337 (99.4%) | 1933 (82.2%) | 1165 | +7 | +575 | -734 | -g 126 -o 110 |
| Haemophilus influenzae | 1.83Mb | 38.1 | 1170 | 1169 (99.9%) | 1046 (89.4%) | 657 | -1 | +125 | -125 | -g 111 -o 101 |
| Helicobacter pylori | 1.67Mb | 38.9 | 915 | 910 (99.5%) | 795 (86.9%) | 788 | +2 | +57 | -103 | -g 111 -o 68 |
| Listeria monocytogenes | 2.91Mb | 38.0 | 1966 | 1961 (99.7%) | 1778 (90.4%) | 871 | +1 | +429 | -54 | -g 111 -o 30 |
| Methylococcus capsulatus | 3.30Mb | 63.6 | 2015 | 2006 (99.6%) | 1532 (76.0%) | 1303 | +13 | +365 | -931 | -g 93 -o 110 |
| Mycobacterium tuberculosis | 4.40Mb | 65.6 | 2217 | 2201 (99.3%) | 1465 (66.1%) | 2176 | -2 | +38 | -719 | -g 111 -o 110 |
| Neisseria meningitidis | 2.27Mb | 51.5 | 1232 | 1214 (98.5%) | 978 (79.4%) | 1578 | +4 | +208 | -854 | -g 93 -o 89 |
| Porphyromonas gingivalis | 2.34Mb | 48.3 | 1200 | 1196 (99.7%) | 894 (74.5%) | 995 | +2 | +57 | -361 | -g 114 -o 62 |
| Pseudomonas fluorescens | 7.07Mb | 63.3 | 4535 | 4510 (99.4%) | 3598 (79.3%) | 1953 | +35 | +895 | -2359 | -g 108 -o 110 |
| Pseudomonas putida | 6.18Mb | 61.5 | 3633 | 3600 (99.1%) | 2825 (77.8%) | 2026 | -10 | +482 | -1593 | -g 114 -o 101 |
| Ralstonia solanacearum | 3.72Mb | 67.0 | 2512 | 2485 (98.9%) | 2028 (80.7%) | 1183 | +341 | +1044 | -2184 | -g 99 -o 110 |
| Staphylococcus epidermidis | 2.62Mb | 32.1 | 1650 | 1646 (99.8%) | 1514 (91.8%) | 791 | +8 | +358 | -32 | -g 111 -o 75 |
| Streptococcus agalactiae | 2.16Mb | 35.6 | 1441 | 1436 (99.7%) | 1326 (92.0%) | 706 | +4 | +250 | -22 | -g 114 -o 30 |
| Streptococcus pneumoniae | 2.16Mb | 39.7 | 1359 | 1346 (99.0%) | 1203 (88.5%) | 850 | -1 | +164 | -51 | -g 114 -o 47 |
| Thermotoga maritima | 1.86Mb | 46.2 | 1092 | 1086 (99.5%) | 881 (80.7%) | 820 | -2 | +104 | -120 | -g 114 -o 56 |
| Treponema denticola | 2.84Mb | 37.9 | 1463 | 1457 (99.6%) | 1309 (89.5%) | 1233 | +1 | +268 | -207 | -g 111 -o 68 |
| Treponema pallidum | 1.14Mb | 52.8 | 575 | 567 (98.6%) | 391 (68.0%) | 567 | -2 | +50 | -281 | -g 111 -o 110 |
| Ureaplasma parvum | 0.75Mb | 25.5 | 327 | 324 (99.1%) | 295 (90.2%) | 297 | -1 | +21 | -11 | -g 111 -o 30 |
| Wolbachia endosymbiont | 1.08Mb | 34.2 | 628 | 622 (99.0%) | 517 (82.3%) | 559 | 0 | +37 | -86 | -g 126 -o 30 |
| Averages: | | | | 99.4% | 83.1% | | +14.2 | +296 | -484 | |
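The percentages beside the Matches and Correct Starts counts appear to be taken relative to the "# Genes" column. This is inferred from the figures and is not stated on the page. A short Python check against two rows of the table, with a hypothetical helper name:

```python
def percent_of_genes(count, n_genes):
    """Express a count as a percentage of the test-set size, to one decimal."""
    return round(100.0 * count / n_genes, 1)


# Archaeoglobus fulgidus: 1165 genes, 1161 matches, 873 correct starts.
af_matches = percent_of_genes(1161, 1165)
af_starts = percent_of_genes(873, 1165)

# Escherichia coli: 3603 genes, 3534 matches, 3112 correct starts.
ec_matches = percent_of_genes(3534, 3603)
ec_starts = percent_of_genes(3112, 3603)
```

These reproduce the tabulated 99.7% and 74.9% for Archaeoglobus fulgidus and 98.1% and 86.4% for Escherichia coli.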
Notes:
  • Not all the genomes necessarily have carefully/accurately annotated start sites, so the results for number of correct starts may be suspect.
  • Since Ureaplasma uses NCBI translation table 4 (TGA is not a stop codon), Glimmer3 must be run with an appropriate option ("-z 4" or "-Z taa,tag") and Glimmer2 must be modified and recompiled as described in its readme files in order to obtain these results.
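To illustrate why the stop codon set matters for Ureaplasma, the Python sketch below scans a made-up sequence for its first in-frame stop under the usual bacterial stop codons and under translation table 4, where TGA is read as a codon rather than a stop. The function and the sequence are hypothetical and unrelated to Glimmer's own implementation.

```python
STANDARD_STOPS = {"TAA", "TAG", "TGA"}  # usual bacterial stop codons
TABLE4_STOPS = {"TAA", "TAG"}           # NCBI translation table 4: TGA is not a stop


def first_in_frame_stop(seq, stops):
    """Return the 0-based index of the first in-frame stop codon, or None."""
    seq = seq.upper()
    for i in range(0, len(seq) - 2, 3):
        if seq[i:i + 3] in stops:
            return i
    return None


# Hypothetical sequence: ATG AAA TGA GGC TTT TAA
orf = "ATGAAATGAGGCTTTTAA"
standard_stop = first_in_frame_stop(orf, STANDARD_STOPS)
table4_stop = first_in_frame_stop(orf, TABLE4_STOPS)
```

With the standard set the reading frame ends at the TGA at index 6. With the table 4 set it continues to the TAA at index 15, so a gene finder using the wrong set would truncate the gene.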