// Copyright (c) 1997-2005 by Arthur Delcher and Steven Salzberg. // All our code is OPEN SOURCE - see the LICENSE file. Feel free // to redistribute! // Version 1.02 revised 25 Feb 98 to ignore the independent // (random) model for long orfs. The default // length for "long" in this case is set to the length at which // exactly 1 orf of this length would be expected per 1 million // bases given the gc content of the genome. This value also can be // set by command-line option -q . // Version 1.03 revised 8 Feb 99 to make it easier to specify // start and stop codons. // Version 1.04 revised 10 May 99 to add -l command-line switch // to both glimmer and long-orfs to regard genome as *NOT* // circular. Default is to regard it as circular. // Version 2.0 uses a tree-based IMM as described in the references // given in the README file. It also implements an extensive new // algorithm (see the paper) to adjust the start locations of genes // whose initial coordinates result in an overlap. // Version: 2.01 31 Jul 98 // Change probability model // Simplify wraparounds // Move start codons to eliminate overlaps // Discount independent model scores when // there are no overlaps // Uses Harmon's model // Version: 2.03 9 Dec 2002 // Include raw scores in output // Add strict option to use independent intergenic // model that discounts stop codons // Add option to score each entry from a list of coordinates // separately, without overlapping/voting rules // Version: 2.10 5 Feb 2003 // Strict option to use independent intergenic // model that discounts stop codons is only behaviour // Version: 2.11 18 Apr 2003 // Change long-orfs to automatically compute the // optimal value of ORF length in order to maximize // the amount of training data. Program glimmer takes two inputs: a sequence file (in FASTA format) and a collection of Markov models for genes as produced by the program build-icm . It outputs a list of all open reading frames (orfs) together with scores for each as a gene. The first few lines of output specify the settings of various parameter in the program: Minimum gene length is the length of the smallest fragment considered to be a gene. The length is measured from the first base of the start codon to the last base *before* the stop codon. This value can be specified when running the program with the -g option. Minimum overlap length is a lower bound on the number of bases overlap between 2 genes that is considered a problem. Overlaps shorter than this are ignored. Minimum overlap percent is another lower bound on the number of bases overlap that is considered a problem. Overlaps shorter than this percentage of *both* genes are ignored. Threshold score is the minimum in-frame score for a fragment to be considered a potential gene. Use independent scores indicates whether the last column that scores each fragment using independent base probabilities is present. Use first start codon indicates whether the first possible start codon is used or not. If not, the function Choose_Start is called to choose the start codon. Currently it computes hybridization energy between the string Ribosome_Pattern and the region in front of the start codon, and if this is above a threshold, that start site is chosen. The ribosome pattern string can be set by the -s option. Presumably function Choose_Start should be modified to do something cleverer. Currently used start codons are atg, gtg & ttg . These can be changed in the function Is_Start , but corresponding changes should be made in Choose_Start . The next portion of the output is the result for each orf: Column 1 is an ID number for reference purposes. It is assigned sequentially starting with 1 to all orfs whose Gene Score is at least 90 . I'll make this a command-line option when I decide what letter to use. Column 2 is the reading frame of the orf. Three forward (F1, F2 and F3) and three reverse (R1, R2 and R3). These correspond with the headings for the scores in columns 9-14. Column 3 is the start position of the orf, i.e., the first base *after* the previous stop codon. Column 4 is the position of the first base of the first start codon in the orf. Currently I use atg, ctg, gtg and ttg as start codons. Column 5 is the position of the last base *before* the stop codon. Stop codons are taa, tag, and tga. Note that for orfs in the reverse reading frames have their start position higher than the end position. The order in which orfs are listed is in increasing order by Max {OrfStart, End}, i.e., the highest numbered position in the orf, except for orfs that "wrap around" the end of the sequence. Columns 6 and 7 are the lengths of the orf and gene, respectively, i.e., 1 + |OrfStart - End| and 1 + |GeneStart - End| . Column 8 is the score for the gene region. It is the probability (as a percent) that the Markov model in the correct frame generated this sequence. This value matches the value in the corresponding column of frame scores--an orf in reading frame R1 has a Gene Score equal to the value in the R1 column of frame scores for that orf. Columns 9-14 are the scores for the gene region in each of the 6 reading frames. It is the probability (as a percent) that the Markov model in that frame generated this sequence. Column 15 is the probability as a percent that the gene sequence was generated by a model of independent probabilities for each base, and represents to some extent the probability that the sequence is "random". When two genes with ID numbers overlap by at least a sufficient amount (as determined by Min_Olap and Min_Olap_Percent ), a line beginning with *** is printed and scores for the overlap region are printed. If the frame of the high score of the overlap region matches the frame of the longer gene, then a message is printed that the shorter gene is rejected. Otherwise, a message is printed that *both* genes are "suspect". A suspect or reject message for any gene is only printed once, however. A message is also printed if a gene with an ID number wholly contains another gene with an ID number. The longer "shadows" the shorter. At the end a list of "putative" gene positions is produced. The first column is the ID number, the second is the start position, the third is the end position. For "suspect" genes, a notation in [] 's follows: [Bad Olap a b c] means that gene number a overlapped this one and was shorter but scored higher on the overlap region. b is the length of the overlap region and c is the score of *this* gene on the overlap region. There should be a [Shorter ...] notation with gene a giving its score. [Shorter a b c] means that gene number a overlapped this one and was longer but scored lower on the overlap region. b is the length of the overlap region and c is the score of *this* gene on the overlap region. There should be a [Bad olap ...] notation with gene a giving its score. [Shadowed by a] means that this gene was completed contained as part of gene a 's region, but in another frame. [Delay by a b c d] means that this gene was tentatively rejected because of an overlap with gene b , but if the start codon is postponed by a positions, then this would be a valid gene. The start position reported for this gene includes the delay. c is the length of the overlap region that caused the rejection and d is the score in this gene's frame on that overlap region. [Weak] means that this gene did not meet the regular scoring threshold, but if the independent model were ignored, its score would be high enough. Should only occur if the -w option is used. [Vote] means that this gene did not meet the regular scoring threshold, but sufficiently many of its subranges had high enough scores to indicate it might be a gene. Note that a gene marked as rejected may appear in this list. This can occur if the gene that caused the rejection was itself rejected. The actual algorithm to produce the list is as follows: Consider the genes in decreasing order by length. If gene x is to be rejected because of an overlap with longer gene y that has not been rejected, then gene x is rejected and does not appear in the list. Otherwise, all notations for gene x that are not caused by rejected genes are reported. I think a "delayed" gene might incorrectly be listed as causing a problem by the part of it that was eliminated by the delay. Probably the remaining portion should be reinserted into the sorted list base on its now-shorter length, and any notations caused by it should be re-checked to see if they're affected by shortening the gene. Let's save this for the next version. Specifying Different Start and Stop Codons: To specify different sets of start and stop codons, modify the file gene.h . Specifically, the functions: Is_Forward_Start Is_Reverse_Start Is_Start Is_Forward_Stop Is_Reverse_Stop Is_Stop are used to determine what is used for start and stop codons. Is_Start and Is_Stop do simple string comparisons to specify which patterns are used. To add a new pattern, just add the comparison for it. To remove a pattern, comment out or delete the comparison for it. The other four functions use a bit comparison to determine start and stop patterns. They represent a codon as a 12-bit pattern, with 4 bits for each base, one bit for each possible value of the bases, T, G, C or A. Thus the bit pattern 0010 0101 1100 represents the base pattern [C] [A or G] [G or T]. By doing bit operations (& | ~) and comparisons, more complicated patterns involving ambiguous reads can be tested efficiently. Simple patterns can be tested as in the current code. For example, to insert an additional start codon of CAT requires 3 changes: 1. The line || (Codon & 0x218) == Codon should be inserted into Is_Forward_Start , since 0x218 = 0010 0001 1000 represents CAT. 2. The line || (Codon & 0x184) == Codon should be inserted into Is_Reverse_Start , since 0x184 = 0001 1000 0100 represents ATG, which is the reverse-complement of CAT. Alternately, the #define constant ATG_MASK could be used. 3. The line || strncmp (S, "cat", 3) == 0 should be inserted into Is_Start . If not automatically using the first start codon, some changes might also be made to the function Choose_Start . To compile the program: Use the Makefile. It will put the executables in a bin subdirectory. To compile just this program use: g++ glimmer2.c -lm -o glimmer Uses include files delcher.h context.h strarray.h gene.h To run the program: First run build-icm on a set of sequences to make the Markov models. build-icm train.model This will produce a file train.model. You can call this file anything you like, train.model, myicm, itsrainingtoday, etc. Then run glimmer2 glimmer2 hflu.seq train.model Options can be specified after the 2nd file name glimmer2 hflu.seq train.model Options are: -f Use ribosome-binding energy to choose start codon. This is not fully tested and likely to be buggy. Better not to use it. +f Use first codon in orf as start codon -g n Set minimum gene length to n -i s Ignore bases within the coordinates listed in file s. File s should consist of one base pair per line (no tags), and the ignore region should be a multiple of three bases long. [Somewhat buggy] -l Regard the genome as linear (not circular), i.e., do not allow genes to "wrap around" the end of the genome. This option works on both glimmer and long-orfs . The default behavior is to regard the genome as circular. -o n Set minimum overlap length to n. Overlaps shorter than this are ignored. -p n Set minimum overlap percentage to n%. Overlaps shorter than this percentage of *both* strings are ignored. -q n If using independent model scores (+r option), it will only apply to orfs shorter than n . The default value for n has an expectation of one orf that length or longer occurring per million bases in a random genome with the same gc content -r Don't use independent probability score column +r Use independent probability score column -s s Use string s as the ribosome binding pattern to find start codons. Not fully tested and known to have bugs. -t n Set threshold score for calling as gene to n. If the in-frame score >= n, then the region is given a number and considered a potential gene. -w n Use "weak" scores on potential genes at least n bases long. Weak scores ignore the independent model. -X Allow orfs extending off ends of sequence to be scored