//  Copyright (c) 1997-2005 by Arthur Delcher and Steven Salzberg.
//  All our code is OPEN SOURCE - see the LICENSE file.  Feel free
//  to redistribute!

//  Version 1.02 revised 25 Feb 98 to ignore the independent
//  (random) model for long orfs.  The default
//  length for "long" in this case is set to the length at which
//  exactly 1 orf of this length would be expected per 1 million
//  bases given the gc content of the genome.  This value also can be
//  set by command-line option  -q .

//  Version 1.03 revised  8 Feb 99 to make it easier to specify
//  start and stop codons.

//  Version 1.04 revised  10 May 99 to add  -l  command-line switch
//  to both  glimmer  and  long-orfs  to regard genome as *NOT*
//  circular.  Default is to regard it as circular.
//  Version 2.0 uses a tree-based IMM as described in the references
//  given in the README file.  It also implements an extensive new
//  algorithm (see the paper) to adjust the start locations of genes
//  whose initial coordinates result in an overlap.

//     Version:  2.01  31 Jul 98
//                 Change probability model
//                 Simplify wraparounds
//                 Move start codons to eliminate overlaps
//                 Discount independent model scores when
//                    there are no overlaps
//                 Uses Harmon's model

//     Version:  2.03  9 Dec 2002
//               Include raw scores in output
//               Add strict option to use independent intergenic
//                 model that discounts stop codons
//               Add option to score each entry from a list of coordinates
//                 separately, without overlapping/voting rules

//     Version:  2.10  5 Feb 2003
//               Strict option to use independent intergenic
//                 model that discounts stop codons is only behaviour

//     Version:  2.11  18 Apr 2003
//               Change long-orfs to automatically compute the
//               optimal value of ORF length in order to maximize
//               the amount of training data.  
Program  glimmer  takes two inputs:  a sequence file (in FASTA format)
and a collection of Markov models for genes as produced by the program
build-icm .  It outputs a list of all open reading frames (orfs) together
with scores for each as a gene.

The first few lines of output specify the settings of various
parameter in the program:

 Minimum gene length is the length of the smallest fragment
   considered to be a gene.  The length is measured from the first base
   of the start codon to the last base *before* the stop codon.
   This value can be specified when running the program with the  -g  option.

 Minimum overlap length is a lower bound on the number of bases overlap
   between 2 genes that is considered a problem.  Overlaps shorter than
   this are ignored.

 Minimum overlap percent is another lower bound on the number of bases
   overlap that is considered a problem.  Overlaps shorter than this
   percentage of *both* genes are ignored.

 Threshold score is the minimum in-frame score for a fragment to be
   considered a potential gene.

 Use independent scores indicates whether the last column that scores each
   fragment using independent base probabilities is present.

 Use first start codon indicates whether the first possible start codon
   is used or not.  If not, the function  Choose_Start  is called to
   choose the start codon.  Currently it computes hybridization energy
   between the string  Ribosome_Pattern  and the region in front of
   the start codon, and if this is above a threshold, that start site
   is chosen.  The ribosome pattern string can be set by the  -s  option.
   Presumably function  Choose_Start  should be modified to do something
   cleverer.

   Currently used start codons are  atg, gtg & ttg .  These can be changed
   in the function  Is_Start , but corresponding changes should be
   made in  Choose_Start .


The next portion of the output is the result for each orf:

 Column 1 is an ID number for reference purposes.  It is assigned
   sequentially starting with  1  to all orfs whose  Gene Score  is
   at least  90 .  I'll make this a command-line option when I decide
   what letter to use.

 Column 2 is the reading frame of the orf.  Three forward (F1, F2 and F3)
   and three reverse (R1, R2 and R3).  These correspond with the headings
   for the scores in columns 9-14.

 Column 3 is the start position of the orf, i.e., the first base *after*
   the previous stop codon.

 Column 4 is the position of the first base of the first start codon in
   the orf.  Currently I use atg, ctg, gtg and ttg as start codons.

 Column 5 is the position of the last base *before* the stop codon.  Stop
   codons are taa, tag, and tga.  Note that for orfs in the reverse
   reading frames have their start position higher than the end position.
   The order in which orfs are listed is in increasing order by
   Max {OrfStart, End}, i.e., the highest numbered position in the orf,
   except for orfs that "wrap around" the end of the sequence.

 Columns 6 and 7 are the lengths of the orf and gene, respectively, i.e.,
   1 + |OrfStart - End|  and  1 + |GeneStart - End| .

 Column 8 is the score for the gene region.  It is the probability (as
   a percent) that the Markov model in the correct frame generated this
   sequence.  This value matches the value in the corresponding column
   of frame scores--an orf in reading frame R1 has a Gene Score equal to
   the value in the R1 column of frame scores for that orf.

 Columns 9-14 are the scores for the gene region in each of the 6 reading
   frames.  It is the probability (as a percent) that the Markov model in
   that frame generated this sequence.

 Column 15 is the probability as a percent that the gene sequence was generated
   by a model of independent probabilities for each base, and represents to
   some extent the probability that the sequence is "random".


When two genes with ID numbers overlap by at least a sufficient
amount (as determined by  Min_Olap  and  Min_Olap_Percent ), a line
beginning with  ***  is printed and scores for the overlap region
are printed.  If the frame of the high score of the overlap
region matches the frame of the longer gene, then a message is
printed that the shorter gene is rejected.  Otherwise, a message
is printed that *both* genes are "suspect".  A suspect or reject
message for any gene is only printed once, however.

A message is also printed if a gene with an ID number wholly contains another
gene with an ID number.  The longer "shadows" the shorter.


At the end a list of "putative" gene positions is produced.  The first
column is the ID number, the second is the start position, the third
is the end position.  For "suspect" genes, a notation in  [] 's follows:

 [Bad Olap  a  b  c]  means that gene number  a  overlapped this one and
   was shorter but scored higher on the overlap region.   b  is the length
   of the overlap region and  c  is the score of *this* gene on the overlap
   region.  There should be a  [Shorter ...]  notation with gene  a
   giving its score.

 [Shorter  a  b  c]  means that gene number  a  overlapped this one and
   was longer but scored lower on the overlap region.   b  is the length
   of the overlap region and  c  is the score of *this* gene on the overlap
   region.  There should be a  [Bad olap ...]  notation with gene  a
   giving its score.

 [Shadowed by  a]  means that this gene was completed contained as part
   of gene  a 's region, but in another frame.

 [Delay by  a  b  c  d]  means that this gene was tentatively rejected
   because of an overlap with gene  b , but if the start codon is postponed
   by  a  positions, then this would be a valid gene.  The start position
   reported for this gene includes the delay.   c  is the length of the overlap
   region that caused the rejection and  d  is the score in this gene's frame
   on that overlap region.

 [Weak]  means that this gene did not meet the regular scoring threshold,
   but if the independent model were ignored, its score would be high
   enough.  Should only occur if the -w option is used.

 [Vote]  means that this gene did not meet the regular scoring threshold,
   but sufficiently many of its subranges had high enough scores to
   indicate it might be a gene.

Note that a gene marked as rejected may appear in this list.  This can
occur if the gene that caused the rejection was itself rejected.  The
actual algorithm to produce the list is as follows:

  Consider the genes in decreasing order by length.  If gene  x  is to
  be rejected because of an overlap with longer gene  y   that has not been
  rejected, then gene  x  is rejected and does not appear in the list.
  Otherwise, all notations for gene  x  that are not caused by rejected
  genes are reported.

I think a "delayed" gene might incorrectly be listed as causing a problem
by the part of it that was eliminated by the delay.  Probably the remaining
portion should be reinserted into the sorted list base on its now-shorter
length, and any notations caused by it should be re-checked to see if
they're affected by shortening the gene.  Let's save this for the next
version.


Specifying Different Start and Stop Codons:

To specify different sets of start and stop codons, modify the file
gene.h .  Specifically, the functions:

   Is_Forward_Start     Is_Reverse_Start     Is_Start
   Is_Forward_Stop      Is_Reverse_Stop      Is_Stop

are used to determine what is used for start and stop codons.

Is_Start  and  Is_Stop  do simple string comparisons to specify
which patterns are used.  To add a new pattern, just add the comparison
for it.  To remove a pattern, comment out or delete the comparison
for it.

The other four functions use a bit comparison to determine start and
stop patterns.  They represent a codon as a 12-bit pattern, with 4 bits
for each base, one bit for each possible value of the bases, T, G, C
or A.  Thus the bit pattern  0010 0101 1100  represents the base
pattern  [C] [A or G] [G or T].  By doing bit operations (& | ~) and
comparisons, more complicated patterns involving ambiguous reads
can be tested efficiently.  Simple patterns can be tested as in
the current code.

For example, to insert an additional start codon of CAT requires 3 changes:
1. The line
        || (Codon & 0x218) == Codon
   should be inserted into  Is_Forward_Start , since 0x218 = 0010 0001 1000
   represents CAT.
2. The line
        || (Codon & 0x184) == Codon
   should be inserted into  Is_Reverse_Start , since 0x184 = 0001 1000 0100
   represents ATG, which is the reverse-complement of CAT.  Alternately,
   the #define constant  ATG_MASK  could be used.
3. The line
        || strncmp (S, "cat", 3) == 0
   should be inserted into  Is_Start .
If not automatically using the first start codon, some changes might
also be made to the function  Choose_Start .


To compile the program:

  Use the Makefile.  It will put the executables in a  bin  subdirectory.

  To compile just this program use:

      g++ glimmer2.c -lm -o glimmer

  Uses include files  delcher.h  context.h  strarray.h  gene.h


To run the program:

  First run  build-icm  on a set of sequences to make the Markov models.

      build-icm <train.seq >train.model

  This will produce a file  train.model.  You can call this file anything
  you like, train.model, myicm, itsrainingtoday, etc.

  Then run  glimmer2

      glimmer2 hflu.seq train.model

  Options can be specified after the 2nd file name

      glimmer2 hflu.seq train.model <options>

  Options are:
    -f     Use ribosome-binding energy to choose start codon.  This is
	   not fully tested and likely to be buggy.  Better not to use it.
    +f     Use first codon in orf as start codon
    -g n   Set minimum gene length to n
    -i s   Ignore bases within the coordinates listed in file s. File s
	   should consist of one base pair per line (no tags), and the ignore
           region should be a multiple of three bases long.  [Somewhat buggy]
    -l     Regard the genome as linear (not circular), i.e., do not allow
           genes to "wrap around" the end of the genome.
           This option works on both  glimmer  and  long-orfs .
           The default behavior is to regard the genome as circular.
    -o n   Set minimum overlap length to n.  Overlaps shorter than this
           are ignored.
    -p n   Set minimum overlap percentage to n%.  Overlaps shorter than
           this percentage of *both* strings are ignored.
    -q n   If using independent model scores (+r option), it will only
           apply to orfs shorter than  n .  The default value for  n
           has an expectation of one orf that length or longer occurring
           per million bases in a random genome with the same gc content
    -r     Don't use independent probability score column
    +r     Use independent probability score column
    -s s   Use string s as the ribosome binding pattern to find start codons.
	   Not fully tested and known to have bugs.
    -t n   Set threshold score for calling as gene to n.  If the in-frame
           score >= n, then the region is given a number and considered
           a potential gene.
    -w n   Use "weak" scores on potential genes at least n bases long.
           Weak scores ignore the independent model.
    -X     Allow orfs extending off ends of sequence to be scored