The following two files were used to train the GlimmerM system:
genbank.coord
genbank.fasta
genbank.coord has the following format:
sequence_1_name CDS
start_exon_1 end_exon_1
start_exon_2 end_exon_2
.......................
start_exon_n end_exon_n
sequence_2_name CDS complement
end_exon_m start_exon_m
.......................
end_exon_1 start_exon_1
.......................
Here the first gene is on the direct strand of sequence 1 and has
n exons, while the second gene is on the reverse strand of sequence 2
(marked by the keyword "complement") and has m exons.
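
As an illustration of the layout described above, here is a minimal Python sketch of a reader for this kind of file. It is not part of GlimmerM. The function name parse_coord and the record layout are hypothetical, and the sketch assumes that fields are separated by whitespace, that every coordinate line holds exactly two integers, and that the rows of dots in the description only stand for omitted lines. Coordinate pairs are stored in the order they appear in the file, without reinterpreting them for the reverse strand.

```python
def parse_coord(text):
    """Parse coord-style text into a list of gene records (hypothetical layout)."""
    genes = []
    for line in text.splitlines():
        fields = line.split()
        if not fields:
            continue  # skip blank lines
        if set(line.strip()) == {"."}:
            continue  # skip rows of dots used as placeholders in the description
        if len(fields) == 2 and all(f.isdigit() for f in fields):
            # Coordinate line: belongs to the most recent header.
            if not genes:
                raise ValueError("coordinate line found before any header line")
            genes[-1]["exons"].append((int(fields[0]), int(fields[1])))
        else:
            # Header line: "<sequence name> CDS" or "<sequence name> CDS complement".
            strand = "-" if "complement" in fields[1:] else "+"
            genes.append({"name": fields[0], "strand": strand, "exons": []})
    return genes


# Made-up example data, for illustration only.
sample = """SEQ1 CDS
10 50
100 150
SEQ2 CDS complement
300 250
200 120
"""

for gene in parse_coord(sample):
    print(gene["name"], gene["strand"], gene["exons"])
```

Keeping the pairs exactly as written avoids baking in an assumption about which column is numerically larger on the reverse strand.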
genbank.fasta is a multi-FASTA file holding the DNA sequences that
contain the genes described in genbank.coord.
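
A multi-FASTA file can be read with a few lines of Python. The sketch below is only an illustration: the function name read_multifasta is hypothetical, and it assumes that the first word after ">" on each header line is the sequence name (whether that name matches the names used in genbank.coord is also an assumption).

```python
def read_multifasta(lines):
    """Return a dict mapping sequence name to sequence string."""
    records = {}
    name = None
    chunks = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # ignore blank lines
        if line.startswith(">"):
            # A new header closes the previous record, if any.
            if name is not None:
                records[name] = "".join(chunks)
            header = line[1:].split()
            name = header[0] if header else ""
            chunks = []
        elif name is not None:
            chunks.append(line)  # sequence data may span several lines
    if name is not None:
        records[name] = "".join(chunks)
    return records


# Made-up example data, for illustration only.
example = [">SEQ1 some description", "ATGC", "GGTA", ">SEQ2", "TTAA"]
print(read_multifasta(example))
```

To read a real file, the open file object can be passed directly, for example read_multifasta(open("genbank.fasta")).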
Latest Training
pf_trainset_cDNAs : set of 39 GenBank accessions of cDNA sequences
encoding full-length genes.
pf_trainset_genes : set of 117 GenBank accessions of genomic sequences
encoding full-length genes.
training.exons : set of GenBank accessions with exon coordinates that
were experimentally verified.
Some statistics on the pf_trainset_genes data set are:
Average length of gene: 2446.2
Average length of exon: 1265.9
Average length of intron: 180.3
Max no. of introns: 15 for gene PFPRIMSSU
Max length of introns: 922 for gene PFDNACPN
Min length of introns: 73 for gene PFPRIMSSU
Max length of single exons: 12681 for gene PFA245435
Min length of single exons: 294 for gene PFAHMGP
Max length of first exons: 8520 for gene PFARPI
Min length of first exons: 3 for gene PFAALD
Max length of internal exons: 1869 for gene PFU07706
Min length of internal exons: 42 for gene PFPRIMSSU
Max length of last exons: 9707 for gene AF312917
Min length of last exons: 44 for gene AF161264
Max length of gene: 12681 for gene PFA245435
Min length of gene: 294 for gene PFAHMGP
Max length of sequence spanning a gene: 12681
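
Exon and intron lengths like those above can be derived from a gene's exon coordinates. The following Python sketch shows one way to do it; it is not necessarily how the figures in this file were produced. The function name exon_and_intron_lengths is hypothetical, and the sketch assumes 1-based inclusive coordinates (as in GenBank feature tables) and non-overlapping exons.

```python
def exon_and_intron_lengths(exons):
    """Return (exon_lengths, intron_lengths) for one gene.

    exons is a list of coordinate pairs; the order within a pair and the
    order of the pairs do not matter, so both strands are handled alike.
    """
    # Normalise each pair to (low, high) and sort by position on the sequence.
    spans = sorted((min(a, b), max(a, b)) for a, b in exons)
    # Inclusive coordinates: an exon from 10 to 50 covers 41 bases.
    exon_lengths = [high - low + 1 for low, high in spans]
    # An intron is the gap between the end of one exon and the start of the next.
    intron_lengths = [
        next_low - prev_high - 1
        for (_, prev_high), (next_low, _) in zip(spans, spans[1:])
    ]
    return exon_lengths, intron_lengths


# Made-up coordinates, for illustration only.
print(exon_and_intron_lengths([(10, 50), (100, 150)]))    # direct strand
print(exon_and_intron_lengths([(300, 250), (200, 120)]))  # reverse strand
```

Lengths are returned in order of position along the sequence, so for a reverse-strand gene the first value corresponds to the last exon of the gene.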