Research software

Next-generation sequence analysis

StringTie is a fast transcript assembler and abundance estimator for RNA-seq data. It assembles transcripts, including novel isoforms, from the alignments produced by TopHat and quantifies those transcripts.

TieBrush is an efficient method for merging redundant information from multiple alignment files. It is designed to enable rapid manipulation of extremely large sequencing datasets, including RNA, whole-genome, exome, and other data types.

Diamund is a new, efficient algorithm for variant detection that compares DNA sequences directly to one another, without aligning them to the reference genome.

We have also developed a computational screening test that takes the raw DNA sequence data from a whole-genome sequence of an individual human and tests for each of 68 known mutations in the BRCA1 and BRCA2 genes.

Splice site prediction

GeneSplicer is a fast and accurate splice site predictor program. It implements an algorithm that combines decision trees and Markov models to capture the statistical properties surrounding the splice site junction.
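
As a rough illustration of the Markov-model half of this idea, the sketch below trains a position-specific first-order Markov model on windows around true splice sites and another on decoy windows, then scores a candidate window by the log-odds between the two. It is not GeneSplicer's code: the decision-tree component is omitted, and the function names, the window width, the pseudocount, and the toy training windows are all assumptions made for the example.

```python
import math
from collections import defaultdict

ALPHABET = "ACGT"


def train_markov(windows, pseudocount=1.0):
    """Estimate P(base at position i | base at position i-1) from equal-length windows."""
    width = len(windows[0])
    # counts[i][prev][cur]; position 0 has no predecessor, so its context is "".
    counts = [defaultdict(lambda: defaultdict(float)) for _ in range(width)]
    for window in windows:
        for i, cur in enumerate(window):
            prev = window[i - 1] if i > 0 else ""
            counts[i][prev][cur] += 1
    model = []
    for i in range(width):
        contexts = [""] if i == 0 else list(ALPHABET)
        table = {}
        for prev in contexts:
            total = sum(counts[i][prev][c] for c in ALPHABET) + pseudocount * len(ALPHABET)
            table[prev] = {c: (counts[i][prev][c] + pseudocount) / total for c in ALPHABET}
        model.append(table)
    return model


def log_prob(model, window):
    """Log-probability of a window under a position-specific first-order model."""
    total = 0.0
    for i, cur in enumerate(window):
        prev = window[i - 1] if i > 0 else ""
        total += math.log(model[i][prev][cur])
    return total


def splice_score(true_model, false_model, window):
    """Log-odds score: positive values favor the true-site model."""
    return log_prob(true_model, window) - log_prob(false_model, window)


# Toy data, invented for illustration only.
true_sites = ["AGGTAAGT", "AGGTGAGT", "CGGTAAGT"]
decoys = ["AGCTAAGT", "TTGACCAA", "ACGTTGCA"]
true_model = train_markov(true_sites)
false_model = train_markov(decoys)
```

With real data, the two models would be trained on many confirmed sites and many non-site windows, and the score would be compared against a threshold chosen on held-out genes.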

GeneSplicerESE is a more accurate version of GeneSplicer for the plant Arabidopsis thaliana. For this improved splice site recognition code, we developed a new computational technique to identify significantly conserved motifs involved in splice site regulation. In collaboration with Steve Mount, we identified putative exonic splicing enhancer hexamers in Arabidopsis thaliana. We then used the Gibbs sampling program ELPH to locate conserved motifs represented by these hexamers in exonic regions near splice sites in confirmed genes. Integrating these regulatory motifs into GeneSplicer significantly improved its ability to correctly predict splice sites in a large database of confirmed genes.

Gene finding

GlimmerHMM is a Generalized Hidden Markov Model ab initio gene finder that incorporates GeneSplicer's splice site models and uses the same interpolated Markov models (IMMs) as GlimmerM to distinguish coding DNA sequence. Additionally, it incorporates open-source training modules, which make GlimmerHMM one of the few gene finders outside the bacterial realm that scientists or engineers other than the inventor can retrain and apply to new species. GlimmerHMM has been trained to recognize genes in many species, including human.

Motif finding

Motif detection techniques are an essential component of gene-finding software such as GeneSplicer and GlimmerM/GlimmerHMM, but they can also be used to solve other problems in computational biology. ELPH is a general-purpose Gibbs sampler for finding motifs in a set of DNA or protein sequences. The typical problem that ELPH has to solve is equivalent to discovering a word that appears at different, unknown positions in a large set of sequences.
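
The "unknown word at unknown positions" problem can be illustrated with a minimal textbook Gibbs site sampler. This is a sketch under stated assumptions, not ELPH's implementation: it assumes exactly one motif occurrence per sequence, a fixed motif width, no background model, and a hypothetical function name and parameters chosen for the example.

```python
import random
from collections import Counter


def gibbs_motif_search(seqs, width, iterations=200, seed=0,
                       alphabet="ACGT", pseudocount=1.0):
    """Return one sampled motif start position per sequence."""
    rng = random.Random(seed)
    # Start from a random motif position in every sequence.
    positions = [rng.randrange(len(s) - width + 1) for s in seqs]
    for _ in range(iterations):
        for i, seq in enumerate(seqs):
            # Build a per-column count profile from all sequences except the i-th.
            counts = [Counter() for _ in range(width)]
            for j, other in enumerate(seqs):
                if j == i:
                    continue
                site = other[positions[j]:positions[j] + width]
                for k, ch in enumerate(site):
                    counts[k][ch] += 1
            total = (len(seqs) - 1) + pseudocount * len(alphabet)
            # Weight each candidate start in the held-out sequence by its
            # likelihood under the profile (pseudocounts keep weights positive).
            weights = []
            for start in range(len(seq) - width + 1):
                w = 1.0
                for k in range(width):
                    w *= (counts[k][seq[start + k]] + pseudocount) / total
                weights.append(w)
            # Sample the new position in proportion to those weights.
            positions[i] = rng.choices(range(len(weights)), weights=weights)[0]
    return positions


# Toy sequences, invented for illustration; each contains the word "TATAAT".
toy_seqs = ["GGCTATAATGC", "TATAATCCGGA", "CCGGTATAATG", "ACGTTATAATC"]
toy_positions = gibbs_motif_search(toy_seqs, width=6, seed=1)
```

Because the sampler is stochastic, different seeds can settle on different positions; in practice one would run several restarts and keep the alignment with the best profile score.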

Operon prediction

Comparison of complete microbial genomes reveals a large number of conserved gene clusters - sets of genes that have the same order in two or more different genomes. Such gene clusters often, but not always, represent a co-transcribed unit, or operon. We developed a method to detect and analyze conserved gene pairs - pairs of genes that are located close together on the same DNA strand in two or more bacterial genomes - and for each such pair we estimated the probability that the genes belong to the same operon. All operon predictions made with this algorithm are integrated into the OperonDB database.
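
The detection step described above can be sketched as follows. This is only an illustration of finding conserved gene pairs, not the OperonDB method itself: the probability estimate is not covered, and the `Gene` record, the `family` label used to match genes across genomes, and the `max_gap` threshold are all hypothetical choices made for the example.

```python
from typing import Dict, FrozenSet, List, NamedTuple, Set


class Gene(NamedTuple):
    family: str  # hypothetical label shared by matching genes across genomes
    strand: str  # "+" or "-"
    start: int
    end: int


def adjacent_pairs(genes: List[Gene], max_gap: int = 200) -> Set[FrozenSet[str]]:
    """Family pairs for neighboring genes on the same strand within max_gap bases."""
    ordered = sorted(genes, key=lambda g: g.start)
    pairs = set()
    for left, right in zip(ordered, ordered[1:]):
        if left.strand == right.strand and right.start - left.end <= max_gap:
            # Unordered pair, so the same pair matches regardless of orientation.
            pairs.add(frozenset((left.family, right.family)))
    return pairs


def conserved_pairs(genomes: Dict[str, List[Gene]],
                    max_gap: int = 200) -> Dict[FrozenSet[str], List[str]]:
    """Map each pair seen in two or more genomes to the genomes that contain it."""
    seen: Dict[FrozenSet[str], List[str]] = {}
    for name, genes in genomes.items():
        for pair in adjacent_pairs(genes, max_gap):
            seen.setdefault(pair, []).append(name)
    return {pair: names for pair, names in seen.items() if len(names) >= 2}


# Toy genomes, invented for illustration.
toy_genomes = {
    "genomeA": [Gene("a", "+", 0, 100), Gene("b", "+", 150, 300), Gene("c", "-", 350, 500)],
    "genomeB": [Gene("b", "+", 1000, 1200), Gene("a", "+", 1250, 1400), Gene("d", "+", 3000, 3100)],
}
toy_conserved = conserved_pairs(toy_genomes)
```

In the toy data only the pair (a, b) is reported: b and c lie on opposite strands in genomeA, and a and d are too far apart in genomeB.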

Study of evolutionary events

The history of evolutionary events relating a group of species can often be discerned by searching for patterns in the aligned DNA sequences from those species. Detrimental mutations are quickly lost due to negative (or purifying) selection, creating regions of strong sequence conservation, while positive selection can appear as regions with many more mutations than expected. Different types of selection can act on particular lineages of a phylogeny. Although methods have been developed to identify regions under selection across species or in a particular lineage, little attention has been given to identifying selection pressure on both coding and non-coding sequences for any branch of a phylogeny. DivE is a method that can identify either positive or negative selection on any lineage in a phylogeny.