Next-generation sequence analysis

StringTie is a fast and highly efficient assembler of RNA-Seq alignments into potential transcripts. It uses a novel network flow algorithm as well as an optional de novo assembly step to assemble and quantitate full-length transcripts representing multiple splice variants for each gene locus. Its input can include alignments of either short or long RNA-seq reads or a combination of both.

SPIT is a statistical tool that quantifies the heterogeneity in transcript usage within a population and identifies predominant subgroups along with their distinctive sets of differential transcript usage (DTU) events.

EASTR is a tool for detecting and eliminating spuriously spliced alignments in RNA-seq datasets.

TieBrush is an efficient method for merging redundant information from multiple alignment files, designed to enable rapid manipulation of extremely large sequencing datasets, including RNA-seq, whole-genome, exome and other data types.

Diamund is a new, efficient algorithm for variant detection that compares DNA sequences directly to one another, without aligning them to the reference genome.

The BRCA diagnostic software is a computational screening test that takes the raw DNA sequence data from a whole-genome sequence of an individual human and tests for each of 68 known mutations in the BRCA1 and BRCA2 genes.

Splice site prediction

Splam is a splice site predictor that uses a deep residual convolutional neural network for fast and accurate evaluation of splice junctions based solely on 400-nt DNA sequences around donor and acceptor sites.

GeneSplicer is a fast and accurate splice site predictor program. It implements an algorithm that combines decision trees and Markov models to capture the statistical properties surrounding the splice site junction.

GeneSplicerESE is a more accurate version of GeneSplicer for the plant Arabidopsis thaliana, using SeeEse, a computational technique to identify significantly conserved motifs involved in splice site regulation.

Gene finding

CHESS is a comprehensive set of human genes based on nearly 10,000 RNA sequencing experiments produced by the GTEx project.

GlimmerHMM is a Generalized Hidden Markov Model ab initio gene finder that incorporates GeneSplicer's splice site models and uses the same interpolated Markov models (IMMs) as GlimmerM to distinguish coding DNA sequence. Additionally, it incorporates open-source training modules that make GlimmerHMM one of the few gene finders outside the bacterial realm that scientists or engineers other than the inventor can re-train and make work for new species. GlimmerHMM has been trained to recognize genes in many species, including human.
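To give a feel for how a Markov model can separate coding from non-coding sequence, here is a minimal sketch in Python. It is not GlimmerHMM's code and it is not an interpolated Markov model: it uses a single fixed order with add-one smoothing, which is a deliberate simplification. The function names `train_markov` and `log_odds` and the toy training strings are assumptions made for illustration only.

```python
import math
from collections import defaultdict

def train_markov(seqs, order=2, alphabet="ACGT"):
    """Estimate P(base | previous `order` bases) with add-one smoothing.
    Fixed-order model: a simplification, NOT an interpolated Markov model."""
    counts = defaultdict(lambda: {a: 1.0 for a in alphabet})
    for s in seqs:
        for i in range(order, len(s)):
            counts[s[i - order:i]][s[i]] += 1
    return {ctx: {a: c[a] / sum(c.values()) for a in alphabet}
            for ctx, c in counts.items()}

def log_odds(seq, coding, noncoding, order=2, alphabet="ACGT"):
    """Sum of log(P_coding / P_noncoding) over the sequence.
    Contexts never seen in training fall back to a uniform distribution."""
    uniform = 1.0 / len(alphabet)
    score = 0.0
    for i in range(order, len(seq)):
        ctx, base = seq[i - order:i], seq[i]
        p = coding.get(ctx, {}).get(base, uniform)
        q = noncoding.get(ctx, {}).get(base, uniform)
        score += math.log(p / q)
    return score

# Toy training data (hypothetical, far too small for real use).
coding_model = train_markov(["ATGGCCGCCGCCGCCTAA"])
noncoding_model = train_markov(["ATATATATATATATAT"])

gc_rich_score = log_odds("GCCGCCGCC", coding_model, noncoding_model)
at_repeat_score = log_odds("ATATATAT", coding_model, noncoding_model)
```

A positive score means the sequence looks more like the coding training set, a negative score more like the non-coding set. An interpolated model would blend several context lengths instead of committing to one order.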

Motif finding

Motif detection techniques are an essential component of gene-finding software such as GeneSplicer and GlimmerM/GlimmerHMM, but they can also be used to solve other problems in computational biology. ELPH is a general-purpose Gibbs sampler for finding motifs in a set of DNA or protein sequences. The typical problem that ELPH has to solve is equivalent to the discovery of a word that appears at different unknown positions in a large set of sequences.
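The "word at unknown positions" problem can be illustrated with a textbook Gibbs site sampler. The Python sketch below is not ELPH's implementation; it assumes exactly one motif occurrence per sequence, a fixed motif width, a uniform background, and sequences made up only of the letters in `alphabet`. The function name `gibbs_motif_search` and all parameters are hypothetical.

```python
import random

def gibbs_motif_search(seqs, width, iterations=500, seed=0, alphabet="ACGT"):
    """Return one sampled motif start position per sequence.
    Assumes every sequence is at least `width` long."""
    rng = random.Random(seed)
    # Start from random motif positions.
    pos = [rng.randrange(len(s) - width + 1) for s in seqs]

    def profile(exclude):
        # Column-wise letter frequencies (pseudocount 1) built from
        # the current motif instances of all sequences except one.
        counts = [{a: 1.0 for a in alphabet} for _ in range(width)]
        for i, (s, p) in enumerate(zip(seqs, pos)):
            if i == exclude:
                continue
            for j in range(width):
                counts[j][s[p + j]] += 1
        return [{a: c[a] / sum(c.values()) for a in alphabet} for c in counts]

    for _ in range(iterations):
        i = rng.randrange(len(seqs))          # pick one sequence to resample
        prof = profile(exclude=i)
        s = seqs[i]
        weights = []
        for start in range(len(s) - width + 1):
            w = 1.0
            for j in range(width):
                w *= prof[j][s[start + j]]    # likelihood of this window
            weights.append(w)
        # Sample a new start in proportion to the window likelihoods.
        pos[i] = rng.choices(range(len(weights)), weights=weights)[0]
    return pos

# Toy input (hypothetical): the word TTGACA planted in each sequence.
example_seqs = ["ACGTTTGACAGG", "TTGACACCGTAA", "GGCATTGACATC", "CATTGACAGGCA"]
starts = gibbs_motif_search(example_seqs, width=6)
motif_instances = [s[p:p + 6] for s, p in zip(example_seqs, starts)]
```

Because the sampler is stochastic, the returned positions depend on the seed and the number of iterations; in practice one would run several restarts and keep the alignment with the best score.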

Operon prediction

Comparison of complete microbial genomes reveals a large number of conserved gene clusters - sets of genes that have the same order in two or more different genomes. Such gene clusters often, but not always, represent a co-transcribed unit, or operon. We developed a method to detect and analyze conserved gene pairs - pairs of genes that are located close together on the same DNA strand in two or more bacterial genomes - and for each such pair we estimated the probability that the genes belong to the same operon. All operon predictions determined with this algorithm are integrated into the OperonDB database.
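The detection step can be sketched in a few lines of Python. This is an illustration only, not the OperonDB pipeline: it assumes genes have already been assigned shared ortholog identifiers across genomes, treats "close" as a gap of at most `max_gap` bases between neighbouring genes (an arbitrary threshold), and stops at counting genomes rather than estimating an operon probability. All names and coordinates below are hypothetical.

```python
from collections import defaultdict

def adjacent_same_strand_pairs(genes, max_gap=200):
    """genes: list of (ortholog_id, start, end, strand) for one genome.
    Returns unordered id pairs for neighbouring genes on the same strand."""
    pairs = set()
    ordered = sorted(genes, key=lambda g: g[1])   # sort by start coordinate
    for left, right in zip(ordered, ordered[1:]):
        same_strand = left[3] == right[3]
        gap = right[1] - left[2]                  # intergenic distance
        if same_strand and gap <= max_gap:
            pairs.add(frozenset((left[0], right[0])))
    return pairs

def conserved_pairs(genomes, min_genomes=2, max_gap=200):
    """Count, for each gene pair, how many genomes contain it as a
    close same-strand neighbour pair; keep pairs seen in >= min_genomes."""
    counts = defaultdict(int)
    for genes in genomes.values():
        for pair in adjacent_same_strand_pairs(genes, max_gap):
            counts[pair] += 1
    return {pair: n for pair, n in counts.items() if n >= min_genomes}

# Toy annotation (hypothetical coordinates).
genomes = {
    "genomeA": [("trpA", 100, 900, "+"), ("trpB", 950, 2100, "+"),
                ("xyz", 2150, 3000, "-")],
    "genomeB": [("trpB", 5000, 6100, "-"), ("trpA", 6150, 6950, "-"),
                ("abc", 9000, 9900, "-")],
}
result = conserved_pairs(genomes)
```

In this toy input only the trpA/trpB pair is a close same-strand neighbour pair in both genomes, so it is the only pair kept. The number of genomes supporting a pair is the kind of evidence a probability estimate could then be built on.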

Study of evolutionary events

The history of evolutionary events relating a group of species can often be discerned by searching for patterns in the aligned DNA sequences from those species. Detrimental mutations are quickly lost due to negative (or purifying) selection, creating regions of strong sequence conservation, while positive selection can appear as regions with many more mutations than expected. Different types of selection can act on particular lineages of a phylogeny. Although methods have been developed to identify regions under selection across species or in a particular lineage, little attention has been given to identifying selection pressure on both coding and non-coding sequences for any branch of a phylogeny. DivE is a method that can identify either positive or negative selection on any lineage in a phylogeny.