Salzberg Lab Software

Computational Gene Finding and Metagenomics

  1. Glimmer uses interpolated Markov models (IMMs) to find genes in microbial DNA. It has been used around the world on thousands of genomes. Originally developed by Art Delcher and Steven Salzberg.

  2. Kraken is a very fast system for identifying the species represented by short (or long) DNA sequences, usually obtained through microbiome or metagenomic studies. The Kraken project is led by my former student Derrick Wood.

  3. Phymm and PhymmBL, first released in 2009, are systems for classifying short DNA sequences from metagenomics projects, labelling them with their likely species name. Originally developed by Arthur Brady.

  4. JIGSAW, a program that predicts gene models using the output from multiple sources of evidence, including other gene finders, Blast searches, and other alignment data. Originally developed by Jonathan Allen.

  5. GlimmerHMM, an interpolated Markov model system for finding genes in many eukaryotes, including P. falciparum, A. thaliana, rice (O. sativa), mosquito (A. aegypti), B. malayi, C. neoformans, and others. Originally developed by Ela Pertea.

  6. GeneZilla, a generalized HMM for eukaryotic gene finding developed by Bill Majoros, a former Salzberg lab member (when the lab was at TIGR).

  7. GeneSplicer, a fast system for detecting splice sites in genomic DNA of various eukaryotes. Originally developed by Mihaela Pertea.

Genome assembly, next-gen sequence alignment, and whole genome alignment

  1. Bowtie is an ultrafast system for aligning short reads from next-generation sequencers to the human genome or any other genome. Bowtie2, which supports gapped alignments and longer reads and is equally fast, appeared in late 2011. The Bowtie project is led by Ben Langmead.

  2. MUMmer is a system for aligning whole genomes, chromosomes, and other very long DNA sequences. It includes the Nucmer and Promer alignment tools. MUMmerGPU is a GPU-accelerated version of the core MUMmer system.

  3. TopHat is a fast splice junction mapper for RNA-Seq reads. TopHat is independent of annotation, meaning it can find novel exons and splice sites even if they are missing from standard gene annotation. TopHat was originally developed by Cole Trapnell. TopHat2, a major new version, was developed primarily by Daehwan Kim.

  4. StringTie is a new and fast transcript assembler and abundance estimator for RNA-seq data. Similar to Cufflinks, StringTie assembles transcripts from the alignments produced by TopHat, identifying novel isoforms and estimating expression levels for all transcripts. The StringTie project is led by Ela Pertea.

  5. Cufflinks assembles the reads from an RNA-seq experiment, producing full-length transcripts in multiple isoforms and quantitating the expression level of each gene and each isoform. The Cufflinks project is led by my former student Cole Trapnell.

  6. DIAMUND is an efficient algorithm for variant detection that compares DNA sequences directly to one another, without aligning them to the reference genome. When used on exome sequences from family trios, or to compare normal and diseased samples from the same individual, it produces a dramatically smaller list of candidate mutations than previous methods. Original developers: Steven Salzberg and Ela Pertea.

  7. TopHat-Fusion is an enhanced version of TopHat with the ability to align reads across chromosomal fusion points, which result from the breakage and re-joining of different chromosomes, a common event in some tumors. Original developer: Daehwan Kim.

  8. EDGE-pro aligns and quantitates transcript data from bacterial and archaeal RNA-seq experiments. Original developer: Tanja Magoc.

  9. The AMOS Assembler project is a set of tools, libraries, and freestanding genome assemblers, all open source. AMOS is also an open consortium that we started at TIGR, and that now includes multiple institutions.

  10. Hawkeye, a flexible graphical interface to genome assemblies from a variety of assemblers. Original developers: Mike Schatz and Adam Phillippy. Read the paper.

  11. AMOScmp is a comparative genome assembler, which uses one genome as a reference on which to assemble the genome of another, closely related species. Original developers: Mihai Pop and Adam Phillippy. See the journal paper here.

  12. Quake is a package to detect and correct substitution sequencing errors in whole-genome sequencing data sets with deep coverage, primarily for next-generation sequencing projects. Original developer: David Kelley. Read the paper.

  13. FLASH, Fast Length Adjustment of SHort reads, is a very fast program to merge paired-end reads that were sequenced from fragments that are shorter than twice the read length. Original developer: Tanja Magoc. Read the paper.

  14. Minimus is a small, lightweight assembler for small jobs such as assembling a viral genome, assembling a set of reads from a single gene, or other tasks that don't require a large-genome assembler. Original developer: Daniel Sommer. Read the paper.

  15. Bambus was the first publicly available, standalone genome assembly scaffolder. It orders and orients contigs into scaffolds based on various types of linking information. Mihai Pop's group subsequently released Bambus2.

  16. AutoEditor, a tool for correcting sequencing and basecaller errors using sequence assembly and chromatogram data (for older, capillary-based sequencers). On average AutoEditor corrects 80% of erroneous base calls, with an accuracy of 99.99%. Original developers: Pavel Gajer and Mike Schatz. Read the paper.
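
To make the FLASH entry above more concrete: when a fragment is shorter than twice the read length, the two reads of a pair overlap, so they can be joined into one longer sequence. The Python sketch below illustrates that idea only. It is not FLASH's actual algorithm or interface; the function names, the `min_overlap` default, and the simple mismatch threshold are all assumptions made for illustration.

```python
def reverse_complement(seq):
    """Return the reverse complement of a DNA string (A, C, G, T, N only)."""
    comp = {"A": "T", "C": "G", "G": "C", "T": "A", "N": "N"}
    return "".join(comp[base] for base in reversed(seq))


def merge_pair(read1, read2, min_overlap=10, max_mismatch_frac=0.0):
    """Merge a read pair whose ends overlap (hypothetical helper).

    read2 is given as sequenced, i.e. from the opposite strand, so it is
    reverse-complemented first. Candidate overlaps are tried from longest
    to shortest; the first one within the mismatch budget is accepted.
    Returns the merged sequence, or None if no acceptable overlap exists.
    """
    r2 = reverse_complement(read2)
    longest = min(len(read1), len(r2))
    for olen in range(longest, min_overlap - 1, -1):
        tail = read1[-olen:]   # end of the forward read
        head = r2[:olen]       # start of the reoriented reverse read
        mismatches = sum(a != b for a, b in zip(tail, head))
        if mismatches <= max_mismatch_frac * olen:
            # Keep read1 whole and append the non-overlapping part of r2.
            return read1 + r2[olen:]
    return None


# Usage: a 30-base fragment sequenced with 20-base reads from each end.
fragment = "ACGTTGCAAGGCTTACCGATGCTAGGCTAA"
read1 = fragment[:20]
read2 = reverse_complement(fragment[-20:])
merged = merge_pair(read1, read2)
```

Here the two 20-base reads share a 10-base overlap, so `merged` reconstructs the original fragment. A real tool also has to weigh base qualities and choose among competing overlaps, which this sketch ignores.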

Transcription terminators, operons, and motif analysis tools

  1. TransTermHP (updated in 2010), a program that finds rho-independent transcription terminators in bacterial genomes. Originally developed in 2000 by Maria Ermolaeva. Redesigned and re-implemented in 2007 by Carl Kingsford.

  2. OperonDB (update in progress, 2015), results from our operon-finding software on a large number of prokaryotic genomes. Described in Pertea et al. 2009, OperonDB: a comprehensive database of predicted operons in microbial genomes. Originally developed in 2001 by Maria Ermolaeva. Redesigned and re-implemented in 2008 by Mihaela Pertea.

  3. ELPH, a motif finder that can find ribosome binding sites, exon splicing enhancers, or regulatory sites. Original developer: Mihaela Pertea.

  4. SeeESE, an online tool for identifying exon splicing enhancers (ESEs) in Arabidopsis, Drosophila, and other species. Original developers: Mihaela Pertea and Steven Mount.

  5. Skewed oligomers from bacterial and archaeal genomes (described in Salzberg et al., Gene 217:1-2, 1998). Get the source code.

Machine learning systems, pre-1995 and pre-computational biology

  1. The OC1 decision tree system (source code included). Originally developed by S.K. Murthy.

  2. The PEBLS memory-based reasoning system (source code included). Originally developed by Scott Cost.