Next-generation sequence alignment software

An ultrafast, memory-efficient short read aligner that aligns short DNA sequences to the human genome at a rate of about 25 million reads per hour on a typical desktop computer. Bowtie indexes the genome with a Burrows-Wheeler index to keep its memory footprint small: 2.3 GB for the human genome. Bowtie and Bowtie2 were developed by Ben Langmead and are actively supported by his lab.
A spliced alignment system for RNA-seq experiments. TopHat finds known and novel exon-exon splice junctions and is extremely fast due to its use of the Bowtie2 aligner. The latest release, TopHat2, runs with either Bowtie1 or Bowtie2 and includes new algorithms that significantly enhance TopHat's sensitivity, particularly in the presence of pseudogenes. TopHat2 includes TopHat-Fusion as an option.
TopHat-Fusion is an enhanced version of TopHat with the ability to align reads across fusion points, which result from the breakage and re-joining of two different chromosomes, or from rearrangements within a chromosome.
HISAT is a new, highly efficient system for aligning RNA-seq reads. HISAT uses a new indexing scheme, hierarchical indexing, which is inherently well-suited for aligning across introns. It employs two types of indexes for alignment: (1) a whole-genome FM index to anchor each alignment, and (2) numerous local FM indexes for very rapid extensions of these alignments. HISAT supports genomes of any size, including those larger than 4 billion bases.
HISAT2 is a new, rapid and accurate system for aligning NGS reads (both DNA and RNA) against a population of genomes. HISAT2 is a successor to both HISAT and TopHat2. In this program, we extended the Burrows-Wheeler transform (BWT) and the Ferragina-Manzini (FM) index to incorporate genomic differences among individuals into the reference genome.
A transcript assembler and abundance estimator for RNA-seq data. Cufflinks assembles transcripts from the alignments produced by TopHat, including novel isoforms, and quantitates those transcripts. Cufflinks was originally developed by Cole Trapnell and is supported by his lab at the University of Washington.
A new and fast transcript assembler and abundance estimator for RNA-seq data. Similar to Cufflinks, StringTie assembles transcripts from the alignments produced by TopHat, including novel isoforms, and quantitates those transcripts.
A program for computing differentially expressed genes in two or more RNA-seq experiments, using the output of StringTie or Cufflinks. The Ballgown package provides functions to organize, visualize, and analyze expression measurements. Ballgown is written in R and is part of Bioconductor.
A program for highly sensitive short read mapping using MapReduce. CloudBurst, developed by Michael Schatz (now a faculty member at JHU Computer Science), uses Hadoop to efficiently parallelize the short read mapping problem to dozens or hundreds of computers. This enables CloudBurst to execute highly sensitive read mappings with any number of mutations or indels.
Crossbow is a scalable software pipeline for whole genome resequencing analysis. It combines Bowtie, an ultrafast and memory-efficient short read aligner, and SoapSNP, an accurate genotyper, within Hadoop to distribute and accelerate the computation with many nodes. In the Crossbow paper, we used it to analyze 35x coverage of a human genome in 3 hours for about $100 using a 40-node, 320-core cluster rented from Amazon's EC2 utility computing service.
Diamund Diamund is a new, efficient algorithm for variant detection that compares DNA sequences directly to one another, without aligning them to the reference genome.
EDGE-pro EDGE-pro is a program for estimating gene expression from prokaryotic RNA-seq. EDGE-pro uses Bowtie2 for alignment but, unlike TopHat and Cufflinks, does not allow spliced alignments. It also handles overlapping genes, a common phenomenon in bacteria that is largely absent in eukaryotes.
Kraken Kraken is a very fast system for taxonomic classification of short or long DNA sequences from a microbiome or metagenomic sample. See the 2014 Genome Biology paper.
Centrifuge Centrifuge is a very rapid and memory-efficient system for the classification of DNA sequences from microbial samples, with better sensitivity than and comparable accuracy to other leading systems. Centrifuge requires a relatively small index (e.g., 4.3 GB for ~4,100 bacterial genomes).
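Several of the aligners above (Bowtie, HISAT, HISAT2) are described as resting on a Burrows-Wheeler transform with an FM index. As a rough illustration of that idea only, the Python sketch below builds a toy index and locates exact matches by backward search. It is not code from any of these tools: the names `build_fm_index` and `find` are made up for this example, the suffix array is built naively, and the compression, sampling, and mismatch handling that real aligners need are left out.

```python
def build_fm_index(text):
    """Build a toy FM index for `text` (hypothetical helper, illustration only)."""
    text += "$"  # sentinel; sorts before A, C, G, T
    # Naive suffix array: fine for a toy string, far too slow for a genome.
    sa = sorted(range(len(text)), key=lambda i: text[i:])
    # The BWT is the character preceding each sorted suffix.
    bwt = "".join(text[i - 1] for i in sa)
    # first[c]: number of characters in the text strictly smaller than c.
    first, total = {}, 0
    for c in sorted(set(text)):
        first[c] = total
        total += text.count(c)
    # occ[c][i]: number of occurrences of c in bwt[:i].
    occ = {c: [0] * (len(bwt) + 1) for c in first}
    for i, ch in enumerate(bwt):
        for c in occ:
            occ[c][i + 1] = occ[c][i] + (ch == c)
    return sa, first, occ


def find(pattern, index):
    """Return sorted start positions of exact matches of `pattern`."""
    sa, first, occ = index
    lo, hi = 0, len(sa)  # half-open range of suffix-array rows
    # Backward search: extend the match one character at a time, right to left.
    for c in reversed(pattern):
        if c not in first:
            return []
        lo = first[c] + occ[c][lo]
        hi = first[c] + occ[c][hi]
        if lo >= hi:
            return []
    return sorted(sa[lo:hi])


index = build_fm_index("GATTACAGATTACA")
print(find("ATTA", index))  # positions where ATTA occurs
```

Each step of the search touches only the two count tables, so the work per query grows with the length of the read rather than the length of the reference, which is what makes this style of index attractive for short reads.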

Computational Gene Finding

A system that uses interpolated Markov models to find genes in microbial DNA. Used to annotate hundreds (possibly thousands) of bacterial, archaeal, and viral genomes. Current version is 3.02.
A Generalized Hidden Markov Model gene-finder which makes use of the techniques implemented previously by GlimmerM.
A fast system for detecting splice sites in genomic DNA of various eukaryotes.
SIM4CC An accurate and efficient program to align cDNA sequences (mRNAs, ESTs) to genomic sequences, specifically designed for cross-species alignment.
sim4db / leaff Fast high-throughput spliced alignment (sim4, sim4cc) and sequence indexing.
A suite of programs for extracting, quantifying and comparing alternative splicing (AS) events from RNA-seq data.
A program that predicts gene models using the output from multiple sources of evidence, including other gene finders, Blast searches, and other alignment data.

Genome assembly and large-scale genome alignment

A system for aligning whole genomes, chromosomes, and other very long DNA sequences. MUMmer is also widely used for comparing genome assemblies. NOTE: MUMmer has been hosted at SourceForge since the early 2000s, but in 2016 we are moving it to GitHub, and a new version, MUMmer4, will appear soon.
High throughput sequence alignment using Graphics Processing Units (GPUs). Uses a technique called general-purpose GPU programming (GPGPU programming) to harness the extreme parallelism of GPUs for non-graphics tasks. In this application, hundreds of query sequences are simultaneously aligned to a reference sequence, creating an order of magnitude speed up over the same alignment on the CPU.
GAGE A realistic assessment of genome assembly software in a rapidly changing field of next-generation sequencing.
GAGE-B An evaluation of contiguity and accuracy of assemblies of bacterial organisms that are generated by some of the most commonly used genome assemblers. GAGE-B follows the standards set by GAGE.
MaSuRCA MaSuRCA is a whole-genome assembler developed originally at the University of Maryland by James Yorke, Aleksey Zimin, and their colleagues. Ongoing development is a joint effort between UMD and JHU, and new modules coming soon include methods to create hybrid assemblies using both Illumina and PacBio data.
AMOS Assembler project This is a set of tools, libraries, and freestanding genome assemblers, all open source. AMOS is an open consortium started at The Institute for Genomic Research (TIGR) that grew to include the University of Maryland, Johns Hopkins University, The Karolinska Institutet, the Marine Biological Laboratory, and others.
is a comparative genome assembler, which uses one genome as a reference on which to assemble another, closely related species. See the journal paper.
A small, lightweight assembler for small jobs such as assembling a viral genome, assembling a set of reads that match a single gene, or other tasks that don't require the complex infrastructure of a large-genome assembler.
A visual analytics tool for genome assembly analysis and validation, designed to aid in identifying and correcting assembly errors. All levels of the assembly data hierarchy are made accessible to users, along with summary statistics and common assembly metrics. A ranking component guides investigation towards likely mis-assemblies or interesting features to support the task at hand. Can be used to interactively analyze assemblies from many popular assemblers on your desktop computer. See the journal paper.
Quake A software package to detect and correct substitution sequencing errors in WGS data sets with deep coverage.
FLASH A fast, accurate program to increase the length of reads by overlapping and merging paired reads from fragments shorter than twice the length of reads. Primarily designed to merge Illumina paired reads.
Celera Assembler
A whole genome assembler originally developed at Celera Genomics for the assembly of the human genome. Celera Assembler is an open-source project at SourceForge. The code has been actively maintained since 2005 by researchers at CBCB and the Venter Institute (formerly known as TIGR, The Institute for Genomic Research).
Assembly Boosted By Amino acid sequence is a comparative gene assembler, which uses amino acid sequences from predicted proteins to help build a better assembly. See the journal paper. Link for installation and more information.
AutoEditor A tool for correcting sequencing and basecaller errors using sequence assembly and chromatogram data from Sanger (1st generation) reads. On average, AutoEditor corrects 80% of erroneous base calls, with an accuracy of 99.99%.
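The FLASH entry above describes merging paired reads whose fragment is shorter than twice the read length, so that the two mates overlap. A minimal Python sketch of that general idea follows. It is an assumption-laden illustration, not FLASH's actual algorithm or scoring: the function names, the `min_overlap` and `max_mismatch_ratio` parameters, and the rule of keeping the first read's bases in the overlap are all choices made for this example, and base qualities are ignored.

```python
_COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def revcomp(seq):
    """Reverse complement of a DNA string."""
    return seq.translate(_COMPLEMENT)[::-1]


def merge_pair(read1, read2, min_overlap=10, max_mismatch_ratio=0.25):
    """Merge a read pair into one longer sequence, or return None.

    Hypothetical helper: read2 is assumed to come from the opposite strand,
    and the fragment is assumed to be at least as long as each read.
    """
    mate = revcomp(read2)  # bring the second mate onto read1's strand
    best = None  # (mismatch ratio, overlap length)
    # Try every overlap length, longest first; ties keep the longer overlap.
    for olen in range(min(len(read1), len(mate)), min_overlap - 1, -1):
        tail, head = read1[-olen:], mate[:olen]
        mismatches = sum(a != b for a, b in zip(tail, head))
        ratio = mismatches / olen
        if ratio <= max_mismatch_ratio and (best is None or ratio < best[0]):
            best = (ratio, olen)
    if best is None:
        return None  # no acceptable overlap; leave the pair unmerged
    olen = best[1]
    # Keep read1 whole and append the part of the mate beyond the overlap.
    return read1 + mate[olen:]


fragment = "ACGTTGCAAGGCTTACCGATGGATCCATGCAAGT"  # 34 bases
read1 = fragment[:22]            # forward read from the left end
read2 = revcomp(fragment[-22:])  # reverse read from the right end
print(merge_pair(read1, read2) == fragment)
```

In this toy pair the two 22-base reads share a 10-base overlap, so the merged sequence spans the full 34-base fragment. A production tool would also use base qualities to resolve mismatches inside the overlap.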

Other sequence analysis tools

BRCA gene testing
A computational screening test that takes the raw DNA sequence data from a whole-genome sequence of an individual human and tests for each of 68 known mutations in the BRCA1 and BRCA2 genes.
Software that finds regions evolving at a slower or faster rate than the neutral evolution rate in any clade of a phylogeny of very closely related species.
Software that computes ancestral gene orders under the duplication-loss evolutionary model.
ELPH A motif finder based on Gibbs sampling that can find ribosome binding sites, exon splicing enhancers, or regulatory sites.
A software utility for filtering and trimming high-throughput next-gen reads.
GFF utilities
gffread: a program for filtering, converting and manipulating GFF files
gffcompare: a program for comparing, annotating, merging and tracking transcripts in GFF files
Insignia A comprehensive system for finding unique DNA sequences that can be used to identify any bacterial or virus species or strain. Currently has over 13,000 species and strains in its database.
Kraken A fast system for taxonomic classification of short or long metagenomic DNA sequences.
Centrifuge A very rapid and memory-efficient system for the classification of DNA sequences from microbial samples.
PhymmBL A one-stop system for taxonomically classifying metagenomic short reads.
Software and a database of operons covering a large number of prokaryotic genomes.  Described in M. Pertea et al., Nucl. Acids Res 37 (2009), D479-D482.
rddChecker A program for determining sites of RNA-DNA differences (RDDs) and candidate RNA editing sites from RNA-seq data.
RepeatFinder An older system for finding and characterizing repetitive sequences in complete and partial genomes.
Scimm A tool for unsupervised clustering of metagenomic sequences using interpolated Markov models.
SEE ESE An online tool for identifying exon splicing enhancers (ESEs) in Arabidopsis and Drosophila.
A highly accurate program that finds rho-independent transcription terminators in bacterial genomes. The site includes a database with pre-computed predictions for hundreds of species.

Variant Analysis Tools

CHASM and SNVBox Software to predict the functional significance of somatic missense mutations observed in the genomes of cancer cells, and a database of pre-computed features of all possible amino acid substitutions at every position of the annotated human exome.
CRAVAT Cancer-related analysis of variants toolkit. Web tool for functional predictions and annotations of both somatic and germline variants.
FAST An application for genome-wide studies by efficiently running several gene based analysis methods simultaneously on the same data set.
LS-SNP/PDB Web tool for structural annotations and visualizations of missense variants in dbSNP.
muPIT Web tool for interactive structural annotations and visualizations of non-synonymous variation/mutation on proteins.

Other web servers and databases

ARDB Antibiotic Resistance Genes Database (new in early 2009).
Web servers for displaying alignments and annotations of bacterial genomes. 
A collection of links (now very old) to external sequence analysis programs.