Software


If you're looking for the CHESS human gene database, it is at ccb.jhu.edu/chess

Next-generation sequence alignment software

Arioc
Arioc is a GPU-accelerated DNA short-read aligner for WGS and WGBS reads. With high throughput (~1.5 to 2 million reads per second with the human reference genome in a 4-GPU computer), it is well suited to large-scale NGS data processing.
Bowtie
An ultrafast, memory-efficient short read aligner that aligns short DNA sequences to the human genome at a rate of about 25 million reads per hour on a typical desktop computer. Bowtie indexes the genome with a Burrows-Wheeler index to keep its memory footprint small: 2.3 GB for the human genome. Bowtie and Bowtie2 were developed by Ben Langmead and are actively supported by his lab.
TopHat
A spliced alignment system for RNA-seq experiments. TopHat finds known and novel exon-exon splice junctions and is extremely fast due to its use of the Bowtie2 aligner. The last release, TopHat2, runs with either Bowtie1 or Bowtie2 and includes algorithms that significantly enhance TopHat's sensitivity, particularly in the presence of pseudogenes. TopHat2 includes TopHat-Fusion as an option.
TopHat-Fusion
TopHat-Fusion is an enhanced version of TopHat with the ability to align reads across fusion points, which result from the breakage and re-joining of two different chromosomes, or from rearrangements within a chromosome.
HISAT
HISAT is a highly efficient system for aligning RNA-seq reads. HISAT uses a novel indexing scheme, hierarchical indexing, which is inherently well-suited for aligning across introns. It employs two types of indexes for alignment: (1) a whole-genome FM index to anchor each alignment, and (2) numerous local FM indexes for very rapid extensions of these alignments. HISAT supports genomes of any size, including those larger than 4 billion bases.
HISAT2
HISAT2 is a new, rapid and accurate system for aligning NGS reads (both DNA and RNA) against a population of genomes. HISAT2 is a successor to both HISAT and TopHat2. HISAT2 extends the Burrows-Wheeler transform (BWT) and the Ferragina-Manzini (FM) index to incorporate genomic differences among individuals into the reference genome. HISAT2 is currently maintained at kim-lab.org.
HISAT-genotype
HISAT-genotype is a next-generation platform that enables rapid and accurate analysis of human genomes from next-generation sequencing data on a desktop within a few hours. The platform currently supports HLA typing, discovery of novel HLA alleles, DNA fingerprinting analysis, and other functionalities. All HISAT programs were developed by Daehwan Kim and are currently maintained at kim-lab.org.
Cufflinks
A transcript assembler and abundance estimator for RNA-seq data. Cufflinks assembles transcripts from the alignments produced by TopHat, including novel isoforms, and quantitates those transcripts. Cufflinks was originally developed by Cole Trapnell and is supported by his lab at the University of Washington.
StringTie
A fast and accurate transcript assembler and abundance estimator for RNA-seq data. Designed as a successor to Cufflinks, StringTie assembles transcripts from the alignments produced by TopHat2, HISAT, or other spliced aligners, and quantitates those transcripts.
TieBrush
A utility for efficiently merging redundant information from multiple alignment files, designed to enable rapid manipulation of extremely large datasets (RNA-seq, whole genome, exome, etc.). Data representations built with TieBrush and TieCov can be used for easier programmatic and visual analysis and comparison of groups within large sequencing datasets.
EASTR
EASTR is a tool for detecting spuriously spliced alignments and junctions in RNA-seq datasets and reference annotations. It improves the accuracy of downstream analyses, such as transcriptome assembly, by identifying and removing misaligned spliced alignments. The tool can process GTF, BED, and BAM files as input.
Ballgown
A program for computing differentially expressed genes in two or more RNA-seq experiments, using the output of StringTie or Cufflinks. The Ballgown package provides functions to organize, visualize, and analyze expression measurements. Ballgown is written in R and is part of Bioconductor.
CloudBurst
An older program for highly sensitive short read mapping using MapReduce. CloudBurst, developed by Michael Schatz (now a faculty member at JHU Computer Science), uses Hadoop to efficiently parallelize the short read mapping problem across dozens or hundreds of computers. This enables CloudBurst to execute highly sensitive read mappings with any number of mutations or indels.
Crossbow
Crossbow is an early scalable software pipeline for whole genome resequencing analysis. It combines Bowtie, an ultrafast and memory-efficient short read aligner, and SoapSNP, an accurate genotyper, within Hadoop to distribute and accelerate the computation across many nodes. In the Crossbow paper, we used it to analyze 35x coverage of a human genome in 3 hours for about $100 using a 40-node, 320-core cluster rented from Amazon's EC2 utility computing service.
Diamund
Diamund is an efficient algorithm for variant detection in family trios or pairs of closely related exome or whole-genome sequencing samples. It compares DNA sequences directly to one another, without aligning them to the reference genome.
EDGE-pro
EDGE-pro is a program for estimating gene expression from prokaryotic RNA-seq. EDGE-pro uses Bowtie2 for alignment but, unlike TopHat and Cufflinks, does not allow spliced alignments. It also handles overlapping genes, a common phenomenon in bacteria that is largely absent in eukaryotes.
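The Bowtie and HISAT entries above both rely on Burrows-Wheeler (FM) indexes. As a rough illustration of the idea, and not of how either tool is actually implemented, the Python sketch below builds a toy FM index for a short string and uses backward search to find every exact occurrence of a pattern. The function names `build_fm_index` and `find` are made up for this example. The sketch also keeps the full suffix array and occurrence table in memory, which a real aligner would sample or compress to keep its footprint small.

```python
def build_fm_index(text):
    """Build a toy FM index: suffix array, C table, and Occ table."""
    text += "$"  # sentinel; sorts before A, C, G, T
    # Suffix array by naive sorting (fine for short strings only).
    sa = sorted(range(len(text)), key=lambda i: text[i:])
    # BWT: the character preceding each sorted suffix.
    bwt = "".join(text[i - 1] for i in sa)
    alphabet = sorted(set(text))
    # c_table[ch]: number of characters in text strictly smaller than ch.
    c_table, total = {}, 0
    for ch in alphabet:
        c_table[ch] = total
        total += text.count(ch)
    # occ[ch][i]: number of occurrences of ch in bwt[:i].
    occ = {ch: [0] * (len(bwt) + 1) for ch in alphabet}
    for i, b in enumerate(bwt):
        for ch in alphabet:
            occ[ch][i + 1] = occ[ch][i] + (1 if b == ch else 0)
    return sa, c_table, occ


def find(pattern, index):
    """Return sorted start positions of exact matches of pattern."""
    sa, c_table, occ = index
    lo, hi = 0, len(sa)  # half-open range of suffix-array rows
    # Backward search: extend the match one character at a time,
    # starting from the last character of the pattern.
    for ch in reversed(pattern):
        if ch not in c_table:
            return []
        lo = c_table[ch] + occ[ch][lo]
        hi = c_table[ch] + occ[ch][hi]
        if lo >= hi:
            return []
    return sorted(sa[lo:hi])


index = build_fm_index("GATTACAGATTACA")
print(find("TTA", index))  # → [2, 9]
```

Each step of the backward search narrows a range of suffix-array rows, so the cost of a lookup depends on the pattern length rather than on the size of the indexed text.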

Metagenomics Classification, Abundance Estimation and Visualization

How to Choose a Metagenomics Classifier
Kraken, KrakenUniq, Kraken2, and Centrifuge are all metagenomic classifiers developed by researchers in the Center for Computational Biology. To help users choose the best tool for their project, we provide this linked page as an explanation and comparison between tools along with descriptions of each author and their roles in the software development.

Kraken Kraken is a very fast system for taxonomic classification of short or long DNA sequences from a microbiome or metagenomic sample. See the 2014 Genome Biology paper here. NOTE: KrakenUniq is a newer, more capable version of Kraken 1, and we strongly recommend that users upgrade to KrakenUniq or Kraken2.
KrakenUniq KrakenUniq is an update to Kraken 1 that runs as fast as Kraken and can work with the same databases, but additionally counts the number of unique k-mers using the stream sketching algorithm HyperLogLog. Using unique k-mers, the results can be filtered and ranked by the coverage of genomes in the database, instead of simple read counts. NEW! (May 2022): KrakenUniq has a new version developed by Christopher Pockrandt that can run on low-memory machines, including laptops, even with a huge database (hundreds of GB). It's also available for installation using bioconda, at https://anaconda.org/bioconda/krakenuniq.
Kraken2 Kraken2 is an improved version of Kraken, using the same classification algorithm but with improvements in speed and memory. Specifically, Kraken 2 has faster database build times, smaller database sizes, and faster classification speeds. Additional details are explained on the Kraken 2 webpage.
Centrifuge Centrifuge is a very rapid and memory-efficient system for the classification of DNA sequences from microbial samples, with better sensitivity than and comparable accuracy to other leading systems. Centrifuge requires a relatively small index (e.g., 4.3 GB for ~4,100 bacterial genomes).
Bracken Bracken is a statistical method that computes the abundance of species in DNA sequences from a metagenomics sample.
Pavian Pavian is a web application for exploring metagenomics classification results, with a special focus on infectious disease diagnosis.
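The KrakenUniq entry above mentions counting unique k-mers with the HyperLogLog sketching algorithm. To give a feel for how such a sketch estimates a distinct count in fixed memory, here is a minimal Python version. It is an illustration only and not KrakenUniq's code: the `HyperLogLog` class, the `kmers` helper, the choice of BLAKE2 as the hash, and the 4,096-register size are all assumptions made for this example.

```python
import hashlib
import math


class HyperLogLog:
    """Minimal HyperLogLog sketch with 2**p registers (illustrative)."""

    def __init__(self, p=12):
        self.p = p
        self.m = 1 << p
        self.registers = [0] * self.m

    def add(self, item):
        # 64-bit hash of the item (BLAKE2 is an arbitrary choice here).
        digest = hashlib.blake2b(item.encode(), digest_size=8).digest()
        h = int.from_bytes(digest, "big")
        idx = h >> (64 - self.p)                # top p bits select a register
        rest = h & ((1 << (64 - self.p)) - 1)   # remaining 64 - p bits
        # Rank: position of the leftmost 1-bit in the remaining bits.
        rank = (64 - self.p) - rest.bit_length() + 1
        if rank > self.registers[idx]:
            self.registers[idx] = rank

    def count(self):
        m = self.m
        alpha = 0.7213 / (1 + 1.079 / m)        # bias constant for m >= 128
        raw = alpha * m * m / sum(2.0 ** -r for r in self.registers)
        zeros = self.registers.count(0)
        if raw <= 2.5 * m and zeros:
            # Small-range correction: fall back to linear counting.
            return m * math.log(m / zeros)
        return raw


def kmers(seq, k):
    """Yield every k-mer of seq, left to right."""
    return (seq[i:i + k] for i in range(len(seq) - k + 1))


reads = ["ACGTACGTACGTTTGACCA", "ACGTACGTACGTTTGACCA", "TTGACCAGGATACCAGTAG"]
sketch = HyperLogLog()
for read in reads:
    for kmer in kmers(read, k=5):
        sketch.add(kmer)
print(round(sketch.count()))  # approximate number of distinct 5-mers
```

Because a duplicated read contributes the same k-mers again, it leaves the registers unchanged. That is why a distinct k-mer count can separate a genome that is genuinely covered from one that attracts many copies of the same few reads.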

Computational Gene Finding

Glimmer
A system that uses interpolated Markov models to find genes in microbial DNA. Used to annotate hundreds (possibly thousands) of bacterial, archaeal, and viral genomes. Current version is 3.02.
GlimmerHMM
A Generalized Hidden Markov Model gene-finder which makes use of the techniques implemented previously by GlimmerM.
Glimmer-MG
Glimmer-MG is an older system for finding genes in metagenomic shotgun DNA sequences, using the Glimmer algorithm plus the SCIMM system for clustering metagenomics data, and the now-outdated Phymm system for phylogenetic labeling.
GeneSplicer
A fast system for detecting splice sites in genomic DNA of various eukaryotes.
SIM4CC An accurate and efficient program to align cDNA sequences (mRNAs, ESTs) to genomic sequences, specifically designed for cross-species alignment.
sim4db / leaff Fast high-throughput spliced alignment (sim4, sim4cc) and sequence indexing.
A suite of programs for extracting, quantifying and comparing alternative splicing (AS) events from RNA-seq data.
A program that predicts gene models using the output from multiple sources of evidence, including other gene finders, Blast searches, and other alignment data.

Genome assembly and large-scale genome alignment

MUMmer
A system for aligning whole genomes, chromosomes, and other very long DNA sequences. MUMmer is also widely used for comparing genome assemblies. NOTE: MUMmer was hosted at SourceForge from the early 2000s and moved to GitHub with the release of MUMmer4 in 2017.
MUMmerGPU
An early attempt to use GPUs for alignment, MUMmerGPU uses a technique called general-purpose GPU programming (GPGPU programming) to harness the extreme parallelism of GPUs for non-graphics tasks.
GAGE A realistic assessment of genome assembly software in a rapidly changing field of next-generation sequencing.
GAGE-B An evaluation of the contiguity and accuracy of bacterial genome assemblies generated by some of the most commonly used genome assemblers. GAGE-B follows the standards set by GAGE.
MaSuRCA MaSuRCA is a whole-genome assembler developed originally at the University of Maryland by James Yorke, Aleksey Zimin, and their colleagues. Ongoing development is a joint effort between JHU and UMD, with recent modules designed to create hybrid assemblies using both short reads (Illumina) and long reads (PacBio/Oxford Nanopore).
AMOS Assembler project This is a set of tools, libraries, and freestanding genome assemblers, all open source. AMOS is an open consortium started at The Institute for Genomic Research (TIGR) that grew to include the University of Maryland, Johns Hopkins University, The Karolinska Institutet, the Marine Biological Laboratory, and others.
AMOScmp
A comparative genome assembler that uses one genome as a reference on which to assemble another, closely related species. See the journal paper here.
MINIMUS
A small, lightweight assembler for small jobs such as assembling a viral genome, assembling a set of reads that match a single gene, or other tasks that don't require the complex infrastructure of a large-genome assembler.
Hawkeye
A visual analytics tool for genome assembly analysis and validation, designed to aid in identifying and correcting assembly errors. All levels of the assembly data hierarchy are made accessible to users, along with summary statistics and common assembly metrics. A ranking component guides investigation towards likely mis-assemblies or interesting features to support the task at hand. Can be used to interactively analyze assemblies from many popular assemblers on your desktop computer. See the journal paper here.
Quake A software package to detect and correct substitution sequencing errors in WGS data sets with deep coverage.
FLASH A fast, accurate program to increase the length of reads by overlapping and merging paired reads from fragments shorter than twice the length of reads. Primarily designed to merge Illumina paired reads.
Celera Assembler
A whole-genome assembler originally developed at Celera Genomics for the assembly of the human genome. Celera Assembler is an open-source project at SourceForge. The code has been actively maintained since 2005 by researchers at CBCB and the Venter Institute (formerly known as TIGR, The Institute for Genomic Research).
ABBA
Assembly Boosted By Amino acid sequence (ABBA) is a comparative gene assembler that uses amino acid sequences from predicted proteins to help build a better assembly. See the journal paper. Link for installation and more information.
AutoEditor A tool for correcting sequencing and basecaller errors using sequence assembly and chromatogram data from Sanger (1st generation) reads. On average, AutoEditor corrects 80% of erroneous base calls, with an accuracy of 99.99%.
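The FLASH entry above describes merging paired reads whose fragment is shorter than twice the read length, so that the two mates overlap. The Python sketch below shows the basic idea under simplifying assumptions and is not FLASH's actual algorithm: it reverse-complements the second mate, tries every overlap length, keeps the overlap with the lowest mismatch ratio, and joins the reads. The function names and the `min_overlap` and `max_mismatch_ratio` values are illustrative. The sketch also ignores base qualities, so wherever the mates disagree it simply keeps the base from the first read.

```python
COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(COMPLEMENT)[::-1]


def merge_pair(read1, read2, min_overlap=10, max_mismatch_ratio=0.25):
    """Merge a read pair into one sequence, or return None if no overlap fits.

    read2 is given as sequenced (opposite strand), so it is first
    reverse-complemented onto the same strand as read1.
    """
    mate = reverse_complement(read2)
    best = None  # (mismatch_ratio, -overlap_length); smaller is better
    for olen in range(min_overlap, min(len(read1), len(mate)) + 1):
        tail, head = read1[-olen:], mate[:olen]
        mismatches = sum(a != b for a, b in zip(tail, head))
        ratio = mismatches / olen
        if ratio <= max_mismatch_ratio:
            candidate = (ratio, -olen)  # ties go to the longer overlap
            if best is None or candidate < best:
                best = candidate
    if best is None:
        return None
    olen = -best[1]
    # Keep read1 in full and append the part of the mate beyond the overlap.
    return read1 + mate[olen:]


fragment = "ACGTTGCAAGGCTTAACCGGATCGATCCGTAGCTAGGCTA"  # 40 bases
read1 = fragment[:25]                        # forward read, 25 bases
read2 = reverse_complement(fragment[-25:])   # reverse read, 25 bases
print(merge_pair(read1, read2) == fragment)  # → True
```

In this example the two 25-base reads share a 10-base overlap, so the merged sequence recovers the full 40-base fragment.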

Other sequence analysis tools

BRCA gene testing
A computational screening test that takes the raw DNA sequence data from a whole-genome sequence of an individual human and tests for each of 68 known mutations in the BRCA1 and BRCA2 genes.
DivE
Software for finding regions that evolve at a slower or faster rate than the neutral evolution rate in any clade of a phylogeny of a set of very closely related species.
DupLoCut
Software that computes ancestral gene orders under the duplication-loss evolutionary model.
ELPH A motif finder based on Gibbs sampling that can find ribosome binding sites, exon splicing enhancers, or regulatory sites.
fqtrim
A software utility for filtering and trimming high-throughput next-gen reads.
GFF utilities
gffread: a program for filtering, converting and manipulating GFF files
gffcompare: a program for comparing, annotating, merging and tracking transcripts in GFF files
Insignia A comprehensive system for finding unique DNA sequences that can be used to identify any bacterial or viral species or strain. Currently has over 13,000 species and strains in its database.
Kraken A fast system for taxonomic classification of short or long metagenomic DNA sequences.
Centrifuge A very rapid and memory-efficient system for the classification of DNA sequences from microbial samples.
PhymmBL A one-stop system for taxonomically classifying metagenomic short reads.
Software and a database of operons covering a large number of prokaryotic genomes.  Described in M. Pertea et al., Nucl. Acids Res 37 (2009), D479-D482.
rddChecker A program for determining sites of RNA-DNA differences (RDDs) and candidate RNA editing sites from RNA-seq data.
RepeatFinder An older system for finding and characterizing repetitive sequences in complete and partial genomes.
Scimm A tool for unsupervised clustering of metagenomic sequences using interpolated Markov models.
SEE ESE An online tool for identifying exon splicing enhancers (ESEs) in Arabidopsis and Drosophila.
A highly accurate program that finds rho-independent transcription terminators in bacterial genomes. The site includes a database with pre-computed predictions for hundreds of species.

Variant Analysis Tools

CHASM and SNVBox Software to predict the functional significance of somatic missense mutations observed in the genomes of cancer cells, and a database of pre-computed features of all possible amino acid substitutions at every position of the annotated human exome.
CRAVAT Cancer-related analysis of variants toolkit. Web tool for functional predictions and annotations of both somatic and germline variants.
FAST An application for genome-wide studies by efficiently running several gene based analysis methods simultaneously on the same data set.
LS-SNP/PDB Web tool for structural annotations and visualizations of missense variants in dbSNP.
muPIT Web tool for interactive structural annotations and visualizations of non-synonymous variation/mutation on proteins.

Other web servers and databases

CHESS A new catalog of human genes based on nearly 10,000 RNA sequencing experiments. For a full description of CHESS, see the paper in Genome Biology, here.
T2T-CHM13 Annotation RefSeq annotation of the CHM13 genome created using the Liftoff program
ARDB Antibiotic Resistance Genes Database, new in early 2009.
Web servers for displaying alignments and annotations of bacterial genomes. 
A collection of links (now very old) to external sequence analysis programs.