PhymmBL

Note on support

Until further notice, PhymmBL is provided as-is. I will continue to develop the program (including incorporating useful suggestions and fixes sent in by users) as I find time, but my current job doesn't provide a whole lot of leeway for that, so it'll be sporadic and drawn-out.

Also until further notice, I'm suspending individual user support. I'm humbled and gratified that people continue to use the program, but I'm not doing anyone any favors by making them wait 3-4 months for an answer (roughly my current turnaround time). If anyone would like to start a support forum for the software -- it's fully open-source, so the more experienced programmers out there can probably help stuck users at least as well as I can -- I thoroughly endorse the idea. I ask only that you tell me about any such forums as they go live, so that I can point future support requests there (and so I can make sure there aren't redundant forums popping up in different places).

My apologies to any users who've emailed me since the last time I had some extra time to devote to development and support (around September 2012). Your requests are still in my queue, but I can't predict when I'll be able to attend to them.

-Arthur
2013.03.13

About PhymmBL

Metagenomics sequencing projects collect samples of DNA from uncharacterized environments that may contain hundreds or even thousands of species. One of the main challenges in analyzing a metagenome is phylogenetic classification of raw sequence reads into groups representing the same or similar species. Such classification is a useful prerequisite for genome assembly and for analysis of the biological diversity present in a sample. The newest sequencing technologies have simultaneously made metagenomics easier, by making the sequencing process faster, and more difficult, by producing shorter read lengths than previous technologies. Methods for classifying sequences as short as 100 base pairs (bp) have until now been relatively inaccurate, requiring metagenomics projects to use older, long-read technologies. Phymm, a new classification approach for metagenomics data which uses interpolated Markov models (IMMs) to taxonomically classify DNA sequences, can accurately classify reads as short as 100 bp. Its accuracy for short reads represents a significant leap forward over previous composition-based classification methods. PhymmBL (rhymes with "thimble"), the hybrid classifier included in this distribution which combines analysis from both Phymm and BLAST, produces even higher accuracy.

Version History

PhymmBL v4.0 is the current stable release.

> New in v4.0 [2012.08.31.1640]:

A fleet of across-the-board stability upgrades and minor code changes.
Enabled detailed logging of all levels of operation for easier progress tracking & troubleshooting.
Support has been enabled for the BLAST+ applications, making the BLAST portion of PhymmBL's processing pipeline significantly faster.
A file descriptor/redirection complaint specific to Ubuntu has been identified and fixed. Thanks to David Kelley for first pointing it out.
The setup process has been made substantially more robust against errors and inconsistencies in RefSeq's GenBank-encoded metadata and NCBI's taxonomic trees.
A rare bug in the BLAST database-rebuilding components of several scripts has been identified and eliminated.

> New in v3.2 [2011.02.23.1546]:

Custom genome data can now be added in batches instead of having to add one organism at a time. See the README for details and instructions on using the new batch mode.

> New in v3.1 [2010.10.18.1651]:

Reconfigured raw Phymm output format to deliver a huge reduction in file size.
Fixed a rare potential infinite loop in rebuildBlastDB.pl.

> New in v3.01 [2010.09.17.1300]:

Fixed a minor bug in addCustomGenome.pl that occasionally resulted in the loss of taxonomic metadata for new organisms.

> New in v3.0 [2010.06.25.1425]:

Confidence scores are now listed in the PhymmBL results files, translating raw scores into usable estimates of predictive accuracy. Please see the README for an important discussion of how to interpret and work with these scores.
Date stamps are now given in each phase of PhymmBL's terminal output to let users know how long each phase of analysis has taken.
ICM IDs are now listed in the raw Phymm output to allow for disambiguation between ICM scores assigned by different ICMs within the same species.

> New in v2.03 [2010.06.11.1327]:

Semicolons in species/strain names are now handled properly with respect to local database directory structure.
The database of known GenBank taxonomic-labeling inconsistencies has been updated.
A workaround has been added for kernels that complain when the 'cat' command is passed too many arguments, which can affect the construction of the local BLAST database. (If you didn't see an error during setup, you don't have to worry about this.)
A section has been added to the README with suggestions on incorporating mate-pair information into your classification run.

> New in v2.02 [2010.06.07.1246]:

A bug in addCustomGenome.pl preventing full assimilation of new genomes has been corrected. If you have any of the 2.x versions, and you attempted to add your own genomic data, check your PHYMM_DIR/.genomeData/.userAdded/ADDED_ORGANISM/ directory; if it doesn't contain any .icm files, please redownload the Phymm installer and add your custom genomes again. You will not need to regenerate the core RefSeq libraries or alter the main genomic database in any way.
A bug in the new-copy RefSeq download subroutine has been fixed. If you installed one of the 2.x versions for the first time and no genome data appeared, this version will correct the problem. Timeouts for RefSeq downloads have been extended, and the interface for addCustomGenome.pl has been tweaked to make the taxonomic data entry a little clearer.
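
The directory check suggested in the first v2.02 note can be done by hand, or with a small helper like the Python sketch below. This is not part of the PhymmBL distribution. The function name is made up for this example, and the sketch assumes that each user-added organism has its own subdirectory under PHYMM_DIR/.genomeData/.userAdded/ with its .icm files stored directly inside it.

```python
from pathlib import Path


def organisms_missing_icms(phymm_dir):
    """Return the names of user-added organism directories with no .icm files.

    Hypothetical helper: assumes the layout
    PHYMM_DIR/.genomeData/.userAdded/ADDED_ORGANISM/*.icm
    """
    user_added = Path(phymm_dir) / ".genomeData" / ".userAdded"
    if not user_added.is_dir():
        # Nothing has been added yet (or the path is wrong); nothing to report.
        return []
    return sorted(
        entry.name
        for entry in user_added.iterdir()
        if entry.is_dir() and not any(entry.glob("*.icm"))
    )


if __name__ == "__main__":
    # Replace "." with the directory that holds your PhymmBL installation.
    for name in organisms_missing_icms("."):
        print("no .icm files found for:", name)
```

Any organism the helper lists is a candidate for being added again after reinstalling, as described in the note above.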

> New in v2.01 [2010.05.27.1634]:

The README has been substantially expanded to include instructions on parallelization, notes on interpreting PhymmBL's numeric scores, and several other minor changes. Program code has not been changed. Thanks to Liam Elbourne for helpful discussions.

> New in v2.0 [2010.05.25.1335]:

A script has been added allowing users to add their own custom genomic sequence data to the local database. The script takes new sequence data (as FASTA/multiFASTA files), adds them to the BLAST database, and creates IMMs to model them. The user is polled to provide taxonomic data for each new organism.
The setup script has been completely rewritten; users can now choose whether to download a completely new copy of the RefSeq microbial database, or to update the existing local database with only RefSeq sequences which have been added or have changed since the last install. (Genome content, taxonomic information and model files for user-added organisms are stored separately from the RefSeq data, so updates won't affect any custom content.)
A script has been added to manually regenerate the local BLAST database.
PhymmBL's combined scoring function was tweaked for the case of BLAST's E-value being reported as "0.0", resulting in slightly better overall accuracy in this case.
A collection of minor bugs and irritations has been fixed.
Mac OS is now formally supported, but please see the README for a note on obtaining wget, which isn't provided with the OS X suite of developer tools and is needed for setup to run properly.

Accuracy

Because metagenomic samples frequently contain species that have never before been sequenced, we examined the performance of this system using progressively less data from organisms related to those from which query reads were sampled. The table below summarizes predictive accuracy results from PhymmBL, the hybrid method incorporating information from both Phymm and BLAST.

For instance, the cell indexed by "Family excluded" and "Phylum" covers the case in which, for each query read in our test set, all organisms belonging to the same family as the organism from which that read was sampled were excluded from consideration, so that the best possible prediction is one made at the order level. In that case, PhymmBL was able to predict the correct phylum of query reads 57.5% of the time, with a standard deviation (measured over 10 runs) of ± 0.6%.

Note that accuracy, as reported in the table below, is measured as the percentage of all 100-bp query reads in the test data that received a correct label; no reads are left unlabeled.
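
As a rough illustration of the accuracy measure described above, the Python sketch below computes the percentage of reads whose predicted clade matches the true clade at a chosen rank, then summarizes repeated runs as a mean and standard deviation. It is not code from the PhymmBL distribution. The function names, the dictionary-based lineage format and the toy data are assumptions made for this example, as is the use of the sample standard deviation. Clade exclusion happens when the classifier is run and is not modeled here.

```python
from statistics import mean, stdev


def rank_accuracy(true_lineages, predicted_lineages, rank):
    """Percentage of reads whose predicted clade at `rank` matches the truth.

    Each lineage is a dict mapping a rank name to a clade name. Every read
    counts toward the denominator, so no read is treated as unlabeled.
    """
    if len(true_lineages) != len(predicted_lineages):
        raise ValueError("need exactly one prediction per read")
    correct = sum(
        1
        for truth, guess in zip(true_lineages, predicted_lineages)
        if guess.get(rank) == truth[rank]
    )
    return 100.0 * correct / len(true_lineages)


def summarize_runs(per_run_accuracies):
    """Mean and sample standard deviation over repeated runs (an assumption)."""
    return mean(per_run_accuracies), stdev(per_run_accuracies)


# Toy data: four reads with made-up clade names.
truth = [
    {"order": "O1", "phylum": "P1"},
    {"order": "O2", "phylum": "P1"},
    {"order": "O3", "phylum": "P2"},
    {"order": "O4", "phylum": "P2"},
]
guess = [
    {"order": "O9", "phylum": "P1"},
    {"order": "O2", "phylum": "P1"},
    {"order": "O8", "phylum": "P1"},
    {"order": "O4", "phylum": "P2"},
]

print(rank_accuracy(truth, guess, "phylum"))  # 75.0
print(rank_accuracy(truth, guess, "order"))   # 50.0
```

In the toy data, three of four reads receive the correct phylum and two of four the correct order, which shows why accuracy at coarser ranks is never lower than at finer ones when the predicted lineages are internally consistent.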

Please see the paper for details on these and other experiments. All synthetic test data used for the experiments described in the paper (10 sets of 100-bp reads, plus one set each containing reads of 200, 400, 800 and 1000 bp) can be downloaded here.

Exclusion level	Species	Genus	Family	Order	Class	Phylum
All matches allowed	95.4 ± 0.2	99.1 ± 0.1	99.7 ± 0.1	99.8 ± 0.1	99.9 ± 0.1	99.9 ± 0.0
Species excluded	---	58.5 ± 0.6	63.7 ± 0.6	66.3 ± 0.6	71.0 ± 0.5	76.8 ± 0.8
Genus excluded	---	---	26.9 ± 0.6	33.0 ± 0.6	44.6 ± 0.6	63.4 ± 0.6
Family excluded	---	---	---	19.3 ± 0.5	33.4 ± 0.5	57.5 ± 0.6
Order excluded	---	---	---	---	23.8 ± 0.5	53.2 ± 0.6
Class excluded	---	---	---	---	---	43.5 ± 0.7

PhymmBL percent prediction accuracy and standard deviations for classification experiments with 100-bp reads and different clade levels excluded from comparison.

Obtaining the Software

This software is OSI Certified Open Source Software.

Click to download the PhymmBL installation software either as a gzipped tarball or as a .zip file.

After downloading, move the downloaded file into a directory in which you intend to store the PhymmBL program files and downloaded genomic data, then uncompress it by typing

tar zxvf phymmbl_installer.tar.gz

PhymmBL's subdirectory structure will be created in your target directory, as will the installer script and a README file with instructions on building and using the system.

The software was developed and tested on a multi-core Linux system; it is expected to work properly on any Unix-like system which meets its system requirements (see the README for details, including an extra step Mac OS users will need to take).

PLEASE NOTE: Setup is computationally intensive. Even on a relatively powerful server, you should expect ground-up installation to take at least 24 hours.

References

A. Brady and S. L. Salzberg: PhymmBL expanded: confidence scores, custom databases, parallelization and more. Nature Methods Vol. 8, No. 5, p. 367 (May 2011)

A. Brady and S. L. Salzberg: Phymm and PhymmBL: Phylogenetic Classification of Metagenomic Data with Interpolated Markov Models. Nature Methods Vol. 6, No. 9, pp. 673-676 (September 2009)

Funding

This work is supported in part by NIH grant R01-LM006845 to S.L. Salzberg.