SYSTEM REQUIREMENTS:

- You'll need a local copy of the standalone BLAST software; this can be
  downloaded for your local platform from

  ftp://ftp.ncbi.nih.gov/blast/executables/release/

  PhymmBL only formally supports versions up to 2.2.22; the newer BLAST+
  applications are not supported.

- A minimum of 120GB of disk space should be available for the default
  installation; 200GB is recommended, especially if you're adding your own
  custom sequence data to the local database (see below).

- 4GB of free memory is preferred; 1GB should suffice; 500MB or less will
  likely cause problems.

- You'll need to have a relatively modern version of Perl installed,
  including the Net::FTP package.

- Your system will need to have the (free) UNIX utility "wget" installed.
  This utility is included in many Linux/UNIX distributions, but please note
  that it isn't included with Mac OS (even in the developer toolkit); the
  website for the software is

  http://www.gnu.org/software/wget/

INSTALLATION:

To install, run "phymmSetup.pl" from this directory and follow the prompts.
The installer will download all the current bacterial & archaeal genomic
data from RefSeq, set up needed metadata, and build the IMMs that comprise
the core of the Phymm system. It'll also build a local BLAST database
representing all the genomic data from RefSeq.

ADDING YOUR OWN GENOMIC DATA:

To add your own genomic data (input as sequences in FastA/multi-FastA
format) to Phymm's local database, run "customGenomicData.pl" from the
installation directory. This script has two modes: single-organism mode and
batch mode.

Single-organism mode:

As the label implies, you can only add one organism at a time using this
mode. Note that you *can* add multiple sequences at once, as long as they're
all from the same organism. To use this mode, just run
"customGenomicData.pl" with no arguments.

Batch mode:

This mode allows you to add multiple organisms at once.
First, create a configuration file that will tell the script the locations
of FastA-formatted sequence files to be used as input, as well as taxonomic
metadata for the organisms to be added. The configuration file format
requires one line per file location*, with nine fields per line, separated
by tab characters.

*"File location" can refer either to a single source file OR to a directory
containing multiple FastA files all belonging to the same organism. If you
specify a directory, all files in that directory will be incorporated into
the models built for the current organism.

An example configuration file is provided at

http://www.cbcb.umd.edu/software/phymm/exampleConfig.txt

The nine fields on each line are:

1. File location. This can be either an absolute path or a path relative to
   your PhymmBL installation directory.

2. "Combine files" flag: the value of this field is either 0 or 1. If 0, a
   separate IMM will be built for each input file, leading (assuming you
   have multiple input files) to multiple models for your organism, each
   representing a different portion of your organism's genome. If 1, one
   single IMM will be built from all of the specified sequence data for your
   organism.

   Anecdotally, we can advise you that option 0 occasionally leads to
   slightly better accuracy, especially in the case of large organisms like
   eukaryotes. On the other hand, more models increase processing time,
   which will slow down your analysis runs. In a synthetic classification
   experiment, using a set of about 250 models of different segments of the
   human genome instead of a single model of the entire human genome
   improved accuracy by less than 2%.

   Bottom line: we recommend using a value of 1 in this field unless you
   have a truly compelling reason to do otherwise.

3-9. Phylum, class, order, family, genus, species and strain labels,
   respectively, for your organism.
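As a minimal sketch of the format described above, the following Python
function checks a single configuration line before you hand the file to the
script. It is not part of PhymmBL: the function name, the field-name list and
the error messages are hypothetical, and the checks cover only what this
README states (nine tab-separated fields, a 0/1 flag in the second field, no
blank fields).

```python
# Hypothetical helper, not part of PhymmBL: checks one line of a batch-mode
# configuration file against the format described in this README.
FIELD_NAMES = [
    "file_location", "combine_files",
    "phylum", "class", "order", "family", "genus", "species", "strain",
]


def check_config_line(line):
    """Return a list of problems found in one configuration line."""
    fields = line.rstrip("\r\n").split("\t")
    problems = []

    # The format requires exactly nine tab-separated fields per line.
    if len(fields) != len(FIELD_NAMES):
        problems.append(
            "expected 9 tab-separated fields, found %d" % len(fields))
        return problems

    # No field may be left blank; NO_VALUE is the placeholder this README
    # prescribes for a clade level that has no entry in the taxonomy.
    for name, value in zip(FIELD_NAMES, fields):
        if value.strip() == "":
            problems.append("field '%s' is blank" % name)

    # The "combine files" flag must be 0 or 1.
    if fields[1] not in ("0", "1"):
        problems.append(
            "combine-files flag must be 0 or 1, found '%s'" % fields[1])

    return problems
```

An empty list means the line passed these checks; it does not guarantee that
the file location exists or that the taxonomic labels are correct.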
IMPORTANT: If there's no entry in the taxonomy for your organism at a
certain clade level, enter NO_VALUE in the field corresponding to that
level. DO NOT LEAVE ANY FIELDS BLANK. Cf. the example config file (linked
above), in which a few field entries have been (artificially) replaced with
NO_VALUE entries.

ALSO IMPORTANT: The "species" field is meant to contain ONLY the
species-level label for your organism, e.g. "coli" for E. coli or "sapiens"
for Homo sapiens. If you enter the entire (genus+species) binomial name for
your organism in this field, then classification will be performed in
exactly the same way as it otherwise would be, but your result files will
look strangely redundant, because the software automatically reconstructs
binomial names from the values listed in the individual fields, and you
(presumably) don't want to be modeling "Salmonella Salmonella typhimurium"
or "Escherichia Escherichia coli."

UPDATING YOUR DATABASE WITH NEW REFSEQ CONTENT:

Re-run "phymmSetup.pl" and follow the prompts; if you so choose, you'll be
able to update your local genomic database with only those RefSeq sequences
that have been added or changed since your last update. You can also
completely overwrite your local database with the current copy of the RefSeq
collection, if you like. Note that this won't affect any custom user-added
content.

MANUALLY REBUILDING THE BLAST DATABASE:

Run "rebuildBlastDB.pl" from this directory. This can be useful if, for
example, you've added multiple organisms with multiple runs of
"addCustomGenome.pl", declining each time to automatically regenerate the
local BLAST database, in favor of doing it just once when you've finished
adding the entire batch of new organisms.

TO CLASSIFY READS:

With your set of query reads saved into a multi-FastA file (make sure each
read has a *unique* identifier immediately following the '>' on its comment
line), run "scoreReads.pl" with that file as its argument.
Result files will be generated for Phymm, BLAST (using the local database
constructed during installation), and PhymmBL (a weighted combination of the
two designed to maximize accuracy).

TAKING ADVANTAGE OF MATE PAIRS:

Providing more sequence information to Phymm will result in more accurate
classification; if you have mate-pair data, we recommend concatenating each
pair, with a string of 5 or 10 'N's linking the two.

PARALLELIZATION:

The Phymm portion of the PhymmBL package is currently a single-processor
program. If you have multiple processors available and wish to parallelize
your classification runs, scoreReads.pl is designed to allow multiple copies
to run in parallel; you can split your input data (FastA sequences) into as
many files as your system can comfortably handle, then run scoreReads.pl
separately, in parallel, on each one. This will substantially improve
processing time.

Example scenario:

Your sequence data is in a file called "queryReads.fasta". You have six
processors available on your system, and you want to use, say, four of them
for classification. Break up "queryReads.fasta" into four roughly
equally-sized multi-FastA files, say "queryReads_1.fasta",
"queryReads_2.fasta", etc., one for each processor to be used. Then run
scoreReads.pl separately on each of the four input chunks, e.g.:

    nohup scoreReads.pl queryReads_1.fasta &
    nohup scoreReads.pl queryReads_2.fasta &
    nohup scoreReads.pl queryReads_3.fasta &
    nohup scoreReads.pl queryReads_4.fasta &

The 'nohup' keyword at the beginning of each line directs the system not to
kill PhymmBL if you log out or you lose your terminal connection before the
process is finished; this is recommended, wherever possible, when launching
long-running processes like PhymmBL, since you *really* don't want to have
to start over if something ends up going wrong with your connection that
wouldn't otherwise have affected the outcome.
The '&' character at the end will send each process to the background, so
you can regain control of your terminal before the process is finished,
which is necessary if you don't want to open four separate terminals. The
only drawback to this, if you log out after starting the process, is that
you'll have to manually check to see whether or not scoreReads.pl has
finished; it's done when the PhymmBL output has appeared, which is contained
in files named "results.03.combined_[INPUT_FILE_NAME].txt". (If you do
remain logged in after launching PhymmBL, your terminal will send you a
message when each chunk has been completed.)

CONFIDENCE SCORES:

PhymmBL output ("results.03.*") now contains confidence scores along with
taxonomic predictions. One score, a real number between 0 and 1, is given
for each clade level prediction (genus and above), under the headers
"GENUS_CONF," "FAMILY_CONF," etc.

CRITICAL NOTE: the scores are NOT p-values, and should not be directly
interpreted as such. What they *do* provide is a bounded, linearly-scaled
confidence estimate: a prediction with a confidence score of 0.4 is roughly
half as confident as one with a confidence score of 0.8.

The scores are the values of polynomial functions which have been fitted to
experimental accuracy results, with one fitted function for each clade
level. If we label the following statements concerning a single query read
as shown:

A: The query read length is between 100 bp and 1000 bp, inclusive.

B: The exact /species/ from which the read has been sampled is *not* in the
   PhymmBL database.

C: Some other member of the /genus/ of the organism from which the read has
   been sampled *is* in the PhymmBL database.

D: The taxonomic label assigned to the read by PhymmBL is correct.

Then the confidence score associated with each clade-level prediction for
the query read represents a *rough* fitted estimate of:

    P(D given A, B and C).
The observed standard deviation of the error between the fitted estimates
and actual observed experimental accuracy results ranged between 12-15% (for
100-bp reads) and 1-6% (for 1000-bp reads). Standard deviation of error is
fairly consistent across clade levels, and depends mostly on read length.

Because the estimates were computed for the precise situation described by
statements B and C above, please note that actual p-values may differ from
the estimates by more than the listed standard deviations, depending on
what's in the database. For example, if the source species for a read *is*
in the PhymmBL database, confidence values may be overestimating actual
p-values; likewise, if the entire genus of the source species is missing
from the PhymmBL database (a far more common case in metagenomics analysis),
confidence values may be /underestimating/ actual p-values.

Conclusion: if you use confidence scores as a predictive cutoff, don't set
your cutoff too high, because more often than not, the confidence values
will be conservatively lowballing actual p-values.

A more organic way to set cutoffs is as follows:

1. Run your read set through PhymmBL.

2. Pick a mid-range clade level like family.

3. Sort the predicted taxonomic labels at your chosen level by abundance.

4. Pick a small percentage cutoff (say 0.75%), and treat all predictions
   whose abundance is higher than your chosen cutoff as confident, and the
   rest as suspect.

If you want to be really careful about cutoffs, you might try first
performing the abundance cutoff described above, followed by comparing the
confidence scores associated with the "confident set" and the "suspect set."
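The abundance cutoff in steps 3 and 4, and the comparison of the two sets,
can be sketched in Python as follows. This is an illustration rather than
part of PhymmBL: the function names are hypothetical, and the input is
assumed to be a list of (label, confidence) pairs that you have already
extracted for one clade level (for example, the family prediction and its
"FAMILY_CONF" value), since the layout of the result files isn't described
here.

```python
from collections import Counter


def split_by_abundance(predictions, cutoff_fraction=0.0075):
    """Split (label, confidence) pairs into confident and suspect sets.

    A prediction is treated as confident when its label accounts for more
    than cutoff_fraction of all predictions (0.0075 corresponds to 0.75%).
    """
    counts = Counter(label for label, _ in predictions)
    total = len(predictions)
    confident, suspect = [], []
    for label, conf in predictions:
        if counts[label] / total > cutoff_fraction:
            confident.append((label, conf))
        else:
            suspect.append((label, conf))
    return confident, suspect


def score_range(pairs):
    """Return (lowest, highest) confidence score in a set, or None if empty."""
    scores = [conf for _, conf in pairs]
    return (min(scores), max(scores)) if scores else None
```

Calling score_range() on each of the two returned lists gives the two score
ranges to compare.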
If the confidence scores separate well into two ranges, with those of the
"confident set" higher than those of the "suspect set," then you now have a
numeric confidence-score threshold which can then be used, for example, to
pull in high-confidence predictions even from clades with very low predicted
abundance, ultimately enhancing your final "confident set" results.

PERFORMANCE NOTES:

Please note that the IMM-construction portion of installation is
particularly computationally intensive; full installation (including IMM
construction) took about 24 hours on an Altus 3400 server with a 4x Opteron
850 processor, 32GB of memory, and no significant competing processes
running during that time.

Breakdown of our test installation, on the server described above, measured
in wall-clock time:

    BEGIN                                                   00h 00m 00s
    RefSeq download complete                                00h 15m 43s
    .genomeData/ directory populated with downloaded
      data, organism directory-names standardized,
      taxonomic metadata files for Phymm generated,
      and included GLIMMER code compiled                    00h 41m 10s
    IMM construction complete                               22h 56m 14s
    local BLAST database construction complete (END)        23h 00m 34s

Note that after the installer has assessed whether it should use a new copy
of the RefSeq data or work from an existing one, the interactive portion of
the installation is complete, and you won't need to actively monitor the
installation.

For classification, in the worst case, Phymm runs in time comparable to
BLAST; 5,730 100-bp query reads took 36.3 minutes to classify with Phymm,
and the same set took 26.9 minutes to search for matches in the local
database using BLAST. For longer reads, BLAST's running time will usually be
several times that of Phymm.