

EuPathDB-Clean is a set of 245 eukaryotic pathogen genomes originally downloaded from the EuPathDB Resource Project, established by the National Institute of Allergy and Infectious Diseases (NIAID/NIH). The 245 draft genomes were processed to remove all contaminating and low-complexity sequences.

Here we provide the 245 cleaned eukaryotic pathogen genome FASTA files, which can be used with any metagenomics classification or analysis program. To use them, download each of the .tgz files and extract the FASTA files into an existing folder (here my_eupathDB_folder/) using:
  • for file in *.tgz; do tar -xvzf "$file" -C my_eupathDB_folder/; done

NOTE: Files downloaded before August 2, 2018 contain duplicate FASTA entries. The files below have been updated to include only one copy of each FASTA sequence.
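If you are unsure whether an older download is affected, one way to check is to look for repeated sequence IDs in the extracted FASTA files. The following is a minimal sketch, not part of the EuPathDB-Clean distribution: the function name list_duplicate_ids is hypothetical, the sequence ID is assumed to be the first whitespace-delimited token of each header line, and the loop only looks at files directly inside my_eupathDB_folder/.

```shell
# Print each sequence ID that occurs more than once in a FASTA file.
# Assumption: the ID is the first whitespace-delimited token after ">".
# Prints nothing when every ID is unique.
list_duplicate_ids() {
    awk '/^>/ { print substr($1, 2) }' "$1" | sort | uniq -d
}

# Scan the files extracted into the folder used in the tar command above.
for fasta in my_eupathDB_folder/*; do
    [ -f "$fasta" ] || continue
    dups=$(list_duplicate_ids "$fasta")
    [ -z "$dups" ] || printf '%s has duplicate IDs:\n%s\n' "$fasta" "$dups"
done
```

If the loop prints nothing, no repeated IDs were found in the scanned files. If it reports duplicates, downloading the current archives again is likely the simplest fix.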

Downloads

Class | Genus Composition* | Number of Files | Download (compressed size) | Uncompressed Size
AmoebaDB | Acanthamoeba, Entamoeba, Naegleria | 29 | AmoebaDB_Clean.tgz (326 MB) | 1.37 GB
CryptoDB | Chromera, Cryptosporidium, Gregarina, Vitrella | 11 | CryptoDB_Clean.tgz (86.5 MB) | 359 MB
FungiDB | Ajellomyces, Aspergillus, Candida, Coccidioides, Cryptococcus, Fusarium, Rhizopus, Saccharomyces, Trichoderma | 87 | FungiDB_Clean.tgz (931 MB) | 3.5 GB
GiardiaDB | Giardia, Spironucleus | 6 | GiardiaDB_Clean.tgz (21 MB) | 72 MB
MicrosporidiaDB | Anncaliia, Encephalitozoon, Mitosporidium, Nematocida | 25 | MicrosporidiaDB_Clean.tgz (35 MB) | 199 MB
PiroplasmaDB | Babesia, Cytauxzoon, Theileria | 8 | PiroplasmaDB_Clean.tgz (21 MB) | 76 MB
PlasmoDB | Plasmodium | 9 | PlasmoDB_Clean.tgz (27 MB) | 216 MB
ToxoDB | Cyclospora, Eimeria, Hammondia, Neospora, Sarcocystis, Toxoplasma | 30 | ToxoDB_Clean.tgz (456 MB) | 1.8 GB
TrichDB | Trichomonas | 1 | TrichDB_Clean.tgz (43 MB) | 180 MB
TriTrypDB | Leishmania, Leptomonas, Trypanosoma | 39 | TriTrypDB_Clean.tgz (336 MB) | 1.3 GB
Total | EuPathDB-Clean | 245 | 2.3 GB | 9.1 GB
*The genera listed for FungiDB and MicrosporidiaDB are a subset of all genera represented. For the full list of genera contained in these classes, see the supplementary table of the publication.
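The file counts in the table can serve as a quick sanity check on a download. The sketch below counts the non-directory entries in an archive and compares the result with an expected number. It assumes that each archive contains only the FASTA files counted in the table, which this page does not state explicitly. The function names count_archive_files and check_archive are hypothetical.

```shell
# Count the non-directory entries in a .tgz archive.
# tar lists directory entries with a trailing "/", so those are excluded.
count_archive_files() {
    tar -tzf "$1" | grep -vc '/$'
}

# Compare an archive's file count with an expected value.
# Usage: check_archive ARCHIVE EXPECTED_COUNT
check_archive() {
    actual=$(count_archive_files "$1")
    if [ "$actual" -eq "$2" ]; then
        echo "OK $1 ($actual files)"
    else
        echo "MISMATCH $1: expected $2, found $actual"
    fi
}

# Example calls, using counts from the table above:
#   check_archive AmoebaDB_Clean.tgz 29
#   check_archive TrichDB_Clean.tgz 1
```

A mismatch does not necessarily mean the download is corrupt, since the archives may include extra files. It is only a prompt to inspect the contents with tar -tzf.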

To build a Kraken database from the genomes above, download the genomes and add them directly to your database's library/ folder. Then download this seqid2taxid.map file and place it directly in your database folder. From there, you can build the database as normal.
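The staging steps described above might look like the following sketch. The function name stage_kraken_db and the paths in the usage comments are hypothetical. The kraken-build commands in the comments assume Kraken 1 is installed and that "as normal" means downloading the taxonomy and then building. For Kraken 2, the corresponding program is kraken2-build.

```shell
# Copy extracted genomes and the seqid2taxid.map file into a Kraken
# database folder laid out as described above.
# Usage: stage_kraken_db DB_DIR GENOME_DIR MAP_FILE
stage_kraken_db() {
    db_dir=$1
    genome_dir=$2
    map_file=$3
    mkdir -p "$db_dir/library"
    # Copy the contents of the genome folder (including any subfolders).
    cp -R "$genome_dir"/. "$db_dir/library/"
    # The map file goes in the database folder itself, not in library/.
    cp "$map_file" "$db_dir/seqid2taxid.map"
}

# Example (paths are placeholders):
#   stage_kraken_db my_kraken_db my_eupathDB_folder seqid2taxid.map
# Then build as usual, for example with Kraken 1:
#   kraken-build --download-taxonomy --db my_kraken_db
#   kraken-build --build --db my_kraken_db
```

Copying rather than moving the files keeps the extracted originals intact, at the cost of roughly 9.1 GB of additional disk space for the full set.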

Publications

The publication associated with the data is located at:
J. Lu, S.L. Salzberg. (2018). "Removing contaminants from databases of draft genomes." PLoS Comput Biol https://doi.org/10.1371/journal.pcbi.1006277

Authors/Contributors

Jennifer Lu ( jlu26 jhmi edu ) Ph.D. Candidate
Steven Salzberg, Ph.D.

Page Updated: 2018/01/30
