EuPathDB

EuPathDB-Clean is a set of eukaryotic pathogen genomes originally downloaded from the EuPathDB Resource Project established by the National Institute of Allergy and Infectious Diseases (NIAID/NIH). These draft genomes were processed to remove all contaminating and low-complexity sequences.

Here we provide the cleaned eukaryotic pathogen genome FASTA files, which can be used with any metagenomics classification or analysis program. To use them, download each of the .tgz files and untar the FASTA files using:

mkdir -p my_eupathDB_folder; for file in *.tgz; do tar -xvzf "$file" -C my_eupathDB_folder/; done
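Because the archives are large, a truncated download is a realistic failure. The following is a minimal sketch, not part of the original instructions, that wraps the extraction loop in a function and tests each archive with tar -tzf before unpacking it. The function name extract_all and its two arguments are hypothetical.

```shell
# Extract every .tgz archive found in src_dir into dest_dir.
# Archives that tar cannot list (for example, truncated downloads) are
# skipped, and the function then returns a non-zero status.
extract_all() {
    src_dir=$1
    dest_dir=$2
    failed=0
    mkdir -p "$dest_dir"
    for archive in "$src_dir"/*.tgz; do
        [ -e "$archive" ] || continue      # the glob matched nothing
        if tar -tzf "$archive" > /dev/null 2>&1; then
            tar -xzf "$archive" -C "$dest_dir"
        else
            echo "Skipping unreadable archive: $archive" >&2
            failed=1
        fi
    done
    return "$failed"
}
```

For example, extract_all . my_eupathDB_folder unpacks every archive in the current directory and reports any that should be downloaded again.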

EuPathDB46 vs. EuPathDB28

As of December 1, 2020, we are releasing EuPathDB46-Clean. As compared to EuPathDB-28 (245 genomes), this database is larger (388 genomes) with fewer contaminating sequences. The database underwent the same contamination/low-complexity removal process with updated reference bacterial, archaeal, viral, vertebrate, and plant sequences.

EuPathDB-46 Downloads

EuPathDB-46: Kraken 2 Database
At the link above, we provide the pre-built Kraken 2 database along with the relevant Bracken files. The database is built from ONLY EuPathDB-46 genomes. In order to add EuPathDB-46 to an existing database, users will need to download the genomes themselves and build a new database.

EuPathDB46_Contents.txt lists all genomes in the database, including details about the filename, type, genus, species, and strain.

To build a database containing these genomes:

Make a folder for the database: mkdir my_db
Change directory to the database folder: cd my_db
Download the NCBI taxonomy: kraken2-build --download-taxonomy --db .
Download the seqid2taxid.map file for EuPathDB46: wget ftp://ftp.ccb.jhu.edu/pub/data/EuPathDB46/seqid2taxid.map
Make a library folder: mkdir library
Move to the library folder: cd library/
Download each of the genome archives using the links below, for example: wget ftp://ftp.ccb.jhu.edu/pub/data/EuPathDB46/AmoebaDB46.tgz
Uncompress each of the downloaded archives, for example: tar -xzvf AmoebaDB46.tgz
Move back to the database folder: cd ../
Build the Kraken/Kraken 2 database: kraken2-build --build --db .
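
The steps above can be collected into one script. This is a sketch under stated assumptions rather than an official build script: the names run, build_eupath_db, ARCHIVES, and DRY_RUN are hypothetical, paths are passed explicitly instead of changing directory, and only the AmoebaDB46.tgz archive name is taken from the steps above. The other archive names should be confirmed on the server before they are added to the list.

```shell
BASE_URL="ftp://ftp.ccb.jhu.edu/pub/data/EuPathDB46"
# Assumption: only this archive name appears in the steps above.
# Append the remaining archive names once they have been confirmed.
ARCHIVES="AmoebaDB46.tgz"

# Run a command, or only print it when DRY_RUN=1.
run() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Build a Kraken 2 database in db_dir from the EuPathDB46 archives.
build_eupath_db() {
    db_dir=$1
    run mkdir -p "$db_dir/library"
    run kraken2-build --download-taxonomy --db "$db_dir"
    run wget -P "$db_dir" "$BASE_URL/seqid2taxid.map"
    for archive in $ARCHIVES; do
        run wget -P "$db_dir/library" "$BASE_URL/$archive"
        run tar -xzf "$db_dir/library/$archive" -C "$db_dir/library"
    done
    run kraken2-build --build --db "$db_dir"
}
```

Setting DRY_RUN=1 before calling build_eupath_db my_db prints the commands without running them, which is a cheap way to review the plan before starting a multi-gigabyte download.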

Class	Genus Composition *	Number of EuPathDB-46 Files	Download	Uncompressed
AmoebaDB	Acanthamoeba, Entamoeba, Naegleria	30	AmoebaDB46_Clean.tgz (360 MB)	1.39 GB
CryptoDB	Chromera, Cryptosporidium, Gregarina, Vitrella	18	CryptoDB46_Clean.tgz (98 MB)	345 MB
FungiDB	Allomyces, Aspergillus, Candida, Clavispora, Coccidioides, Coprinopsis, Histoplasma, Malassezia, Melampsora, Mucor, Penicillium, Pythium, Saccharomyces, Yarrowia, etc.	164	FungiDB46_Clean.tgz (1.8 GB)	6.2 GB
GiardiaDB	Giardia, Monocercomonoides, Spironucleus	10	GiardiaDB46_Clean.tgz (50 MB)	179 MB
MicrosporidiaDB	Anncaliia, Edhazardia, Enterospora, Hepatospora, Mitosporidium, Nematocida, Spraguea, Vittaforma, etc.	35	MicrosporidiaDB46_Clean.tgz (62 MB)	245 MB
PiroplasmaDB	Babesia, Cytauxzoon, Theileria	10	PiroplasmaDB46_Clean.tgz (29 MB)	101 MB
PlasmoDB	Plasmodium	45	PlasmoDB46_Clean.tgz (185 MB)	1.1 GB
ToxoDB	Cyclospora, Cystoisospora, Eimeria, Hammondia, Neospora, Sarcocystis, Toxoplasma	33	ToxoDB46_Clean.tgz (550 MB)	2.1 GB
TrichDB	Trichomonas	1	TrichDB46_Clean.tgz (50 MB)	180 MB
TriTrypDB	Blechomonas, Bodo, Leishmania, Leptomonas, Paratrypanosoma, Trypanosoma	42	TriTrypDB46_Clean.tgz (406 MB)	1.5 GB
Total EuPathDB-46-Clean		388		13.4 GB

EuPathDB-28 Downloads

EuPathDB-28 Library: eupathDB.tar.gz (2.2 GB)
This file contains all 245 genomes in a single folder. Each genome is a multi-FASTA file with Kraken- and Kraken2-compatible headers. To build a Kraken/Kraken2 database with these files, extract the archive and place the resulting folder directly within the database's library/ folder.

For example, to build a database with human, bacteria, and eupathDB:
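tar -xzvf eupathDB.tar.gz   # creates a library/ folder containing the eupathDB files
mkdir -p "$DBNAME" && mv library/ "$DBNAME"/library/
kraken2-build --download-taxonomy --db $DBNAME   # the build step needs the NCBI taxonomy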
kraken2-build --download-library bacteria --db $DBNAME
kraken2-build --download-library human --db $DBNAME
kraken2-build --build --db $DBNAME
(It is NOT required to use kraken2-build --add-to-library for these files.)
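
Before building, it can be useful to confirm that the extracted files really carry taxonomy tags. The sketch below assumes that "Kraken-compatible headers" refers to the kraken:taxid|<number> tag described in the Kraken 2 manual; that reading is an assumption, and the function name count_untagged_headers is hypothetical. It counts the FASTA header lines in one file that lack such a tag.

```shell
# Print the number of FASTA header lines in a file that do not contain
# a "kraken:taxid|<number>" tag. Prints 0 when every header is tagged.
count_untagged_headers() {
    # grep -c exits non-zero when the count is 0, so the status is ignored.
    grep '^>' "$1" | grep -vc 'kraken:taxid|[0-9][0-9]*' || true
}
```

A result of 0 for every file in library/ suggests the genomes can be built without kraken2-build --add-to-library, as noted above. A non-zero count means some sequences would need a taxonomy ID from another source.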

EuPathDB-28 Kraken2 Database: eupathDB_kraken2.tar.gz (5.6 GB)
This folder is a pre-built Kraken2 database of the 245 eupathDB genomes. It contains three files: hash.k2d, opts.k2d, and taxo.k2d.

For example, to use this pre-built database:
tar -xzvf eupathDB_kraken2.tar.gz
kraken2 --db eupathDB_kraken2 MYSAMPLE.FNA > MYSAMPLE.KRAKEN2
(Do not run kraken2-build --build on this folder.)
(This database is NOT compatible with Kraken 1)
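
An interrupted download or a partial extraction can leave the database folder incomplete. As a minimal sketch, the hypothetical helper check_k2_db below verifies that the three files named above are present and non-empty before kraken2 is run.

```shell
# Return 0 only if db_dir contains non-empty hash.k2d, opts.k2d, and
# taxo.k2d files; otherwise report what is missing and return 1.
check_k2_db() {
    db_dir=$1
    missing=0
    for f in hash.k2d opts.k2d taxo.k2d; do
        if [ ! -s "$db_dir/$f" ]; then
            echo "Missing or empty: $db_dir/$f" >&2
            missing=1
        fi
    done
    return "$missing"
}
```

For example, check_k2_db eupathDB_kraken2 && kraken2 --db eupathDB_kraken2 MYSAMPLE.FNA > MYSAMPLE.KRAKEN2 runs the classification only when the folder looks complete.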

For more information on Kraken and Kraken 2, see: Kraken's Website and Kraken2's Website

Class	Genus Composition *	Number of EuPathDB-28 Files	Download	Uncompressed
AmoebaDB	Acanthamoeba, Entamoeba, Naegleria	29	AmoebaDB_Clean.tgz (326 MB)	1.37 GB
CryptoDB	Chromera, Cryptosporidium, Gregarina, Vitrella	11	CryptoDB_Clean.tgz (86.5 MB)	359 MB
FungiDB	Ajellomyces, Aspergillus, Candida, Coccidioides, Cryptococcus, Fusarium, Rhizopus, Saccharomyces, Trichoderma	87	FungiDB_Clean.tgz (931 MB)	3.5 GB
GiardiaDB	Giardia, Spironucleus	6	GiardiaDB_Clean.tgz (21 MB)	72 MB
MicrosporidiaDB	Anncaliia, Encephalitozoon, Mitosporidium, Nematocida	25	MicrosporidiaDB_Clean.tgz (35 MB)	199 MB
PiroplasmaDB	Babesia, Cytauxzoon, Theileria	8	PiroplasmaDB_Clean.tgz (21 MB)	76 MB
PlasmoDB	Plasmodium	9	PlasmoDB_Clean.tgz (27 MB)	216 MB
ToxoDB	Cyclospora, Eimeria, Hammondia, Neospora, Sarcocystis, Toxoplasma	30	ToxoDB_Clean.tgz (456 MB)	1.8 GB
TrichDB	Trichomonas	1	TrichDB_Clean.tgz (43 MB)	180 MB
TriTrypDB	Leishmania, Leptomonas, Trypanosoma	39	TriTrypDB_Clean.tgz (336 MB)	1.3 GB
Total EuPathDB-Clean		245	2.3 GB	9.1 GB

*The genera listed for FungiDB and MicrosporidiaDB are a subset of all genera represented. For the full list of genera contained in these classes, see the publication's supplementary table.

Publications

The publication associated with the data is located at:
J. Lu, S.L. Salzberg. (2018). "Removing contaminants from databases of draft genomes." PLoS Comput Biol https://doi.org/10.1371/journal.pcbi.1006277

Authors/Contributors

Jennifer Lu, Ph.D. ( jlu26 jhmi edu )
Steven Salzberg, Ph.D.

Page Updated: 2020/12/01
