Quick Start Guide: create-data
This page provides a concise guide for using OpenSpliceAI's create-data subcommand, which converts genomic sequences and annotations into HDF5-formatted training/testing datasets.
Before You Begin
Install OpenSpliceAI: Ensure you have installed OpenSpliceAI and its dependencies as described in the Installation page.
Acquire Reference Files: You need a reference genome in FASTA format and a corresponding annotation (GFF/GTF) file.
Check Example Scripts: We provide an example script at examples/create-data/create-data_example.sh.
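Before running create-data, it can help to confirm that the annotation and the genome refer to the same sequence names. The following is a minimal sketch, not part of OpenSpliceAI: missing_seqids is a hypothetical helper, and it assumes that column 1 of the GFF/GTF file should match the first word of each FASTA header.

```shell
# Sketch only: missing_seqids is a hypothetical helper, not an OpenSpliceAI command.
# It prints every sequence ID used in the annotation that has no FASTA record.
missing_seqids() {
  fasta="$1"
  gff="$2"
  fa_ids=$(mktemp)
  gff_ids=$(mktemp)
  # First word of each FASTA header line, without the leading ">"
  awk '/^>/ { sub(/^>/, "", $1); print $1 }' "$fasta" | sort -u > "$fa_ids"
  # Column 1 of each non-comment, 9-column annotation line
  awk -F'\t' '!/^#/ && NF >= 9 { print $1 }' "$gff" | sort -u > "$gff_ids"
  # comm -13 keeps only the lines unique to the second file (the annotation IDs)
  comm -13 "$fa_ids" "$gff_ids"
  rm -f "$fa_ids" "$gff_ids"
}

fasta=GCF_000001405.40_GRCh38.p14_genomic_10_sample.fna
gff=MANE.GRCh38.v1.3.refseq_genomic_10_sample.gff
if [ -f "$fasta" ] && [ -f "$gff" ]; then
  missing_seqids "$fasta" "$gff"
fi
```

Empty output means every annotated sequence ID was found in the FASTA file.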
One-liner Start
Reference Genome (FASTA): GCF_000001405.40_GRCh38.p14_genomic_10_sample.fna
Reference Annotation (GFF/GTF): MANE.GRCh38.v1.3.refseq_genomic_10_sample.gff
To create training and testing HDF5 files:
openspliceai create-data \
--remove-paralogs \
--min-identity 0.8 \
--min-coverage 0.8 \
--parse-type canonical \
--split-method human \
--canonical-only \
--genome-fasta GCF_000001405.40_GRCh38.p14_genomic_10_sample.fna \
--annotation-gff MANE.GRCh38.v1.3.refseq_genomic_10_sample.gff \
--output-dir train_test_dataset/
After this step, you should see two main files (dataset_train.h5 and dataset_test.h5) in the specified output directory, along with intermediate files. These HDF5 files contain one-hot-encoded gene sequences and corresponding splice site labels.
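A quick way to confirm that the run produced both files is a small shell check. This is a sketch under the assumption that the output directory is train_test_dataset/ as in the command above; check_outputs is a hypothetical helper, not an OpenSpliceAI command.

```shell
# Sketch only: check_outputs is a hypothetical helper, not part of OpenSpliceAI.
# It reports whether both expected HDF5 files exist and are non-empty.
check_outputs() {
  out_dir="${1%/}"   # drop a trailing slash, if any
  status=0
  for f in dataset_train.h5 dataset_test.h5; do
    if [ -s "$out_dir/$f" ]; then
      echo "found: $out_dir/$f"
    else
      echo "missing or empty: $out_dir/$f" >&2
      status=1
    fi
  done
  return "$status"
}

check_outputs train_test_dataset/ || echo "create-data output is incomplete" >&2
```

If the HDF5 command-line tools happen to be installed, running h5ls on either file lists the datasets stored in it.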
Next Steps
Explore create-data Options: Dive into the create-data documentation to learn how to customize your dataset creation process.
Further Customization: Experiment with additional command-line options, such as --biotype and --chr-split, for even more tailored dataset creation.
Begin Model Training: Follow the Quick Start Guide: train to start training your OpenSpliceAI model using your generated datasets.

