Quick Start Guide: create-data

This page provides a concise guide for using OpenSpliceAI's create-data subcommand, which converts genomic sequences and annotations into HDF5-formatted training/testing datasets.


Before You Begin

  • Install OpenSpliceAI: Ensure you have installed OpenSpliceAI and its dependencies as described in the Installation page.

  • Acquire Reference Files: You need a reference genome in FASTA format and a corresponding annotation (GFF/GTF) file.

  • Check Example Scripts: An example script is provided at examples/create-data/create-data_example.sh.
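
Before running create-data, it can help to confirm that both reference files are present and look roughly like the expected formats. The following is a minimal shell sketch and is not part of OpenSpliceAI: the function name check_inputs and the checks it performs are illustrative assumptions, and it assumes an uncompressed FASTA file.

```shell
# Hypothetical helper (not part of OpenSpliceAI): sanity-check reference inputs.
# Usage: check_inputs <genome.fna> <annotation.gff>
check_inputs() {
    local fasta="$1" annot="$2"

    # Both files must exist and be non-empty.
    for f in "$fasta" "$annot"; do
        if [ ! -s "$f" ]; then
            echo "missing or empty: $f" >&2
            return 1
        fi
    done

    # An uncompressed FASTA file starts with a '>' header line.
    if ! head -n 1 "$fasta" | grep -q '^>'; then
        echo "not FASTA (no '>' header): $fasta" >&2
        return 1
    fi

    echo "inputs look OK"
}
```

Called with the paths to your genome and annotation files, the function prints "inputs look OK" and returns 0 when both checks pass. It does not validate the annotation contents beyond confirming the file is non-empty.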


One-liner Start

This example uses the following input files:

  1. Reference Genome (FASTA): GCF_000001405.40_GRCh38.p14_genomic_10_sample.fna

  2. Reference Annotation (GFF/GTF): MANE.GRCh38.v1.3.refseq_genomic_10_sample.gff

To create training and testing HDF5 files:

openspliceai create-data \
   --remove-paralogs \
   --min-identity 0.8 \
   --min-coverage 0.8 \
   --parse-type canonical \
   --split-method human \
   --canonical-only \
   --genome-fasta GCF_000001405.40_GRCh38.p14_genomic_10_sample.fna \
   --annotation-gff MANE.GRCh38.v1.3.refseq_genomic_10_sample.gff \
   --output-dir train_test_dataset/

After this step, you should see two main files (dataset_train.h5 and dataset_test.h5) in the specified output directory, along with intermediate files. These HDF5 files contain one-hot-encoded gene sequences and corresponding splice site labels.
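
A quick way to confirm that the run produced both main files is a small existence check. This is a minimal sketch, not an OpenSpliceAI feature: the function name check_outputs is a hypothetical helper, and it only tests that dataset_train.h5 and dataset_test.h5 exist and are non-empty, not that their contents are valid.

```shell
# Hypothetical helper (not part of OpenSpliceAI): confirm the two main outputs exist.
# Usage: check_outputs <output-dir>
check_outputs() {
    local outdir="${1%/}"   # drop a trailing slash, if any
    local status=0

    for name in dataset_train.h5 dataset_test.h5; do
        if [ -s "$outdir/$name" ]; then
            echo "found: $outdir/$name"
        else
            echo "missing or empty: $outdir/$name" >&2
            status=1
        fi
    done

    # Non-zero if either file is missing or empty.
    return "$status"
}
```

For the command shown earlier, this would be called as check_outputs train_test_dataset/ and returns a non-zero status if either file is missing.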


Next Steps

  • Explore create-data Options: Dive into the create-data documentation to learn how to customize your dataset creation process.

  • Further Customization: Experiment with additional command-line options, such as --biotype and --chr-split, for even more tailored dataset creation.

  • Begin Model Training: Follow the "Quick Start Guide: train" page to start training your OpenSpliceAI model on the generated datasets.