Quick Start Guide: `create-data`

This page provides a concise guide for using OpenSpliceAI's create-data subcommand, which converts genomic sequences and annotations into HDF5-formatted training/testing datasets.

Before You Begin

Install OpenSpliceAI: Ensure you have installed OpenSpliceAI and its dependencies as described in the Installation page.
Acquire Reference Files: You need a reference genome in FASTA format and a corresponding annotation (GFF/GTF) file.
Check Example Scripts: An example script is provided at examples/create-data/create-data_example.sh.

One-liner Start

Reference Genome (FASTA): GCF_000001405.40_GRCh38.p14_genomic_10_sample.fna
Reference Annotation (GFF/GTF): MANE.GRCh38.v1.3.refseq_genomic_10_sample.gff

To create training and testing HDF5 files:

openspliceai create-data \
   --remove-paralogs \
   --min-identity 0.8 \
   --min-coverage 0.8 \
   --parse-type canonical \
   --split-method human \
   --canonical-only \
   --genome-fasta GCF_000001405.40_GRCh38.p14_genomic_10_sample.fna \
   --annotation-gff MANE.GRCh38.v1.3.refseq_genomic_10_sample.gff \
   --output-dir train_test_dataset/

After this step, you should see two main files (dataset_train.h5 and dataset_test.h5) in the specified output directory, along with intermediate files. These HDF5 files contain one-hot-encoded gene sequences and corresponding splice site labels.
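
As a quick sanity check after the command finishes, the sketch below confirms that both expected files are present and non-empty. The file names and output directory are taken from the example above; the helper name `check_outputs` is hypothetical and not part of OpenSpliceAI. It only tests for file presence and does not validate the HDF5 contents.

```shell
# Hypothetical helper (not part of OpenSpliceAI): report whether the two
# HDF5 files named in this guide exist and are non-empty.
# Usage: check_outputs <output-dir>
check_outputs() {
    out_dir="${1%/}"   # drop a trailing slash, if any
    rc=0
    for name in dataset_train.h5 dataset_test.h5; do
        if [ -s "$out_dir/$name" ]; then
            echo "found: $out_dir/$name"
        else
            echo "missing or empty: $out_dir/$name" >&2
            rc=1
        fi
    done
    return "$rc"
}

# Check the output directory used in the example command.
check_outputs train_test_dataset/ || echo "create-data output looks incomplete" >&2
```

A non-zero return value means at least one of the two files was not found, in which case the console output of create-data is the first place to look.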

Next Steps

Explore `create-data` Options: Dive into the create-data documentation to learn how to customize your dataset creation process.
Further Customization: Experiment with additional command-line options, such as --biotype and --chr-split, for even more tailored dataset creation.
Begin Model Training: Follow the train Quick Start Guide to begin training your OpenSpliceAI model on the generated datasets.
