Quick Start Guide: `train`

This guide walks you through the essentials of using the train subcommand to build an OpenSpliceAI model for splice site prediction from example data. The subcommand takes HDF5 datasets (generated by the create-data subcommand) as input and produces a trained deep learning model.

Before You Begin

Installation: Follow the instructions on the Installation page to install OpenSpliceAI and its dependencies.
Dataset Preparation: Ensure you have created the training and testing datasets using the create-data subcommand. See the Quick Start Guide: create-data guide for details. You need two files as inputs: dataset_train.h5 and dataset_test.h5.
Check Example Scripts: An example script is provided at examples/train/train_example.sh.
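
As a minimal sketch of the dataset check described above, the shell snippet below confirms that both HDF5 inputs exist and are non-empty before training starts. The helper name check_inputs is hypothetical and not part of OpenSpliceAI; the file names are the ones used in this guide.

```shell
# Hypothetical pre-flight helper (not part of OpenSpliceAI):
# returns 0 only if every file passed as an argument exists and is non-empty.
check_inputs() {
    missing=0
    for f in "$@"; do
        if [ ! -s "$f" ]; then
            echo "Missing or empty input: $f" >&2
            missing=1
        fi
    done
    return "$missing"
}

# Example usage with the file names from this guide.
if check_inputs dataset_train.h5 dataset_test.h5; then
    echo "Inputs found; ready to train."
else
    echo "Create the datasets with create-data first."
fi
```

This only checks that the files are present. It does not validate their HDF5 contents.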

One-liner Start

Suppose you have the following inputs, with the dataset files either generated by the create-data subcommand or downloaded directly from GitHub:

Training Dataset: dataset_train.h5
Testing Dataset: dataset_test.h5
Desired Flanking Size: 10,000 nt

Run this command to train your model:

openspliceai train \
   --flanking-size 10000 \
   --train-dataset dataset_train.h5 \
   --test-dataset dataset_test.h5 \
   --output-dir /path/to/model_train_outdir/ \
   --project-name human_MANE_example \
   --scheduler CosineAnnealingWarmRestarts \
   --loss cross_entropy_loss

This command will:

Load your training and testing HDF5 files.
Initialize and train the SpliceAI model using the specified flanking size.
Apply adaptive learning rate scheduling and early stopping.
Save model checkpoints (e.g., model_best.pt) and logs in the output directory.
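
For repeated runs, the command above can be wrapped in a small script so that paths and the flanking size are set in one place. The following is a sketch under stated assumptions: the variable names, the run_train function, and the DRY_RUN switch are hypothetical, and the only flags used are the ones shown in the command above.

```shell
# Illustrative settings; the values are the ones used in this guide.
FLANKING_SIZE=10000
TRAIN_H5=dataset_train.h5
TEST_H5=dataset_test.h5
OUT_DIR=/path/to/model_train_outdir/
PROJECT=human_MANE_example

# Hypothetical wrapper: builds the train command from the variables above.
# With DRY_RUN=1 (the default here) it prints the command instead of running it.
run_train() {
    set -- openspliceai train \
        --flanking-size "$FLANKING_SIZE" \
        --train-dataset "$TRAIN_H5" \
        --test-dataset "$TEST_H5" \
        --output-dir "$OUT_DIR" \
        --project-name "$PROJECT" \
        --scheduler CosineAnnealingWarmRestarts \
        --loss cross_entropy_loss
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Print the command that would be executed.
run_train
```

Set DRY_RUN=0 to launch training once the printed command looks right.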

The example outputs from this command can be found in the OpenSpliceAI GitHub repository.
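
To confirm that your own run produced checkpoints, you can list the .pt files under the output directory. This is a minimal sketch: list_checkpoints is a hypothetical helper, and because the exact layout inside the output directory is not described here, it searches recursively rather than assuming a fixed path.

```shell
# Hypothetical helper: list checkpoint files (*.pt) anywhere under a directory.
list_checkpoints() {
    out_dir=$1
    if [ ! -d "$out_dir" ]; then
        echo "Output directory not found: $out_dir" >&2
        return 1
    fi
    # Recursive search, since the subdirectory layout is an unknown here.
    find "$out_dir" -type f -name '*.pt' | sort
}

# Example usage with the output directory from the command in this guide.
list_checkpoints "${OUT_DIR:-/path/to/model_train_outdir/}" \
    || echo "No output directory yet; run training first."
```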

Note

The model trained in this example is not optimized for splice site prediction, because it was trained on only a small subset of the data. This example is intended solely to demonstrate the training process. For a fully optimized, pre-trained model, please refer to the Released OpenSpliceAI models guide.

Next Steps

Explore the `train` Options: Delve into the train documentation to discover how you can further customize your training process.
Calibration (Optional): Improve the reliability of your model’s probability outputs. See the Quick Start Guide: calibrate guide for detailed calibration instructions.
Prediction: Ready to make predictions? Follow the Quick Start Guide: predict guide to use your trained model for splice site prediction.
Advanced Options: Experiment with additional training parameters (such as epochs and patience) to fine-tune your model’s performance.