train#
Subcommand description#
This subcommand takes the training and testing HDF5 files, which are the outputs from the create-data subcommand, and trains a deep learning model to predict splice sites.
Input#
The input consists of the training and testing HDF5 files, which contain the processed gene sequences and their corresponding labels.
training HDF5 file: the dataset used for model training.
testing HDF5 file: the dataset held out for model testing.
Output#
The main output is the trained model, saved as a PT file that stores the model weights and architecture. The training and testing logs are also saved to log files.
Processing Steps#
The SpliceAI-Pytorch model is trained with the following steps and hyperparameters:
Optimization:
- The model uses the AdamW optimizer with an initial learning rate of 1e-3 and the ReduceLROnPlateau adaptive learning rate scheduler, with mode='min', factor=0.5, and patience=2.
Dataset Split:
- The training dataset is split into 90% for training and 10% for testing.
Training:
- The model is trained for up to 20 epochs.
- An early stopping condition is applied: if the validation loss does not improve for 5 consecutive epochs, training stops early.
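The steps above can be sketched in plain Python to show how the 90/10 split, the learning rate reduction, and early stopping interact. This is an illustrative simulation only, not the toolkit's code: the function names `split_indices` and `simulate_training` are hypothetical, and the plateau rule is a simplified imitation of ReduceLROnPlateau (strict improvement, no threshold or cooldown).

```python
import random


def split_indices(n_samples, train_fraction=0.9, seed=22):
    """Shuffle sample indices and split them 90% / 10% (hypothetical helper)."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    cut = int(round(n_samples * train_fraction))
    return indices[:cut], indices[cut:]


def simulate_training(val_losses, lr=1e-3, factor=0.5, lr_patience=2,
                      stop_patience=5, max_epochs=20):
    """Replay a list of per-epoch validation losses (hypothetical helper).

    Returns (epochs_run, final_lr, best_loss).
    """
    best = float("inf")
    lr_bad = 0      # epochs without improvement since the last LR change
    stop_bad = 0    # consecutive epochs without improvement
    epochs_run = 0
    for loss in val_losses[:max_epochs]:
        epochs_run += 1
        if loss < best:
            best = loss
            lr_bad = 0
            stop_bad = 0
        else:
            lr_bad += 1
            stop_bad += 1
        # Simplified plateau rule: halve the LR once the number of
        # non-improving epochs exceeds lr_patience, then restart the count.
        if lr_bad > lr_patience:
            lr *= factor
            lr_bad = 0
        # Early stopping: 5 consecutive epochs without improvement.
        if stop_bad >= stop_patience:
            break
    return epochs_run, lr, best


train_idx, held_out_idx = split_indices(1000)
print(len(train_idx), len(held_out_idx))

# Made-up losses: three improving epochs, then a plateau.
epochs_run, final_lr, best = simulate_training([1.0, 0.8, 0.7] + [0.9] * 10)
print(epochs_run, final_lr, best)
```

With the made-up losses in this example, the learning rate is halved once during the plateau and training stops before the 20-epoch limit is reached.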
Example of human MANE#
Input files#
To run this example, you will need the following two files. They can be either downloaded through the provided links or generated using the create-data subcommand.
dataset_train.h5: this is the dataset for model training.
dataset_test.h5: this is the dataset for model testing.
Commands#
The spliceai-toolkit command to train the SpliceAI-Pytorch model is as follows:
spliceai-toolkit train --flanking-size 10000 \
--exp-num full_dataset \
--train-dataset /ccb/cybertron/khchao/data/train_test_dataset_MANE_test/dataset_train.h5 \
--test-dataset /ccb/cybertron/khchao/data/train_test_dataset_MANE_test/dataset_test.h5 \
--output-dir /ccb/cybertron/khchao/spliceAI-toolkit/results/model_train_outdir/ \
--project-name human_MANE_adeptive_lr \
--random-seed 22 \
--model SpliceAI \
--loss cross_entropy_loss -d
After successfully running the train subcommand, you will get the following trained model and log files:
Output files#
PT model file: the trained SpliceAI-Pytorch model.
Usage#
usage: spliceai-toolkit train [-h] [--disable-wandb] --output-dir OUTPUT_DIR [--project-name PROJECT_NAME] [--flanking-size FLANKING_SIZE]
                              [--random-seed RANDOM_SEED] [--exp-num EXP_NUM] [--train-dataset TRAIN_DATASET] [--test-dataset TEST_DATASET]
                              [--loss LOSS] [--model MODEL]

options:
  -h, --help            show this help message and exit
  --disable-wandb, -d
  --output-dir OUTPUT_DIR, -o OUTPUT_DIR
                        Output directory to save the data
  --project-name PROJECT_NAME, -s PROJECT_NAME
  --flanking-size FLANKING_SIZE, -f FLANKING_SIZE
  --random-seed RANDOM_SEED, -r RANDOM_SEED
  --exp-num EXP_NUM, -e EXP_NUM
  --train-dataset TRAIN_DATASET, -train TRAIN_DATASET
  --test-dataset TEST_DATASET, -test TEST_DATASET
  --loss LOSS, -l LOSS  The loss function to train SpliceAI model
  --model MODEL, -m MODEL
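When the train subcommand is launched from a script, it can help to assemble the argument list programmatically. The following is a minimal sketch that uses only the long flags listed in the usage text above. The helper name `build_train_command` and the relative paths in the example are hypothetical, and the sketch assumes `spliceai-toolkit` is available on the PATH if the command is actually executed.

```python
import shlex


def build_train_command(output_dir, train_dataset, test_dataset, *,
                        flanking_size=None, exp_num=None, project_name=None,
                        random_seed=None, model=None, loss=None,
                        disable_wandb=False):
    """Build the argument list for `spliceai-toolkit train` (hypothetical helper)."""
    cmd = [
        "spliceai-toolkit", "train",
        "--output-dir", str(output_dir),
        "--train-dataset", str(train_dataset),
        "--test-dataset", str(test_dataset),
    ]
    optional = {
        "--flanking-size": flanking_size,
        "--exp-num": exp_num,
        "--project-name": project_name,
        "--random-seed": random_seed,
        "--model": model,
        "--loss": loss,
    }
    # Only pass flags that were explicitly set by the caller.
    for flag, value in optional.items():
        if value is not None:
            cmd += [flag, str(value)]
    if disable_wandb:
        cmd.append("--disable-wandb")
    return cmd


# Placeholder paths; replace them with the real dataset and output locations.
cmd = build_train_command(
    "model_train_outdir/", "dataset_train.h5", "dataset_test.h5",
    flanking_size=10000, exp_num="full_dataset", random_seed=22,
    model="SpliceAI", loss="cross_entropy_loss", disable_wandb=True,
)
print(shlex.join(cmd))
# To run it: subprocess.run(cmd, check=True)
```

Returning a list rather than a single string avoids shell quoting problems when paths contain spaces.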

