create-data#
The create-data
subcommand processes genomic sequences (FASTA) and genome annotations (GFF/GTF) to produce HDF5-formatted (Hierarchical Data Format version 5) datasets for splice site prediction. It transforms a reference genome and its annotations into one-hot-encoded pre-mRNA sequences (input tensor
Input Files#
Reference Genome (FASTA)
A FASTA file containing the genomic sequence.
Example:
GCF_000001405.40_GRCh38.p14_genomic.fna <ftp://ftp.ccb.jhu.edu/pub/data/OpenSpliceAI/train_data/spliceai-mane/GCF_000001405.40_GRCh38.p14_genomic.fna>
Reference Annotation (GFF/GTF)
A GFF3 file containing genome annotations.
Example:
MANE.GRCh38.v1.3.refseq_genomic.gff <ftp://ftp.ccb.jhu.edu/pub/data/OpenSpliceAI/train_data/spliceai-mane/MANE.GRCh38.v1.3.refseq_genomic.gff>
Output Files#
The subcommand produces several outputs:
Training Dataset:
dataset_train.h5 <ftp://ftp.ccb.jhu.edu/pub/data/OpenSpliceAI/train_data/spliceai-mane/dataset_train.h5>
Testing Dataset:
dataset_test.h5 <ftp://ftp.ccb.jhu.edu/pub/data/OpenSpliceAI/train_data/spliceai-mane/dataset_test.h5>
Intermediate Files:
datafile_train.h5
datafile_test.h5
stats.txt
These files contain the processed gene sequences (tensor
Usage#
usage: openspliceai create-data [-h] --annotation-gff ANNOTATION_GFF --genome-fasta GENOME_FASTA --output-dir OUTPUT_DIR
[--parse-type {canonical,all_isoforms}] [--biotype {protein-coding,non-coding}] [--chr-split {train-test,test}]
[--split-method {random,human}] [--split-ratio SPLIT_RATIO] [--canonical-only] [--flanking-size FLANKING_SIZE]
[--verify-h5] [--remove-paralogs] [--min-identity MIN_IDENTITY] [--min-coverage MIN_COVERAGE] [--write-fasta]
optional arguments:
-h, --help show this help message and exit
--annotation-gff ANNOTATION_GFF
Path to the GFF file
--genome-fasta GENOME_FASTA
Path to the FASTA file
--output-dir OUTPUT_DIR
Output directory to save the data
--parse-type {canonical,all_isoforms}
Type of transcript processing
--biotype {protein-coding,non-coding}
Biotype of transcript processing
--chr-split {train-test,test}
Whether to obtain testing or both training and testing groups
--split-method {random,human}
Chromosome split method for training and testing dataset
--split-ratio SPLIT_RATIO
Ratio of training and testing dataset
--canonical-only Flag to obtain only canonical splice site pairs
--flanking-size FLANKING_SIZE
Sum of flanking sequence lengths on each side of input (i.e. 40+40)
--verify-h5 Verify the generated HDF5 file(s)
--remove-paralogs Remove paralogous sequences between training and testing dataset
--min-identity MIN_IDENTITY
Minimum minimap2 alignment identity for paralog removal between training and testing dataset
--min-coverage MIN_COVERAGE
Minimum minimap2 alignment coverage for paralog removal between training and testing dataset
--write-fasta Flag to write out sequences into fasta files
Examples#
Example: Creating the Human MANE Dataset#
To generate training and testing datasets for the human MANE dataset, first download the following files:
Reference Genome:
GCF_000001405.40_GRCh38.p14_genomic.fna
Reference Annotation:
MANE.GRCh38.v1.3.refseq_genomic.gff
Then, execute the command:
openspliceai create-data \
--genome-fasta GCF_000001405.40_GRCh38.p14_genomic.fna \
--annotation-gff MANE.GRCh38.v1.3.refseq_genomic.gff \
--output-dir train_test_dataset_MANE/
After a successful run, the following files will be generated:
dataset_train.h5 (main file for training)
dataset_test.h5 (main file for testing)
Intermediate Files:
datafile_train.h5
,datafile_test.h5
, andstats.txt
Processing Pipeline#
Sequence Conversion: Each gene is transformed into a 3D tensor
using one-hot encoding. The nucleotides are represented as follows:A = [1, 0, 0, 0]
C = [0, 1, 0, 0]
G = [0, 0, 1, 0]
T (or U) = [0, 0, 0, 1]
N = [0, 0, 0, 0]
Chunking and Padding: The gene sequence is split into overlapping chunks with:
Window size (W): Default 5,000 nucleotides.
Flanking sequence (F): Typical sizes include 80, 400, 2,000, or 10,000 nucleotides.
The tensor
is given by:where
is the gene length. Any remaining sequence is padded withN
characters so that each chunk is a multiple of the window size.Example: For a gene of 12,000 nucleotides with
W = 5000
andF = 10,000
, the tensor will have the shape:
Label Tensor Construction: Labels are generated from genome annotations and encoded as follows:
Donor site: [0, 0, 1]
Acceptor site: [0, 1, 0]
Non-splice site: [1, 0, 0]
Padding: [0, 0, 0]
The resulting label tensor
has the shape:
Overlapping Chunks: Following SpliceAI’s methodology, gene sequences are divided into overlapping chunks using a step size equal to the window size (5,000 nucleotides). Flanking sequences (also defaulted to 5,000 nucleotides) are appended on each side.
Example: A 22,000-nucleotide gene is divided into 5 chunks, resulting in a tensor shape of
(5, 15,000, 4)
for the sequences and(5, 5,000, 3)
for the labels.
For each gene locus, the longest transcript is selected as the canonical transcript.
By default, the
--biotype
flag is set to protein-coding, thereby filtering out non-coding genes.
Automatic Splitting: If not explicitly provided, the toolkit splits chromosomes into training and testing sets with an approximate 80:20 ratio.
Chromosome lengths are retrieved from the GFFUtils database (https://github.com/daler/gffutils).
For non-human species, a random shuffling method is applied by default.
User-Specified Splitting: Use the
--chr-split
option to manually specify chromosome assignments.
Pseudogenes Filtering: Genes marked as
pseudogene
(either in the feature type or via thegene_biotype
attribute) are excluded from the test dataset.Paralogous Gene Removal: To avoid sequence similarity between training and testing sets (and thus data leakage), the toolkit uses mappy (a Python wrapper for minimap2) with the
--asm20
argument (allowing a divergence threshold of 5%). Test sequences sharing over 80% similarity and 80% coverage with training sequences are removed.
Canonical-only Mode: The
--canonical-only
flag restricts label generation to conserved splice site motifs. These include:U2-snRNP motifs:
GT-AG
andGC-AG
U12-snRNP motifs:
GT-AG
andAT-AC
This option helps mitigate the effect of misannotated splice sites.
Conclusion#
The create-data
subcommand provides a robust framework for transforming genomic and annotation data into machine-learning-ready datasets. Its careful handling of transcript selection, sequence chunking, one-hot encoding, and rigorous filtering of pseudogenes and paralogous sequences ensures high-quality training and testing sets for splice site prediction. For more details and troubleshooting, please refer to the OpenSpliceAI help command or the official documentation.

