create-data#
The create-data subcommand processes genomic sequences (FASTA) and genome annotations (GFF/GTF) to produce HDF5-formatted (Hierarchical Data Format version 5) datasets for splice site prediction. It transforms a reference genome and its annotations into one-hot-encoded pre-mRNA sequences (input tensor \(X\)) along with corresponding splice site labels (output tensor \(Y\)). This subcommand is part of the OpenSpliceAI toolkit and is designed to create high-quality training and testing datasets.
Input Files#
Reference Genome (FASTA)
A FASTA file containing the genomic sequence.
Example:
GCF_000001405.40_GRCh38.p14_genomic.fna <ftp://ftp.ccb.jhu.edu/pub/data/OpenSpliceAI/train_data/spliceai-mane/GCF_000001405.40_GRCh38.p14_genomic.fna>
Reference Annotation (GFF/GTF)
A GFF3 file containing genome annotations.
Example:
MANE.GRCh38.v1.3.refseq_genomic.gff <ftp://ftp.ccb.jhu.edu/pub/data/OpenSpliceAI/train_data/spliceai-mane/MANE.GRCh38.v1.3.refseq_genomic.gff>
Output Files#
The subcommand produces several outputs:
Training Dataset:
dataset_train.h5 <ftp://ftp.ccb.jhu.edu/pub/data/OpenSpliceAI/train_data/spliceai-mane/dataset_train.h5>
Testing Dataset:
dataset_test.h5 <ftp://ftp.ccb.jhu.edu/pub/data/OpenSpliceAI/train_data/spliceai-mane/dataset_test.h5>
Intermediate Files:
datafile_train.h5
datafile_test.h5
stats.txt
These files contain the processed gene sequences (tensor \(X\)) and splice site labels (tensor \(Y\)).
Usage#
usage: openspliceai create-data [-h] --annotation-gff ANNOTATION_GFF --genome-fasta GENOME_FASTA --output-dir OUTPUT_DIR
[--parse-type {canonical,all_isoforms}] [--biotype {protein-coding,non-coding}] [--chr-split {train-test,test}]
[--split-method {random,human}] [--split-ratio SPLIT_RATIO] [--canonical-only] [--flanking-size FLANKING_SIZE]
[--verify-h5] [--remove-paralogs] [--min-identity MIN_IDENTITY] [--min-coverage MIN_COVERAGE] [--write-fasta]
optional arguments:
-h, --help show this help message and exit
--annotation-gff ANNOTATION_GFF
Path to the GFF file
--genome-fasta GENOME_FASTA
Path to the FASTA file
--output-dir OUTPUT_DIR
Output directory to save the data
--parse-type {canonical,all_isoforms}
Type of transcript processing
--biotype {protein-coding,non-coding}
Biotype of transcript processing
--chr-split {train-test,test}
Whether to obtain testing or both training and testing groups
--split-method {random,human}
Chromosome split method for training and testing dataset
--split-ratio SPLIT_RATIO
Ratio of training and testing dataset
--canonical-only Flag to obtain only canonical splice site pairs
--flanking-size FLANKING_SIZE
Sum of flanking sequence lengths on each side of input (i.e. 40+40)
--verify-h5 Verify the generated HDF5 file(s)
--remove-paralogs Remove paralogous sequences between training and testing dataset
--min-identity MIN_IDENTITY
Minimum minimap2 alignment identity for paralog removal between training and testing dataset
--min-coverage MIN_COVERAGE
Minimum minimap2 alignment coverage for paralog removal between training and testing dataset
--write-fasta Flag to write out sequences into fasta files
Examples#
Example: Creating the Human MANE Dataset#
To generate training and testing datasets for the human MANE dataset, first download the following files:
Reference Genome:
GCF_000001405.40_GRCh38.p14_genomic.fna
Reference Annotation:
MANE.GRCh38.v1.3.refseq_genomic.gff
Then, execute the command:
openspliceai create-data \
--genome-fasta GCF_000001405.40_GRCh38.p14_genomic.fna \
--annotation-gff MANE.GRCh38.v1.3.refseq_genomic.gff \
--output-dir train_test_dataset_MANE/
After a successful run, the following files will be generated:
dataset_train.h5 (main file for training)
dataset_test.h5 (main file for testing)
Intermediate Files:
datafile_train.h5, datafile_test.h5, and stats.txt
Processing Pipeline#
Sequence Conversion: Each gene is transformed into a 3D tensor \(X\) using one-hot encoding. The nucleotides are represented as follows:
A = [1, 0, 0, 0]
C = [0, 1, 0, 0]
G = [0, 0, 1, 0]
T (or U) = [0, 0, 0, 1]
N = [0, 0, 0, 0]
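The encoding above can be sketched in plain Python. This is a minimal illustration, not the toolkit's own code: the helper name `one_hot_encode` is hypothetical, and mapping every unlisted character to all zeros (as for N) is an assumption.

```python
# Mapping copied from the list above; T and U share the same row.
ONE_HOT = {
    "A": [1, 0, 0, 0],
    "C": [0, 1, 0, 0],
    "G": [0, 0, 1, 0],
    "T": [0, 0, 0, 1],
    "U": [0, 0, 0, 1],
}


def one_hot_encode(sequence):
    """Return one 4-element row per nucleotide (hypothetical helper).

    Assumption: any character not listed above (for example N) is
    encoded as [0, 0, 0, 0].
    """
    return [list(ONE_HOT.get(base, [0, 0, 0, 0])) for base in sequence.upper()]


encoded = one_hot_encode("ACGTN")
# encoded has 5 rows of 4 values each, i.e. shape (5, 4)
```

Stacking the rows of all chunks of a gene gives the 3D tensor described in the next step.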
Chunking and Padding: The gene sequence is split into overlapping chunks with:
Window size (W): Default 5,000 nucleotides.
Flanking sequence (F): Typical sizes include 80, 400, 2,000, or 10,000 nucleotides.
The shape of tensor \(X\) is given by:
\[\lceil L / W \rceil \times (F + W) \times 4\]
where \(L\) is the gene length. The end of the sequence is padded with N characters so that its length is a multiple of the window size.
Example: For a gene of 12,000 nucleotides with W = 5,000 and F = 10,000, the tensor \(X\) will have the shape:
\[\lceil 12000 / 5000 \rceil \times (10000 + 5000) \times 4 = 3 \times 15000 \times 4\]
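The shape formula can be checked with a few lines of Python. The function name `input_tensor_shape` is hypothetical. The default window is the documented 5,000, while the flanking default of 10,000 is simply the value used in the example, not a confirmed toolkit default.

```python
import math


def input_tensor_shape(gene_length, window=5000, flanking=10000):
    """Shape of X: ceil(L / W) chunks, each F + W positions by 4 channels."""
    n_chunks = math.ceil(gene_length / window)
    return (n_chunks, flanking + window, 4)


shape = input_tensor_shape(12000)
# shape == (3, 15000, 4), matching the worked example above
```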
Label Tensor Construction: Labels are generated from genome annotations and encoded as follows:
Donor site: [0, 0, 1]
Acceptor site: [0, 1, 0]
Non-splice site: [1, 0, 0]
Padding: [0, 0, 0]
The resulting label tensor \(Y\) has the shape:
\[\lceil L / W \rceil \times W \times 3\]
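A minimal sketch of how such a label tensor could be assembled follows. The function `build_labels` and its inputs are hypothetical: splice sites are assumed to be given as 0-based positions within the gene, and the toolkit's actual coordinate handling may differ.

```python
import math

# Encodings copied from the list above.
NON_SPLICE = [1, 0, 0]
ACCEPTOR = [0, 1, 0]
DONOR = [0, 0, 1]
PADDING = [0, 0, 0]


def build_labels(gene_length, donors, acceptors, window=5000):
    """Return ceil(L / W) chunks of W label rows each (hypothetical helper).

    Assumption: donors and acceptors are 0-based positions within the gene.
    """
    n_chunks = math.ceil(gene_length / window)
    rows = [list(NON_SPLICE) for _ in range(gene_length)]
    for pos in donors:
        rows[pos] = list(DONOR)
    for pos in acceptors:
        rows[pos] = list(ACCEPTOR)
    # Pad the tail so that the total length is a multiple of the window size.
    rows += [list(PADDING) for _ in range(n_chunks * window - gene_length)]
    return [rows[i * window:(i + 1) * window] for i in range(n_chunks)]


# Tiny example: a 7-nucleotide "gene" with a window of 5 gives shape (2, 5, 3).
labels = build_labels(7, donors=[2], acceptors=[6], window=5)
```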
Overlapping Chunks: Following SpliceAI’s methodology, gene sequences are divided into chunks using a step size equal to the window size (5,000 nucleotides). Flanking sequences (5,000 nucleotides on each side by default, so F = 10,000) are appended to each window, so consecutive chunks overlap in their flanking regions.
Example: A 22,000-nucleotide gene is divided into 5 chunks, resulting in a tensor shape of (5, 15,000, 4) for the sequences and (5, 5,000, 3) for the labels.
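The chunking scheme can be illustrated on plain strings before one-hot encoding. This is a sketch under stated assumptions: `chunk_sequence` is a hypothetical name, and the flanks at the gene boundaries are filled with N here for simplicity, which may not match what the toolkit does with neighbouring genomic sequence.

```python
import math


def chunk_sequence(sequence, window=5000, flank_each_side=5000):
    """Split a sequence into windows that each carry a flank on both sides."""
    n_chunks = math.ceil(len(sequence) / window)
    tail_padding = n_chunks * window - len(sequence)
    padded = (
        "N" * flank_each_side        # left flank (assumed to be N here)
        + sequence
        + "N" * tail_padding         # pad to a multiple of the window size
        + "N" * flank_each_side      # right flank (assumed to be N here)
    )
    span = window + 2 * flank_each_side
    # The step equals the window size, so neighbouring chunks share flanks.
    return [padded[i * window:i * window + span] for i in range(n_chunks)]


chunks = chunk_sequence("A" * 22000)
# 5 chunks of 15,000 characters each, matching the (5, 15,000, 4) example
```

Because the step is W while each chunk spans F + W positions, the last F positions of one chunk are the first F positions of the next.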
For each gene locus, the longest transcript is selected as the canonical transcript.
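A possible way to pick the longest transcript per gene is sketched below. The record layout and the name `select_canonical` are hypothetical. Length is measured here as genomic span from 1-based inclusive coordinates, which is an assumption: the toolkit may measure length differently (for example, as summed exon length).

```python
def select_canonical(transcripts):
    """Map each gene ID to the ID of its longest transcript.

    transcripts: iterable of (gene_id, transcript_id, start, end) tuples,
    assumed to use 1-based inclusive coordinates as in GFF files.
    On ties, the transcript seen first is kept.
    """
    best = {}
    for gene_id, transcript_id, start, end in transcripts:
        length = end - start + 1
        if gene_id not in best or length > best[gene_id][1]:
            best[gene_id] = (transcript_id, length)
    return {gene_id: tx_id for gene_id, (tx_id, _) in best.items()}


canonical = select_canonical([
    ("geneA", "txA1", 100, 900),
    ("geneA", "txA2", 100, 1500),
    ("geneB", "txB1", 50, 60),
])
# canonical == {"geneA": "txA2", "geneB": "txB1"}
```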
By default, the --biotype flag is set to protein-coding, thereby filtering out non-coding genes.
Automatic Splitting: If not explicitly provided, the toolkit splits chromosomes into training and testing sets with an approximate 80:20 ratio.
Chromosome lengths are retrieved from the GFFUtils database (https://github.com/daler/gffutils).
For non-human species, a random shuffling method is applied by default.
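One way to obtain an approximate 80:20 split by chromosome length with random shuffling is sketched below. It is only an illustration: `split_chromosomes`, the greedy assignment, and the fixed seed are assumptions, not the toolkit's documented procedure.

```python
import random


def split_chromosomes(chrom_lengths, train_ratio=0.8, seed=0):
    """Shuffle chromosomes, then fill the training set up to the target size.

    chrom_lengths: dict mapping chromosome name to length in nucleotides.
    Returns (train_names, test_names).
    """
    names = sorted(chrom_lengths)
    random.Random(seed).shuffle(names)  # seeded for reproducibility
    target = train_ratio * sum(chrom_lengths.values())
    train, test, running = [], [], 0
    for name in names:
        if running < target:
            train.append(name)
            running += chrom_lengths[name]
        else:
            test.append(name)
    return train, test


lengths = {f"chr{i}": 100 for i in range(1, 11)}
train, test = split_chromosomes(lengths)
# With ten equal-length chromosomes, 8 go to training and 2 to testing.
```

Because whole chromosomes are assigned, the realized ratio is only approximate when chromosome lengths differ.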
User-Specified Splitting: Use the --chr-split option to manually specify chromosome assignments.
Pseudogenes Filtering: Genes marked as pseudogene (either in the feature type or via the gene_biotype attribute) are excluded from the test dataset.
Paralogous Gene Removal: To avoid sequence similarity between training and testing sets (and thus data leakage), the toolkit uses mappy (a Python wrapper for minimap2) with the --asm20 argument (allowing a divergence threshold of 5%). Test sequences sharing over 80% similarity and 80% coverage with training sequences are removed.
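The thresholding step of paralog removal can be sketched independently of the aligner. In this illustration the alignment summaries are hypothetical tuples, identity and coverage are assumed to be fractions between 0 and 1, and "over 80%" is read as strictly greater than 0.8. How the toolkit derives these values from minimap2 alignments is not shown here.

```python
def filter_paralogs(test_ids, hits, min_identity=0.8, min_coverage=0.8):
    """Drop test sequences that align too well to any training sequence.

    hits: iterable of (test_id, identity, coverage) tuples, one per
    alignment of a test sequence against the training set (hypothetical
    layout; both values are assumed to be fractions in [0, 1]).
    """
    flagged = {
        test_id
        for test_id, identity, coverage in hits
        if identity > min_identity and coverage > min_coverage
    }
    return [test_id for test_id in test_ids if test_id not in flagged]


hits = [
    ("t1", 0.95, 0.90),  # exceeds both thresholds: removed
    ("t2", 0.85, 0.50),  # coverage too low: kept
    ("t3", 0.80, 0.99),  # identity not above 0.8: kept
]
kept = filter_paralogs(["t1", "t2", "t3", "t4"], hits)
# kept == ["t2", "t3", "t4"]
```

A sequence is removed only when a single alignment exceeds both thresholds, which is the reading of the rule assumed in this sketch.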
Canonical-only Mode: The --canonical-only flag restricts label generation to conserved splice site motifs. These include:
U2-snRNP motifs: GT-AG and GC-AG
U12-snRNP motifs: GT-AG and AT-AC
This option helps mitigate the effect of misannotated splice sites.
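The motif check can be expressed as a small predicate. This is a sketch rather than the toolkit's implementation: `is_canonical_intron` is a hypothetical name, and the intron is assumed to be given 5' to 3' on the transcribed strand, so that its first two bases are the donor dinucleotide and its last two the acceptor dinucleotide.

```python
# Donor/acceptor dinucleotide pairs listed above (GT-AG appears in both classes).
CANONICAL_PAIRS = {("GT", "AG"), ("GC", "AG"), ("AT", "AC")}


def is_canonical_intron(intron):
    """Return True if the intron starts and ends with a listed motif pair.

    Assumption: the intron sequence is on the transcribed strand, 5' to 3'.
    """
    intron = intron.upper().replace("U", "T")
    if len(intron) < 4:
        return False
    return (intron[:2], intron[-2:]) in CANONICAL_PAIRS


checks = [is_canonical_intron(s) for s in ("GTAAGTCCAG", "GCAAAG", "ATCCCCAC", "CTAAAG")]
# checks == [True, True, True, False]
```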
Conclusion#
The create-data subcommand provides a robust framework for transforming genomic and annotation data into machine-learning-ready datasets. Its careful handling of transcript selection, sequence chunking, one-hot encoding, and rigorous filtering of pseudogenes and paralogous sequences ensures high-quality training and testing sets for splice site prediction. For more details and troubleshooting, please refer to the OpenSpliceAI help command or the official documentation.