create-data#
Subcommand description#
This subcommand processes genomic sequences (FASTA) and genome annotations (GFF/GTF) files into Hierarchical Data Format version 5 (HDF5) files.
Input#
Reference genome \(G\) in FASTA format: the genomic sequences from which the gene sequences are extracted.
Reference annotation \(A\) in GFF3 format: the annotated gene features used to locate genes and their splice sites.
Output#
The output consists of the training and testing HDF5 files containing the processed gene sequences and their corresponding labels.
Processing Steps#
For each gene sequence in the genome, the following processing steps are performed:
Gene Sequence Processing:
Each gene sequence is transformed into a 3D tensor, denoted as \(X\).
The shape of \(X\) is \(⌈L/W⌉ \times (F + W) \times 4\), where:
\(L\) is the length of the gene sequence.
\(W\) is the chunking window size (5,000 nt by default in SpliceAI).
\(F\) is the flanking sequence size (80 nt, 400 nt, 2,000 nt, or 10,000 nt).
4 is the number of nucleotide channels (A, C, G, T).
Each gene sequence is padded with Ns along the sequence axis so that its length becomes a multiple of \(W\).
Label Generation:
Labels, denoted as \(Y\), are generated from the genome annotations.
The shape of \(Y\) is \(⌈L/W⌉ \times W \times 3\), where each site in the gene sequence is labeled as:
Donor site
Acceptor site
Non-splice site
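The two processing steps above can be sketched in plain Python. This is a minimal illustration, not the toolkit's implementation: the function name build_tensors, the all-zero encoding of N, the use of Ns for the F/2 flank on each side, the label channel order (non-splice, acceptor, donor), and the all-zero label for padded positions are all assumptions made for the example.

```python
import math

ONE_HOT = {
    "A": [1, 0, 0, 0],
    "C": [0, 1, 0, 0],
    "G": [0, 0, 1, 0],
    "T": [0, 0, 0, 1],
    "N": [0, 0, 0, 0],  # assumption: N carries no nucleotide signal
}

# Assumed channel order for Y: (non-splice, acceptor, donor).
NON_SPLICE, ACCEPTOR, DONOR = [1, 0, 0], [0, 1, 0], [0, 0, 1]
PADDING = [0, 0, 0]  # assumption: padded positions carry no label


def build_tensors(seq, donors, acceptors, window=5000, flank=80):
    """Return (X, Y) as nested lists.

    seq       -- gene sequence (string over A, C, G, T, N)
    donors    -- set of 0-based donor positions within seq
    acceptors -- set of 0-based acceptor positions within seq
    flank     -- total flanking size F (assumed even, F/2 per side)
    """
    n_chunks = math.ceil(len(seq) / window)
    padded_len = n_chunks * window

    # Pad with Ns up to a multiple of the window size, then add
    # flank/2 Ns on each side (assumed flanking scheme).
    half = flank // 2
    padded = "N" * half + seq.upper() + "N" * (padded_len - len(seq)) + "N" * half

    # One label per position of the padded (unflanked) sequence.
    labels = []
    for pos in range(padded_len):
        if pos >= len(seq):
            labels.append(PADDING)
        elif pos in donors:
            labels.append(DONOR)
        elif pos in acceptors:
            labels.append(ACCEPTOR)
        else:
            labels.append(NON_SPLICE)

    X, Y = [], []
    for i in range(n_chunks):
        start = i * window
        # Each X chunk spans one window plus its flanking context.
        chunk = padded[start:start + window + flank]
        X.append([ONE_HOT.get(base, ONE_HOT["N"]) for base in chunk])
        Y.append(labels[start:start + window])
    return X, Y


# Toy example: L = 10, W = 4, F = 2.
X, Y = build_tensors("ACGTACGTAC", donors={2}, acceptors={7}, window=4, flank=2)
print(len(X), len(X[0]), len(X[0][0]))  # 3 6 4
print(len(Y), len(Y[0]), len(Y[0][0]))  # 3 4 3
```

With the toy values, \(X\) has shape \(⌈10/4⌉ \times (2 + 4) \times 4\) and \(Y\) has shape \(⌈10/4⌉ \times 4 \times 3\), matching the formulas above.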
Example#
For a gene sequence of length 12,000 nt with a chunking window size of \(W = 5000\) and a 10k flanking sequence (\(F = 10000\)), the resulting 3D tensor \(X\) would have a shape of \(⌈12000/5000⌉ \times (10000 + 5000) \times 4 = 3 \times 15000 \times 4\).
The corresponding label tensor \(Y\) would have a shape of \(⌈12000/5000⌉ \times 5000 \times 3 = 3 \times 5000 \times 3\).
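The shape arithmetic can be checked with a few lines of Python. The helper name tensor_shapes is hypothetical and only evaluates the two shape formulas given in the Processing Steps section.

```python
import math


def tensor_shapes(gene_length, window=5000, flank=10000):
    """Return the shapes of X and Y for one gene sequence."""
    n_chunks = math.ceil(gene_length / window)  # ceil(L / W)
    x_shape = (n_chunks, flank + window, 4)     # ceil(L/W) x (F + W) x 4
    y_shape = (n_chunks, window, 3)             # ceil(L/W) x W x 3
    return x_shape, y_shape


x_shape, y_shape = tensor_shapes(12000, window=5000, flank=10000)
print(x_shape)  # (3, 15000, 4)
print(y_shape)  # (3, 5000, 3)
```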
Example of human MANE#
Input files#
To run this example, you will need to download the following two input files:
reference Genome \(G\) in FASTA : GCF_000001405.40_GRCh38.p14_genomic.fna
reference Annotation \(A\) in GFF3 : MANE.GRCh38.v1.3.refseq_genomic.gff
Commands#
Run the following spliceai-toolkit command to create the training and testing datasets:
spliceai-toolkit create-data \
--genome-fasta GCF_000001405.40_GRCh38.p14_genomic.fna \
--annotation-gff MANE.GRCh38.v1.3.refseq_genomic.gff \
--output-dir train_test_dataset_MANE/
After the create-data subcommand runs successfully, the output directory contains two main files for model training and testing, along with several intermediate files:
Output files#
dataset_train.h5: this is the main file for model training.
dataset_test.h5: this is the main file for model testing.
intermediate files:
datafile_train.h5
datafile_test.h5
stats.txt
Usage#
usage: spliceai-toolkit create-data [-h] --annotation-gff ANNOTATION_GFF --genome-fasta GENOME_FASTA --output-dir OUTPUT_DIR [--parse-type {maximum,all_isoforms}] [--biotype {protein-coding,non-coding}]
[--chr-split {train-test,test}]
optional arguments:
-h, --help show this help message and exit
--annotation-gff ANNOTATION_GFF
Path to the GFF file
--genome-fasta GENOME_FASTA
Path to the FASTA file
--output-dir OUTPUT_DIR
Output directory to save the data
--parse-type {maximum,all_isoforms}
Type of transcript processing
--biotype {protein-coding,non-coding}
Biotype of transcript processing
--chr-split {train-test,test}
The chromosome splitting approach for training and testing
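When the subcommand is driven from a script, it can help to assemble and validate the argument list before launching it. The sketch below is one way to do that; the helper build_create_data_command is hypothetical and not part of the toolkit, while the flags and their allowed values are copied from the usage message above.

```python
import shlex

# Allowed values, copied from the usage message.
CHOICES = {
    "parse_type": ("maximum", "all_isoforms"),
    "biotype": ("protein-coding", "non-coding"),
    "chr_split": ("train-test", "test"),
}


def build_create_data_command(genome_fasta, annotation_gff, output_dir,
                              parse_type=None, biotype=None, chr_split=None):
    """Return the create-data invocation as an argument list."""
    cmd = [
        "spliceai-toolkit", "create-data",
        "--genome-fasta", genome_fasta,
        "--annotation-gff", annotation_gff,
        "--output-dir", output_dir,
    ]
    optional = {"parse_type": parse_type, "biotype": biotype, "chr_split": chr_split}
    for name, value in optional.items():
        if value is None:
            continue  # leave the flag out so the tool uses its own default
        if value not in CHOICES[name]:
            raise ValueError(f"{name} must be one of {CHOICES[name]}, got {value!r}")
        cmd += ["--" + name.replace("_", "-"), value]
    return cmd


cmd = build_create_data_command(
    "GCF_000001405.40_GRCh38.p14_genomic.fna",
    "MANE.GRCh38.v1.3.refseq_genomic.gff",
    "train_test_dataset_MANE/",
    chr_split="train-test",
)
print(shlex.join(cmd))  # shell-quoted form of the command
# To execute it: subprocess.run(cmd, check=True)
```

Passing the list to subprocess.run instead of a single shell string avoids quoting problems when file paths contain spaces.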

