`create-data`#

Subcommand description#

This subcommand processes a genome sequence file (FASTA) and a genome annotation file (GFF/GTF) into Hierarchical Data Format version 5 (HDF5) files.

Input#

- Reference genome \(G\) in FASTA format: the genome sequence file.
- Reference annotation \(A\) in GFF3 format: the file containing the annotated gene features.

Output#

The output consists of the training and testing HDF5 files containing the processed gene sequences and their corresponding labels.

Processing Steps#

For each gene sequence in the genome, the following processing steps are performed:

Gene Sequence Processing:
- Each gene sequence is transformed into a 3D tensor, denoted as \(X\).
- The shape of \(X\) is \(⌈L/W⌉ \times (F + W) \times 4\), where:
  - \(L\) is the length of the gene sequence.
  - \(W\) is the chunking window size (default: 5,000 nt, as in SpliceAI).
  - \(F\) is the flanking sequence size (80, 400, 2,000, or 10,000 nt).
  - 4 is the number of nucleotide channels (A, C, G, T).
- The gene sequence is padded with Ns so that its length is a multiple of \(W\).
Label Generation:
- Labels, denoted as \(Y\), are generated from the genome annotations.
- The shape of \(Y\) is \(⌈L/W⌉ \times W \times 3\), where each site in the gene sequence is labeled as:
  - Donor site
  - Acceptor site
  - Non-splice site
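
The two steps above can be illustrated with a minimal NumPy sketch. It is not the toolkit's implementation: the function names are hypothetical, N is assumed to encode as an all-zero vector, the flanks are filled with Ns (half of \(F\) on each side) instead of real genomic context, and the label column order (non-splice, acceptor, donor) is an assumption.

```python
import numpy as np

# Column index of each nucleotide in the one-hot encoding.
# Assumption: N (or any other symbol) is encoded as an all-zero row.
BASE_INDEX = {"A": 0, "C": 1, "G": 2, "T": 3}


def one_hot_sequence(seq):
    """Return a (len(seq), 4) one-hot array for a nucleotide string."""
    encoded = np.zeros((len(seq), 4), dtype=np.int8)
    for pos, base in enumerate(seq.upper()):
        idx = BASE_INDEX.get(base)
        if idx is not None:
            encoded[pos, idx] = 1
    return encoded


def make_tensors(gene_seq, donors, acceptors, window, flank):
    """Build X of shape (ceil(L/W), F + W, 4) and Y of shape (ceil(L/W), W, 3).

    donors and acceptors are 0-based positions within gene_seq.
    """
    length = len(gene_seq)
    n_chunks = -(-length // window)  # ceiling division
    pad = n_chunks * window - length
    half = flank // 2

    # Simplification for this sketch: flanks are Ns, not real genomic context.
    padded = "N" * half + gene_seq + "N" * (pad + half)
    x_full = one_hot_sequence(padded)

    # Assumed label columns: 0 = non-splice, 1 = acceptor, 2 = donor.
    # Padding positions beyond the gene keep an all-zero label.
    y_full = np.zeros((n_chunks * window, 3), dtype=np.int8)
    y_full[:length, 0] = 1
    for pos in acceptors:
        y_full[pos] = (0, 1, 0)
    for pos in donors:
        y_full[pos] = (0, 0, 1)

    # Each chunk covers W labeled positions plus F positions of context.
    x = np.stack(
        [x_full[i * window : i * window + window + flank] for i in range(n_chunks)]
    )
    y = y_full.reshape(n_chunks, window, 3)
    return x, y
```

For a toy 10 nt sequence with `window=4` and `flank=2`, this yields `X` of shape `(3, 6, 4)` and `Y` of shape `(3, 4, 3)`.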

Example#

For a gene sequence of length 12,000 with a chunking window size (\(W\)) of 5000 and 10k flanking sequence (\(F=10,000\)), the resulting 3D tensor \(X\) would have a shape of \(⌈12000/5000⌉ \times (10000 + 5000) \times 4 = 3 \times 15000 \times 4\).

The corresponding label tensor \(Y\) would have a shape of \(⌈12000/5000⌉ \times 5000 \times 3 = 3 \times 5000 \times 3\).
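
The arithmetic in this example can be checked with a few lines of Python. The helper name `tensor_shapes` is hypothetical and only restates the shape formulas given in the processing steps.

```python
import math


def tensor_shapes(gene_length, window, flank):
    """Return the (X, Y) shapes for a gene of the given length."""
    n_chunks = math.ceil(gene_length / window)  # number of windows, ceil(L / W)
    x_shape = (n_chunks, flank + window, 4)     # one-hot sequence with context
    y_shape = (n_chunks, window, 3)             # one label vector per site
    return x_shape, y_shape


x_shape, y_shape = tensor_shapes(12000, window=5000, flank=10000)
print(x_shape)  # (3, 15000, 4)
print(y_shape)  # (3, 5000, 3)
```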

Example of human MANE#

Input files#

To run this example, you will need to download the following two input files:

- Reference genome \(G\) in FASTA format: GCF_000001405.40_GRCh38.p14_genomic.fna
- Reference annotation \(A\) in GFF3 format: MANE.GRCh38.v1.3.refseq_genomic.gff

Commands#

The following `spliceai-toolkit` command creates the training and testing datasets:

spliceai-toolkit create-data \
--genome-fasta GCF_000001405.40_GRCh38.p14_genomic.fna \
--annotation-gff MANE.GRCh38.v1.3.refseq_genomic.gff \
--output-dir train_test_dataset_MANE/

After the `create-data` subcommand finishes successfully, it produces two main files for model training and testing, plus several intermediate files:

Output files#

dataset_train.h5: this is the main file for model training.
dataset_test.h5: this is the main file for model testing.
intermediate files:
- datafile_train.h5
- datafile_test.h5
- stats.txt

Usage#

usage: spliceai-toolkit create-data [-h] --annotation-gff ANNOTATION_GFF --genome-fasta GENOME_FASTA --output-dir OUTPUT_DIR [--parse-type {maximum,all_isoforms}] [--biotype {protein-coding,non-coding}]
                                    [--chr-split {train-test,test}]

optional arguments:
-h, --help            show this help message and exit
--annotation-gff ANNOTATION_GFF
                        Path to the GFF file
--genome-fasta GENOME_FASTA
                        Path to the FASTA file
--output-dir OUTPUT_DIR
                        Output directory to save the data
--parse-type {maximum,all_isoforms}
                        Type of transcript processing
--biotype {protein-coding,non-coding}
                        Biotype of transcript processing
--chr-split {train-test,test}
                        The chromosome splitting approach for training and testing
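
When the subcommand is driven from a script, it can help to validate the option values before launching it. The sketch below assembles the argument list from the flags and choices shown in the usage text above. The helper `build_create_data_command` is hypothetical and not part of the toolkit, and no default values are assumed for the optional flags: they are simply omitted when not set.

```python
import shlex

# Allowed values, copied from the usage text.
PARSE_TYPES = ("maximum", "all_isoforms")
BIOTYPES = ("protein-coding", "non-coding")
CHR_SPLITS = ("train-test", "test")


def build_create_data_command(genome_fasta, annotation_gff, output_dir,
                              parse_type=None, biotype=None, chr_split=None):
    """Return the create-data invocation as a list of arguments."""
    cmd = [
        "spliceai-toolkit", "create-data",
        "--genome-fasta", genome_fasta,
        "--annotation-gff", annotation_gff,
        "--output-dir", output_dir,
    ]
    optional = (
        ("--parse-type", parse_type, PARSE_TYPES),
        ("--biotype", biotype, BIOTYPES),
        ("--chr-split", chr_split, CHR_SPLITS),
    )
    for flag, value, choices in optional:
        if value is None:
            continue  # leave the flag out so the tool applies its own default
        if value not in choices:
            raise ValueError(f"{flag} must be one of {choices}, got {value!r}")
        cmd += [flag, value]
    return cmd


cmd = build_create_data_command(
    "GCF_000001405.40_GRCh38.p14_genomic.fna",
    "MANE.GRCh38.v1.3.refseq_genomic.gff",
    "train_test_dataset_MANE/",
    biotype="protein-coding",
)
print(shlex.join(cmd))  # shell-quoted form of the command
```

The returned list can be passed directly to `subprocess.run(cmd, check=True)`, which avoids shell quoting issues with file paths.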