Splam refers to two things: (1) the deep grouped residual CNN model that we designed to accurately predict splice junctions based solely on an input DNA sequence, and (2) this software, which can clean up alignment files and evaluate annotation files.
Q: Why do we need Splam?
We are concerned that splice junction predictors are commonly trained on splice junctions from canonical transcripts only. Designing a splice site recognition method based on only one isoform per gene may mislabel alternative splice sites even when they are perfectly valid. Therefore,
we designed a biologically realistic model. Splam was trained on combined donor and acceptor pairs, with a focus on a narrow window of 400 base pairs surrounding each splice site. This approach is inspired by the understanding that the splicing process primarily relies on signals within this specific region.
There are two main applications of Splam:
When inspecting an alignment file in IGV, it becomes apparent that some reads are spliced and aligned across different gene loci or intergenic regions. This raises the question, "Are these spliced alignments correct?" Therefore,
we need a trustworthy way to evaluate all the spliced alignments in the alignment file. Splam learns splice junction patterns, and we have demonstrated that applying Splam to remove spurious spliced alignments improves transcript assembly! See the alignment evaluation section.
Additionally, we acknowledge that annotation files are not perfect, and assembled transcripts contain even more errors. The current approach to assessing assembled transcripts is to compare them with the annotation. Therefore,
we can use Splam to score all the introns in a transcript and provide a reference-free evaluation. See the annotation evaluation section.
Q: What makes Splam different from SpliceAI?
Splam and SpliceAI are both frameworks used for predicting splice junctions in DNA sequences, but they have some key differences.
Input constraints:
Splam: Follows the design principle of using biologically realistic input constraints. It uses a window limited to 200 base pairs on each side of the donor and acceptor sites, totaling 800 base pairs. Furthermore, each donor is paired with its corresponding acceptor, and their flanking windows are combined into a single input (see the sketch after this comparison).
SpliceAI: The previous state-of-the-art CNN-based system, SpliceAI, relies on a window of 10,000 base pairs flanking each splice site to obtain maximal accuracy. However, this window size is much larger than what the splicing machinery in cells can recognize.
Training data
Splam: Was trained using a high-quality dataset of human donor and acceptor sites. Check out the data curation section.
SpliceAI: Was trained with canonical transcripts only, and does not consider alternative splicing.
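As a concrete illustration of the input constraint above, here is a minimal sketch of how an 800-bp input could be assembled from 200 bp on each side of the donor and acceptor sites. The genome mapping, coordinate conventions, and function name are hypothetical and are not Splam's actual preprocessing code.

```python
# Hypothetical sketch: build an 800-bp Splam-style input from a donor/acceptor pair.
FLANK = 200  # 200 bp on each side of a splice site

def build_input(genome: dict, chrom: str, donor: int, acceptor: int) -> str:
    """Concatenate 400 bp around the donor with 400 bp around the acceptor."""
    seq = genome[chrom]
    donor_window = seq[donor - FLANK : donor + FLANK]           # 400 bp around the donor
    acceptor_window = seq[acceptor - FLANK : acceptor + FLANK]  # 400 bp around the acceptor
    return donor_window + acceptor_window                       # 800 bp total
```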
Q: For Splam model design, why do we use five residual groups?
In the design of the Splam model, the inclusion of five residual groups is the result of experiments to determine the optimal architecture for splice site identification. This choice was informed by an ablation study that varied the number of residual groups in the model, aiming to balance complexity with performance.
Our experiments demonstrated that each additional residual group, up to the fifth, improved the model's performance metrics, including top-k accuracy and the Area Under the Precision-Recall Curve (AUPRC).
Thus, the architecture featuring five residual groups was selected for the final Splam model design, providing the most accurate splice site prediction.
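For readers who want a mental model of what a residual group looks like, below is an illustrative PyTorch sketch. The channel widths, kernel sizes, and dilation pattern are assumptions for illustration only and do not reproduce Splam's exact hyperparameters.

```python
import torch.nn as nn

class ResidualGroup(nn.Module):
    """Illustrative residual group: two dilated 1-D convolutions with a skip connection."""
    def __init__(self, channels: int, kernel_size: int = 11, dilation: int = 1):
        super().__init__()
        padding = dilation * (kernel_size - 1) // 2  # keep sequence length unchanged
        self.block = nn.Sequential(
            nn.BatchNorm1d(channels),
            nn.ReLU(),
            nn.Conv1d(channels, channels, kernel_size, padding=padding, dilation=dilation),
            nn.BatchNorm1d(channels),
            nn.ReLU(),
            nn.Conv1d(channels, channels, kernel_size, padding=padding, dilation=dilation),
        )

    def forward(self, x):
        return x + self.block(x)  # residual (skip) connection

# Five residual groups stacked, echoing the ablation result described above
# (hypothetical widths/dilations).
trunk = nn.Sequential(*[ResidualGroup(64, dilation=2 ** i) for i in range(5)])
```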
Q: What is the difference between the two released models, splam.pt and splam_script.pt?
You may have noticed that we have two released Splam models: "splam.pt" and "splam_script.pt".
splam.pt [link] is the original model that requires the original model script to load and run.
splam_script.pt [link] is the TorchScripted Splam model. TorchScript serializes and optimizes PyTorch code for improved performance and deployment. Essentially, it allows you to convert PyTorch code into a more efficient intermediate representation, which can be used for Just-In-Time (JIT) compilation and deployment without the need for the Python interpreter.
Important
In sum, we strongly recommend that all users use splam_script.pt. It provides a faster, more portable, and more secure way of deploying the model.
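As a quick illustration, a TorchScript model can be loaded with torch.jit.load without any of the original model code. The input shape and one-hot encoding below are assumptions for illustration only; consult the usage documentation for the exact input format.

```python
import torch

# Load the TorchScript-serialized model; no model-class definition is required.
model = torch.jit.load("splam_script.pt", map_location="cpu")
model.eval()

# Hypothetical input: one one-hot encoded 800-nt sequence with 4 channels (A, C, G, T).
x = torch.zeros(1, 4, 800)
with torch.no_grad():
    scores = model(x)
print(scores.shape)  # expected to correspond to the (3 x 800) output described below
```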
Q: Which mode should I run Splam in: cpu, cuda, or mps?
By default, Splam automatically detects your environment and runs in cuda mode if CUDA is available. However, if your computer is running macOS, Splam will check whether mps mode is available. If neither cuda nor mps is available, Splam will run in cpu mode. You can explicitly specify the mode using the -d/--device argument.
Important
In sum,
If you are using an Apple Silicon Mac, you should run Splam in mps mode.
If you are using Linux with CUDA installed, you should run Splam in cuda mode.
If neither of the above applies, you can still run Splam in cpu mode.
You can check out the PyTorch website for more details about the device parameter.
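The selection order above corresponds to the following minimal sketch (not Splam's actual command-line handling):

```python
import torch

# Pick a device following the order described above: cuda, then mps, then cpu.
if torch.cuda.is_available():
    device = torch.device("cuda")
elif torch.backends.mps.is_available():
    device = torch.device("mps")
else:
    device = torch.device("cpu")

print(f"Running Splam on: {device}")
```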
Q: How do I interpret Splam scores?
Given an input of length 800 nt, Splam outputs a tensor with dimensions (3 x 800). The first channel contains the "acceptor scores", the second channel the "donor scores", and the third channel the "non-splice-site scores". Each score is between 0 and 1 and represents Splam's confidence that a given position is a splice site; a score closer to 1 indicates higher confidence in the classification.
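For example, assuming scores holds the (3 x 800) output for a single input, the per-channel scores can be read off as follows; the placeholder tensor here is random and only for illustration.

```python
import torch

scores = torch.rand(3, 800)      # placeholder standing in for a real Splam output
acceptor_scores = scores[0]      # per-position acceptor confidence, values in [0, 1]
donor_scores = scores[1]         # per-position donor confidence, values in [0, 1]
non_splice_scores = scores[2]    # per-position non-splice-site confidence

# Position with the highest donor score and its confidence:
donor_pos = torch.argmax(donor_scores).item()
print(donor_pos, donor_scores[donor_pos].item())
```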