Annotation file / assembled transcripts evaluation (GFF
)#
You have RNA-Seq reads and are running an RNA-Seq analysis pipeline. You have performed quality control, aligned the reads to the genome, and assembled the alignments into transcripts. Now, you have obtained a GFF
or GTF
file generated by a transcriptome assembler. The next question you want to address is the quality of these assembled transcripts.
A common approach in the pipeline involves using gffcompare
to compare the assembled transcripts with a reference annotation file, such as RefSeq, Gencode, or CHESS. However, this method assumes the correctness of the annotation file and evaluates how closely our assembled transcripts align with the annotations.
It is important to acknowledge that annotation files can contain misannotations and noise. Even the most meticulously curated human annotations are not exempt from errors. Moreover, there are even more errors in other species' annotation files.
Important
We provide a new approach to evaluate transcripts by counting how many bad splice junctions are in a transcript!
You can run Splam on (1) annotation files or (2) assembled transcripts. Splam outputs the score of every donor and acceptor site. By using these scores,
You can get an overview of the quality of all introns (splice junctions) inside your annotation file.
You can assess each transcript by checking how many bad splice junctions a transcript has, which can be a good filtering criteria for assembled transcripts.
There are more potential usages to be explored!
Step 1: Preparing your input files#
The first step is to prepare three files for Splam analysis. In this tutorial, we will be using a toy dataset.
Input files
An annotation file in
GFF
orGTF
format [example file: refseq_110_GRCh38_chr_fixed.gff].A reference genome in
FASTA
format [example file: chr9_subset.fa].The Splam model, which you can find it here: splam.pt
Step 2: Extracting introns in your annotation file#
In this step, you take an annotation file (1) and run
splam extract refseq_40_GRCh38.p14_chr_fixed.gff -o tmp_out_annotation
Splam uses gffutils
to extract introns from all transcripts. It starts by creating an sqlite3 database and then iterates through all exons in the GFF
file and writes intron coordinates into a BED
file.
By default, the BED
is written into tmp_out/junction.bed
. The BED
file consists of six columns: CHROM
, START
, END
, JUNC_NAME
, INTRON_NUM
, and STRAND
. Here are a few entries from the BED
file:
Output: junction.bed
1chr9 4849549 4860125 JUNC00000007 3 +
2chr9 5923308 5924658 JUNC00000008 6 -
3chr9 5924844 5929044 JUNC00000009 8 -
Here are some optional arguments:
Step 3: Scoring extracted introns#
In this step, the goal is to score all the extracted splice junctions. To accomplish this, you will need 3 essential files. (1) The BED file that was generated in Step 2, (2) the reference genome (2) which shares coordinates with the junction BED file, and (3) the Splam model (3). Once you have these files in place, you can run the following command:
splam score -G chr9_subset.fa -m ../model/splam_script.pt -o tmp_out_annotation tmp_out_annotation/junction.bed
In this step, a new BED
file is produced, featuring eight columns. Two extra columns, namely DONOR_SCORE
and ACCEPTOR_SCORE
, are appended to the file. It is worth noting that any unstranded introns are excluded from the output. (P.S. They might be from unstranded transcripts assembled by StringTie).
Output: junction_score.bed
1chr9 4849549 4860125 JUNC00000007 3 + 0.7723698 0.5370769
2chr9 5923308 5924658 JUNC00000008 6 - 0.9999831 0.9999958
3chr9 5924844 5929044 JUNC00000009 8 - 0.9999883 0.9999949
Here are the required arguments:
Here are some optional arguments:
Step 4: Evaluating isoforms by Splam scores#
To summarize the quality of each isoform, users can run Splam to count how many spurious splice junctions are in each transcript.
splam clean -o tmp_out_annotation -t 0.8
Output: cleaned.gff
The output file of this step is a sorted Splam-cleaned GFF file. You can replace the original GFF file with the cleaned GFF file. Also, reference the report.tsv
for a list of transcripts, and their ratio and number of spurious/removed junctions.
Splam score threshold suggestion
For evaluating the accuracy of GFF annotation files, we advise using a stricter score threshold of 0.8.
Here are some optional arguments:
What's next?#
Congratulations! You have finished this tutorial.
See also
Behind the scenes to understand how Splam is designed and trained
Q & A ... to check out some common questions