Annotation file / assembled transcripts evaluation (GFF)#

You have RNA-Seq reads and are running an RNA-Seq analysis pipeline. You have performed quality control, aligned the reads to the genome, and assembled the alignments into transcripts. Now, you have obtained a GFF or GTF file generated by a transcriptome assembler. The next question you want to address is the quality of these assembled transcripts.

A common approach in the pipeline involves using gffcompare to compare the assembled transcripts with a reference annotation file, such as RefSeq, Gencode, or CHESS. However, this method assumes the correctness of the annotation file and evaluates how closely our assembled transcripts align with the annotations.

It is important to acknowledge that annotation files can contain misannotations and noise. Even the most meticulously curated human annotations are not exempt from errors. Moreover, there are even more errors in other species' annotation files.

Important

We provide a new approach to evaluate transcripts by counting how many bad splice junctions are in a transcript!

You can run Splam on (1) annotation files or (2) assembled transcripts. Splam outputs the score of every donor and acceptor site. By using these scores,

  1. You can get an overview of the quality of all introns (splice junctions) inside your annotation file.

  2. You can assess each transcript by checking how many bad splice junctions a transcript has, which can be a good filtering criteria for assembled transcripts.

  3. There are more potential usages to be explored!


Step 1: Preparing your input files#

The first step is to prepare three files for Splam analysis. In this tutorial, we will be using a toy dataset.

Input files

  1. An annotation file in GFF or GTF format [example file: refseq_110_GRCh38_chr_fixed.gff].

  2. A reference genome in FASTA format [example file: chr9_subset.fa].

  3. The Splam model, which you can find it here: splam.pt


Step 2: Extracting introns in your annotation file#

In this step, you take an annotation file (1) and run

splam extract refseq_40_GRCh38.p14_chr_fixed.gff -o tmp_out_annotation

Splam uses gffutils to extract introns from all transcripts. It starts by creating an sqlite3 database and then iterates through all exons in the GFF file and writes intron coordinates into a BED file.

By default, the BED is written into tmp_out/junction.bed. The BED file consists of six columns: CHROM, START, END, JUNC_NAME, INTRON_NUM, and STRAND. Here are a few entries from the BED file:

Output: junction.bed

1chr9    4849549 4860125 JUNC00000007    3       +
2chr9    5923308 5924658 JUNC00000008    6       -
3chr9    5924844 5929044 JUNC00000009    8       -

Here are some optional arguments:


Step 3: Scoring extracted introns#

In this step, the goal is to score all the extracted splice junctions. To accomplish this, you will need 3 essential files. (1) The BED file that was generated in Step 2, (2) the reference genome (2) which shares coordinates with the junction BED file, and (3) the Splam model (3). Once you have these files in place, you can run the following command:

splam score -G chr9_subset.fa -m ../model/splam_script.pt -o tmp_out_annotation tmp_out_annotation/junction.bed

In this step, a new BED file is produced, featuring eight columns. Two extra columns, namely DONOR_SCORE and ACCEPTOR_SCORE, are appended to the file. It is worth noting that any unstranded introns are excluded from the output. (P.S. They might be from unstranded transcripts assembled by StringTie).

Output: junction_score.bed

1chr9    4849549 4860125 JUNC00000007    3       +       0.7723698       0.5370769
2chr9    5923308 5924658 JUNC00000008    6       -       0.9999831       0.9999958
3chr9    5924844 5929044 JUNC00000009    8       -       0.9999883       0.9999949

Here are the required arguments:

Here are some optional arguments:


Step 4: Evaluating isoforms by Splam scores#

To summarize the quality of each isoform, users can run Splam to count how many spurious splice junctions are in each transcript.

splam clean -o tmp_out_annotation -t 0.8

Output: cleaned.gff

The output file of this step is a sorted Splam-cleaned GFF file. You can replace the original GFF file with the cleaned GFF file. Also, reference the report.tsv for a list of transcripts, and their ratio and number of spurious/removed junctions.

Splam score threshold suggestion

For evaluating the accuracy of GFF annotation files, we advise using a stricter score threshold of 0.8.

Here are some optional arguments:


What's next?#

Congratulations! You have finished this tutorial.

See also