Mouse to Rat#
Mus musculus to Rattus norvegicus
Input files#
To run this example, you will need to download the following three input files:
- Input
target Genome \(T\) in FASTA : mRatBN7.2_genomic.fna
reference Genome \(R\) in FASTA : GRCm39_genomic.fna
reference Annotation \(R_A\) in GFF3 : GRCm39_genomic.gff
There is only one command you need to run LiftOn:
lifton -g GRCm39_genomic.gff -o lifton.gff3 -copies mRatBN7.2_genomic.fna GRCm39_genomic.fna
After successfully running LiftOn, you will get the following file and output directory:
- Output:
LiftOn annotation file in GFF3: ftp://ftp.ccb.jhu.edu/pub/data/LiftOn/mouse_to_rat/lifton.gff3
LiftOn output directory: ftp://ftp.ccb.jhu.edu/pub/data/LiftOn/mouse_to_rat/lifton_output/
Results#
Genome annotation evaluation#
Here are some visualization results comparing LiftOn annotation to (1) Liftoff and (2) miniprot annotation.
First, we calculate the protein sequence identity score for every protein-coding transcript (check Evaluation metrics - sequence identity section) for three annotations, LiftOn, Liftoff, and miniprot.
Figure 51 compares the protein-coding gene mapping of Liftoff, based on DNA alignment, with miniprot, utilizing protein-to-DNA alignment. Dots in the lower right signify transcripts where Liftoff outperformed miniprot in protein sequence identity, while the upper left indicates transcripts where miniprot excelled. LiftOn employs the PM algorithm to enhance annotations in both, achieving improved protein-coding gene annotation, as neither approach dominates the other.
Next, we individually assess LiftOn in comparison to Liftoff and miniprot. In the comparison of LiftOn versus Liftoff (Figure 52, left), 28865 transcripts demonstrate higher protein sequence identity, with 327 achieving 100% identity. Similarly, in the LiftOn versus miniprot comparison (Figure 52, right), 34305 protein-coding transcripts exhibit superior matches, elevating 855 to identical status relative to the reference.
We visualize the transcripts in a 3-D plot, incorporating LiftOn, Liftoff, and miniprot scores (see Figure Figure 53) to provide a comprehensive comparison of the three tools. If a dot is above the \(x=y\) plane, it indicates that the protein-coding transcript annotation of LiftOn generates a longer valid protein sequence aligning to the full-length reference protein. The 3-D plot reveals that the majority of dots are above the \(x=y\) plane, suggesting that LiftOn annotation is better.
Next, we check the distribution of protein sequence identities (see Figure 54). Among the three tools, LiftOn (middle) exhibits the smallest left tail, with 2635 protein-coding transcripts having a protein sequence identity of \(< 0.4\).
Finding extra copies of lift-over features#
LiftOn also has a module to find extra copies by using intervaltree, Liftoff, and miniprot. The Circos plot in Figure 55 shows their relative positions between the two genomes. The plot illustrates that the extra copies were predominantly located on the same chromosomes in both GRCm39 and mRatBN7.2. The frequency plot of extra copy features are show in Figure 56.
Finally, we examined the order of protein-coding genes (Figure 57) between the two genomes and observed that, as expected, nearly all genes occur in the same order and orientation in both human genomes.
Examples of LiftOn's output#
To demonstrate LiftOn's improvement in comparison to Liftoff and miniprot visually, we extracted excerpts from IGV which contain the annotation tracks of all three methods, side by side (Figure 58).
What's next?#
Congratulations! You have finished this tutorial.
See also
Behind the scenes to understand how LiftOn is designed
FAQ ... to check out some common questions