Rather than annotating every novel genome assembly from scratch, a more efficient strategy is to transfer gene annotations from a well-annotated genome of the same or a closely related species.
Liftoff, being entirely DNA-based, uses minimap2 to align the DNA sequences of gene loci to the target genome and converts the gene coordinates to the new assembly. However, when a newly assembled genome deviates significantly from the reference DNA sequence, the alignment may produce transcripts with incorrect protein-coding sequences or erroneous splice sites, which makes annotation challenging, particularly for more distantly related species.
miniprot, on the other hand, is exclusively protein-based. This approach has several limitations: (1) it cannot capture untranslated regions (UTRs); (2) it may miss small exons in cases of long introns; (3) it is susceptible to aligning proteins to pseudogenes because it disregards intronic sequences; (4) it may combine coding sequences (CDSs) from distinct genes when they are arranged in tandem along a genome; and (5) it applies only to protein-coding transcripts, excluding non-coding genes and other features.
To overcome these limitations, we created LiftOn, which combines the advantages of both DNA- and protein-based approaches and applies a two-step protein-maximization (PM) algorithm leading to enhanced protein-coding gene annotation.
Q: How much does LiftOn improve over Liftoff and miniprot?
Here is one example: the improvement in a human annotation lift-over from GRCh38 to T2T-CHM13.
Each dot represents a protein-coding transcript. A dot above the x=y line indicates that the LiftOn annotation has a higher protein sequence identity score, that is, it yields a longer protein sequence that aligns with the corresponding protein in the reference annotation.
In the LiftOn versus Liftoff comparison (Figure above, left), 2,075 transcripts exhibit higher protein sequence identity, with 460 achieving 100% identity. Similarly, the LiftOn versus miniprot comparison (Figure above, right) shows better matches for 30,276 protein-coding transcripts, 22,616 of which become identical to the reference.
In summary, LiftOn corrects thousands of protein-coding transcripts during human lift-over. The improvement is even greater for more distantly related species!
Check out the Same species lift-over section, Closely related species lift-over section, and Distantly related species lift-over section for more details.
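To make the scatter-plot reading above concrete, here is a minimal sketch of how the two reported counts (transcripts above the x=y line, and those that reach 100% identity) could be tallied. The function name `summarize_improvement` and the toy scores are hypothetical illustrations; they are not LiftOn's code or real results.

```python
from typing import Iterable, Tuple


def summarize_improvement(pairs: Iterable[Tuple[float, float]]) -> Tuple[int, int]:
    """Count improved transcripts in a baseline-vs-LiftOn comparison.

    Each pair is (baseline_identity, lifton_identity), both in [0, 1],
    where the baseline is Liftoff or miniprot. This layout is an
    assumption made for illustration only.
    """
    improved = 0        # dots above the x=y line
    now_identical = 0   # improved dots that reach 100% identity
    for baseline, lifton in pairs:
        if lifton > baseline:
            improved += 1
            if lifton == 1.0:
                now_identical += 1
    return improved, now_identical


# Made-up identity scores for four transcripts (not real data).
toy_scores = [(0.90, 1.00), (1.00, 1.00), (0.80, 0.95), (0.99, 0.97)]
improved, now_identical = summarize_improvement(toy_scores)
print(f"{improved} improved, {now_identical} now identical to the reference")
```

Transcripts on or below the diagonal are left out of both counts, which matches how the figure is described: only dots above the x=y line count as improvements.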
Q: Can you explain the new protein-maximization (PM) algorithm in LiftOn?
Check out the Protein-maximization algorithm section.
Q: How do you evaluate the lift-over annotation?
Check out the DNA & protein transcript sequence identity score calculation section.
Q: How does LiftOn report mutated genes?
Similar to LiftoffTools, LiftOn compares reference and target transcripts and generates a mutation report for mapped protein-coding transcripts.
Transcripts are considered "identical" if their target and reference gene DNA sequences match entirely. For mutated sequences, LiftOn categorizes changes as "synonymous", "non-synonymous", "in-frame insertion", "in-frame deletion", "frameshift", "stop codon gain", "stop codon loss", and "start codon loss".