Human (GRCh38) to Chimpanzee#

Input files#

To run this example, you will need to download the following three input files:

Input
1. Target genome \(T\) in FASTA: NHGRI_mPanTro3-v1.1.fna
2. Reference genome \(R\) in FASTA: GCF_000001405.40_GRCh38.p14_genomic.fna
3. Reference annotation \(R_A\) in GFF3: NCBI_RefSeq_no_rRNA.gff

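You only need one command to run LiftOn. The file names in the command differ from the download names listed above, so substitute the paths to your own copies of the reference annotation, target genome, and reference genome: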

lifton -g GRCh38.p14_refseq_genomic.gff -o lifton.gff3 -copies NHGRI_mPanTro3-v1.1.fna GRCh38.p14_refseq_genomic.fna

After successfully running LiftOn, you will get the following file and output directory:

Output:
1. LiftOn annotation file in GFF3: ftp://ftp.ccb.jhu.edu/pub/data/LiftOn/human_to_chimp/lifton.gff3
2. LiftOn output directory: ftp://ftp.ccb.jhu.edu/pub/data/LiftOn/human_to_chimp/lifton_output/
  - score.txt
3. We also convert the LiftOn annotation GFF3 file into BigBed format, allowing users to load the annotation into the UCSC Genome Browser: ftp://ftp.ccb.jhu.edu/pub/data/LiftOn/human_to_chimp/UCSC_genome_browser/.
  - UCSC Track Hub quick start guide: UCSC Track Hub
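
As a quick sanity check on the output annotation, the feature types in a GFF3 file can be tallied from its third column. The sketch below is not part of LiftOn; it relies only on the standard nine-column, tab-separated GFF3 layout, and the example records (including the `LiftOn` source label and the IDs) are made up for illustration.

```python
from collections import Counter
from typing import Iterable


def count_feature_types(gff3_lines: Iterable[str]) -> Counter:
    """Count GFF3 records by feature type (column 3)."""
    counts: Counter = Counter()
    for line in gff3_lines:
        line = line.rstrip("\n")
        # Skip blank lines and comment/directive lines.
        if not line or line.startswith("#"):
            continue
        fields = line.split("\t")
        if len(fields) != 9:
            continue  # not a well-formed GFF3 record
        counts[fields[2]] += 1
    return counts


# Made-up records for illustration only.
example = [
    "##gff-version 3",
    "chr1\tLiftOn\tgene\t100\t900\t.\t+\t.\tID=gene1",
    "chr1\tLiftOn\tmRNA\t100\t900\t.\t+\t.\tID=rna1;Parent=gene1",
    "chr1\tLiftOn\tCDS\t150\t400\t.\t+\t0\tID=cds1;Parent=rna1",
    "chr1\tLiftOn\tCDS\t600\t800\t.\t+\t0\tID=cds1;Parent=rna1",
]
feature_counts = count_feature_types(example)
print(feature_counts)
```

For a real run, the same function accepts an open file handle, for example `count_feature_types(open("lifton.gff3"))`.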

Results#

Genome annotation evaluation#

The following figures compare the LiftOn annotation with the (1) Liftoff and (2) miniprot annotations.

First, we calculate the protein sequence identity score for every protein-coding transcript (see the Evaluation metrics - sequence identity section) in each of the three annotations: LiftOn, Liftoff, and miniprot.
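
To make the metric concrete, here is a minimal sketch of one common way to compute a protein sequence identity: globally align the annotated protein to the reference protein and divide the number of matching residues by the alignment length. This is an assumption for illustration, not LiftOn's implementation; the scoring values (match 1, mismatch -1, gap -1) and the function name `global_identity` are hypothetical, and the Evaluation metrics section remains the authority on the definition LiftOn uses.

```python
def global_identity(ref: str, query: str,
                    match: int = 1, mismatch: int = -1, gap: int = -1) -> float:
    """Identity = matching residues / alignment length (Needleman-Wunsch)."""
    n, m = len(ref), len(query)
    if n == 0 or m == 0:
        return 0.0

    # Fill the dynamic-programming score matrix.
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if ref[i - 1] == query[j - 1] else mismatch
            score[i][j] = max(score[i - 1][j - 1] + sub,
                              score[i - 1][j] + gap,
                              score[i][j - 1] + gap)

    # Trace back one optimal alignment, counting matches and columns.
    i, j, matches, length = n, m, 0, 0
    while i > 0 or j > 0:
        if i > 0 and j > 0:
            same = ref[i - 1] == query[j - 1]
            sub = match if same else mismatch
            if score[i][j] == score[i - 1][j - 1] + sub:
                matches += int(same)
                i, j, length = i - 1, j - 1, length + 1
                continue
        if i > 0 and score[i][j] == score[i - 1][j] + gap:
            i -= 1  # gap in the query
        else:
            j -= 1  # gap in the reference
        length += 1
    return matches / length


identical = global_identity("MKTAY", "MKTAY")
one_substitution = global_identity("MKTAY", "MKTCY")
truncated = global_identity("MKTAY", "MKT")
print(identical, one_substitution, truncated)
```

Under this definition a truncated protein is penalized through the gap columns, which is why a prematurely terminated annotation scores below 1 even when every aligned residue matches.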

Figure 37 compares the protein-coding gene mapping of Liftoff, which is based on DNA alignment, with that of miniprot, which uses protein-to-DNA alignment. Dots in the lower right are transcripts for which Liftoff achieved higher protein sequence identity than miniprot, and dots in the upper left are transcripts for which miniprot did better. Because neither approach dominates the other, LiftOn applies the protein-maximization (PM) algorithm to combine both and produce an improved protein-coding gene annotation.

../../_images/parasail_identities.png — Figure 37 The scatter plot of protein sequence identity comparing between miniprot (y-axis) and Liftoff (x-axis). Each dot represents a protein-coding transcript.#

Next, we compare LiftOn with Liftoff and with miniprot individually. Against Liftoff (Figure 38, left), 8710 transcripts have higher protein sequence identity in the LiftOn annotation, 245 of which reach 100% identity. Against miniprot (Figure 38, right), 35167 protein-coding transcripts have higher identity in the LiftOn annotation, 6744 of which become identical to the reference protein.

../../_images/combined_scatter_plots.png — Figure 38 The scatter plot of protein sequence identity comparing between LiftOn (y-axis) and Liftoff (x-axis) (left) and comparing between LiftOn (y-axis) and miniprot (x-axis) (right).#
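
Counts like those above can be derived from per-transcript identity scores. The sketch below assumes each transcript has one identity per tool; the `TranscriptScore` record, the function name, and the example values are hypothetical and do not describe the actual layout of score.txt. It also assumes that "reaching 100% identity" means the LiftOn score is higher than the baseline and equal to 1.

```python
from typing import Iterable, NamedTuple, Tuple


class TranscriptScore(NamedTuple):
    # Hypothetical record: one protein sequence identity per tool.
    transcript_id: str
    liftoff: float
    miniprot: float
    lifton: float


def summarize_improvement(scores: Iterable[TranscriptScore],
                          baseline: str) -> Tuple[int, int]:
    """Return (transcripts where LiftOn beats the baseline,
    how many of those reach identity 1.0)."""
    improved = 0
    reached_identical = 0
    for s in scores:
        if s.lifton > getattr(s, baseline):
            improved += 1
            if s.lifton == 1.0:
                reached_identical += 1
    return improved, reached_identical


# Made-up scores for illustration only.
scores = [
    TranscriptScore("tx1", 0.90, 0.95, 1.00),
    TranscriptScore("tx2", 1.00, 0.80, 1.00),
    TranscriptScore("tx3", 0.50, 0.70, 0.75),
    TranscriptScore("tx4", 0.60, 0.60, 0.60),
]
vs_liftoff = summarize_improvement(scores, "liftoff")
vs_miniprot = summarize_improvement(scores, "miniprot")
print(vs_liftoff, vs_miniprot)
```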

We also plot the transcripts in 3-D using the LiftOn, Liftoff, and miniprot scores (see Figure 39) to compare all three tools at once. A dot above the \(x=y\) plane indicates that the LiftOn annotation of that protein-coding transcript yields a longer valid protein sequence that aligns to the full-length reference protein. Most dots lie above the \(x=y\) plane, suggesting that LiftOn produces the better annotation for most transcripts.

../../_images/3d_scatter.png — Figure 39 The 3-D scatter plot of protein sequence identity comparing between LiftOn (y-axis), Liftoff (x-axis), and miniprot (z-axis).#

Next, we check the distribution of protein sequence identities (see Figure 40). Among the three tools, LiftOn (middle) exhibits the smallest left tail, with 972 protein-coding transcripts having a protein sequence identity of \(< 0.4\).

../../_images/combined_frequency_log.png — Figure 40 Frequency plots in logarithmic scale of protein sequence identity for Liftoff (left), LiftOn (middle), and miniprot (right) for the results of human_to_chimp lift-over.#
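
The left tail and the binned frequencies behind a plot like Figure 40 can be computed in a few lines. This is a minimal sketch under assumptions: identities are floats in the range 0 to 1, the bin count of 10 is arbitrary, and the function names and example values are hypothetical.

```python
from typing import Iterable, List


def count_below(identities: Iterable[float], threshold: float = 0.4) -> int:
    """Number of transcripts whose identity is strictly below the threshold."""
    return sum(1 for x in identities if x < threshold)


def histogram(identities: Iterable[float], n_bins: int = 10) -> List[int]:
    """Equal-width bins over [0, 1]; an identity of exactly 1.0 goes in the last bin."""
    counts = [0] * n_bins
    for x in identities:
        idx = min(int(x * n_bins), n_bins - 1)
        counts[idx] += 1
    return counts


# Made-up identities for illustration only.
identities = [0.05, 0.35, 0.42, 0.95, 1.0, 1.0]
low_tail = count_below(identities)
bins = histogram(identities)
print(low_tail, bins)
```

Because most transcripts sit at or near an identity of 1, plotting these counts on a logarithmic y-axis keeps the sparse low-identity bins visible.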

Finding extra copies of lift-over features#

LiftOn also has a module that finds extra copies using intervaltree, Liftoff, and miniprot. The Circos plot in Figure 41 shows the relative positions of these copies in the two genomes. It illustrates that the extra copies are predominantly located on the same chromosomes in both GRCh38 and NHGRI_mPanTro. The frequency plot of extra-copy features is shown in Figure 42.

../../_images/circos_plot.png — Figure 41 Circos plot illustrating the locations of extra gene copies found on NHGRI_mPanTro (left side) compared to GRCh38 (right side). Each line shows the location of an extra copy, and lines are color-coded by the chromosome of the original copy.#

../../_images/frequency.png — Figure 42 Frequency plot of additional gene copies.#
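
The role of interval overlap in finding extra copies can be illustrated with a small sketch. It assumes, as one plausible reading and not as a description of LiftOn's module, that a candidate locus counts as an extra copy only when it does not overlap any locus that is already annotated. A linear scan stands in for the intervaltree library to keep the example dependency-free, and all names and coordinates are hypothetical.

```python
from collections import Counter
from typing import List, Sequence, Tuple

Locus = Tuple[str, int, int]  # (chromosome, start, end), 1-based inclusive


def overlaps(a: Locus, b: Locus) -> bool:
    """True if both loci are on the same chromosome and share at least one base."""
    return a[0] == b[0] and a[1] <= b[2] and b[1] <= a[2]


def extra_copies(candidates: Sequence[Tuple[str, Locus]],
                 annotated: Sequence[Locus]) -> List[Tuple[str, Locus]]:
    """Keep candidate (gene_id, locus) pairs that overlap no annotated locus."""
    return [(gene, locus) for gene, locus in candidates
            if not any(overlaps(locus, a) for a in annotated)]


# Made-up loci for illustration only.
annotated = [("chr1", 100, 500), ("chr2", 1000, 2000)]
candidates = [
    ("geneA", ("chr1", 400, 800)),    # overlaps an annotated locus, rejected
    ("geneA", ("chr1", 5000, 5400)),
    ("geneA", ("chr3", 100, 500)),
    ("geneB", ("chr2", 3000, 4000)),
]
kept = extra_copies(candidates, annotated)
copies_per_gene = Counter(gene for gene, _ in kept)
# Frequency of "number of extra copies per gene".
copy_frequency = Counter(copies_per_gene.values())
print(copies_per_gene, copy_frequency)
```

An interval tree answers the same overlap query in logarithmic rather than linear time, which matters when tens of thousands of loci are checked.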

Finally, we examined the order of protein-coding genes (Figure 43) between the two genomes and observed that, as expected, nearly all genes occur in the same order and orientation in both the human and chimpanzee genomes.

../../_images/gene_order_plot.png — Figure 43 Protein-coding gene order plot, with the x-axis representing the reference genome (GRCh38) and the y-axis representing the target genome (NHGRI_mPanTro). The protein sequence identities are color-coded on a logarithmic scale, ranging from green to red. Green represents a sequence identity score of 1, while red corresponds to a sequence identity score of 0.#
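
Conservation of gene order can also be summarized numerically. The sketch below is an illustrative measure, not the method used to produce Figure 43: it sorts genes on one chromosome by their reference coordinate and reports the fraction of adjacent gene pairs that keep the same relative order in the target. The function name and coordinates are hypothetical, and orientation is ignored.

```python
from typing import Sequence, Tuple


def fraction_collinear(pairs: Sequence[Tuple[int, int]]) -> float:
    """pairs: (reference_start, target_start) for genes on one chromosome.
    Returns the fraction of reference-adjacent gene pairs whose order
    is preserved in the target."""
    ordered = sorted(pairs)  # sort by reference coordinate
    if len(ordered) < 2:
        return 1.0
    preserved = sum(1 for (_, t1), (_, t2) in zip(ordered, ordered[1:])
                    if t2 > t1)
    return preserved / (len(ordered) - 1)


# Made-up coordinates for illustration only; the fourth gene is rearranged.
genes = [(100, 120), (200, 230), (300, 310), (400, 250), (500, 520)]
collinearity = fraction_collinear(genes)
print(collinearity)
```

A value close to 1 corresponds to the near-diagonal pattern expected for two closely related genomes.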

What's next?#

Congratulations! You have finished this tutorial.