T2T Annotation

RefSeq annotation of the CHM13 genome

The RefSeq annotation of the CHM13 genome is available here. As of July 2023, the current release is version 5.1, which fixes some special character issues in RefSeq. The annotation can also be found on the T2T CHM13 GitHub site, https://github.com/marbl/CHM13.

This annotation was created with the Liftoff program [1] (v1.6.3, with options -copies -sc 0.95 -polish -exclude_partial -chroms), which mapped all human genes in RefSeq [2] annotation release 110 from the GRCh38.p14 genome onto the CHM13v2.0 genome, a complete, gap-free human genome published by the Telomere-to-Telomere (T2T) consortium [3].
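As a rough illustration of how the options listed above fit together, the following Python sketch assembles a Liftoff command line. The file names are hypothetical placeholders, not the files used for this release, and the layout (target FASTA first, then reference FASTA, with the annotation passed via -g and the output via -o) follows Liftoff's documented usage rather than the exact invocation used here.

```python
import shlex


def build_liftoff_command(target_fasta, reference_fasta, annotation_gff,
                          chroms_file, output_gff):
    """Return a Liftoff argument list using the options quoted in the text."""
    return [
        "liftoff",
        "-g", annotation_gff,    # annotation to lift over (GFF/GTF)
        "-o", output_gff,        # file that receives the lifted annotation
        "-chroms", chroms_file,  # file of reference,target chromosome pairs
        "-copies",               # also look for additional gene copies
        "-sc", "0.95",           # minimum sequence identity for a copy
        "-polish",               # realign exons to restore start/stop codons
        "-exclude_partial",      # keep partial mappings out of the output GFF
        target_fasta,            # positional: assembly being annotated
        reference_fasta,         # positional: assembly the annotation comes from
    ]


# All paths below are hypothetical examples.
command = build_liftoff_command(
    target_fasta="chm13v2.0.fa",
    reference_fasta="GRCh38.p14.fa",
    annotation_gff="GRCh38.p14_refseq_110.gff",
    chroms_file="chroms.txt",
    output_gff="chm13v2.0_refseq_liftoff.gff",
)
print(shlex.join(command))
```

Printing the command instead of running it keeps the sketch free of side effects; passing the list to subprocess.run would execute it if Liftoff is installed.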

Of the 58,700 genes (19,871 protein coding) and 179,372 transcripts (130,361 protein coding) annotated on the main chromosomes and unplaced contigs in GRCh38, we successfully lifted over 58,272 genes (19,767 protein coding) and 178,688 transcripts (130,426 protein coding). We also used the "-copies" option in Liftoff to identify additional gene copies in CHM13, with a minimum sequence identity threshold of 95% at the DNA level. With this threshold, we found 2,393 additional gene copies (239 protein coding). Finally, we added the ribosomal DNA (rDNA) annotations on the acrocentric chromosomes from the CHM13v2.0 annotation, which were created using the Comparative Annotation Toolkit (CAT) [4] and Liftoff. These steps resulted in a total gene count of 61,322 (20,006 protein coding) and a total transcript count of 181,715 (130,426 protein coding). The table below shows the number of genes of each biotype in GRCh38 and CHM13.
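The gene counts quoted above can be checked with a few lines of arithmetic. The sketch below uses only numbers stated in the text; the variable names are illustrative, and the subtotals are the counts before the rDNA annotations are added.

```python
# Gene counts quoted in the text, as (all genes, protein-coding genes).
grch38 = (58_700, 19_871)
lifted = (58_272, 19_767)
extra_copies = (2_393, 239)

summary = {}
for label, index in (("all genes", 0), ("protein coding", 1)):
    fraction_lifted = lifted[index] / grch38[index]
    subtotal = lifted[index] + extra_copies[index]
    summary[label] = (fraction_lifted, subtotal)
    print(f"{label}: {fraction_lifted:.2%} lifted, "
          f"{subtotal:,} after adding extra copies")
```

This gives roughly 99.27% of all genes and 99.48% of protein-coding genes lifted over. The protein-coding subtotal, 19,767 + 239 = 20,006, equals the protein-coding total stated above, and the all-gene subtotal is 58,272 + 2,393 = 60,665.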

Gene biotype	Number of genes in GRCh38	Number of genes mapped onto CHM13
protein coding	19871	20006
lncRNA	17793	18389
pseudogene	15357	16030
miRNA	1914	2047
transcribed pseudogene	1221	1262
snoRNA	1195	1188
tRNA	431	522
V segment	239	245
V segment pseudogene	189	209
snRNA	153	192
J segment	98	79
ncRNA	51	49
rRNA	38	325
misc RNA	36	37
D segment	32	0
C region	21	23
antisense RNA	19	19
other	14	13
J segment pseudogene	7	6
C region pseudogene	5	5
Y RNA	4	7
vault RNA	4	4
scRNA	4	4
telomerase RNA	1	1
RNase P RNA	1	1
RNase MRP RNA	1	1
ncRNA pseudogene	1	1
Total	58700	60665
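A per-biotype tally like the one in the table can be reproduced from a GFF3 file. The sketch below assumes the RefSeq convention that gene-level features have type "gene" or "pseudogene" and carry a gene_biotype attribute in column 9; the sample records are made up for illustration and are not taken from the CHM13 annotation.

```python
from collections import Counter


def count_gene_biotypes(gff_lines):
    """Count gene-level GFF3 features by their gene_biotype attribute."""
    counts = Counter()
    for line in gff_lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines, directives and comments
        columns = line.rstrip("\n").split("\t")
        # Assumption: gene-level records are typed "gene" or "pseudogene".
        if len(columns) != 9 or columns[2] not in ("gene", "pseudogene"):
            continue
        attributes = dict(
            field.split("=", 1)
            for field in columns[8].split(";")
            if "=" in field
        )
        counts[attributes.get("gene_biotype", "unknown")] += 1
    return counts


# Hypothetical records, for illustration only.
sample = [
    "##gff-version 3\n",
    "chr1\tLiftoff\tgene\t100\t900\t.\t+\t.\tID=gene-A;gene_biotype=protein_coding\n",
    "chr1\tLiftoff\tmRNA\t100\t900\t.\t+\t.\tID=rna-A;Parent=gene-A\n",
    "chr1\tLiftoff\tgene\t1500\t1600\t.\t-\t.\tID=gene-B;gene_biotype=lncRNA\n",
    "chr2\tLiftoff\tpseudogene\t50\t400\t.\t+\t.\tID=gene-C;gene_biotype=pseudogene\n",
]
biotype_counts = count_gene_biotypes(sample)
for biotype, n in sorted(biotype_counts.items()):
    print(biotype, n)
```

To run it on a real annotation, pass an open file handle in place of the sample list, since the function only needs an iterable of lines.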

A comparison of the genes using gene IDs to match annotations is shown in the figure below. Genes were compared between RefSeq v110 on GRCh38, the Liftoff version of RefSeq on CHM13, and the NCBI version of RefSeq on CHM13. The NCBI version, shown in blue, was produced using the Gnomon pipeline, which only mapped approximately half of the protein coding genes.

Figure: Venn diagrams showing the overlaps between the RefSeq annotation of GRCh38 (yellow), the RefSeq annotation of CHM13 produced here by Liftoff (red), and the NCBI RefSeq annotation produced internally at NCBI (blue) using the Gnomon pipeline. The large orange regions show the near-complete overlap between our RefSeq annotation (Liftoff-based) and the GRCh38 annotation.
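The region sizes in a three-way Venn diagram of this kind can be computed with plain set operations once the gene IDs from each annotation have been collected. The following is a minimal sketch, not the script used to draw the figure; the function name and the toy IDs are hypothetical.

```python
def venn3_counts(a, b, c):
    """Return the size of each of the seven regions of a three-set Venn diagram."""
    a, b, c = set(a), set(b), set(c)
    return {
        "a_only": len(a - b - c),
        "b_only": len(b - a - c),
        "c_only": len(c - a - b),
        "a_b_only": len((a & b) - c),
        "a_c_only": len((a & c) - b),
        "b_c_only": len((b & c) - a),
        "a_b_c": len(a & b & c),
    }


# Toy gene IDs, for illustration only.
grch38_ids = {"1", "2", "3", "4"}
liftoff_ids = {"1", "2", "3", "5"}
ncbi_ids = {"1", "2", "6"}

regions = venn3_counts(grch38_ids, liftoff_ids, ncbi_ids)
for name, size in regions.items():
    print(name, size)
```

Here a, b and c stand for the GRCh38, Liftoff-based CHM13 and NCBI CHM13 gene ID sets, respectively. The seven region sizes always add up to the size of the union of the three sets, which is a convenient sanity check.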

1. Shumate, A. & Salzberg, S. L. Liftoff: accurate mapping of gene annotations. Bioinformatics 37, 1639–1643 (2021).

2. O’Leary, N. A. et al. Reference sequence (RefSeq) database at NCBI: current status, taxonomic expansion, and functional annotation. Nucleic Acids Research 44, D733–D745 (2016).

3. Nurk, S. et al. The complete sequence of a human genome. Science 376, 44–53 (2022).

4. Fiddes, I. T. et al. Comparative Annotation Toolkit (CAT)-simultaneous clade and personal genome annotation. Genome Research 28, 1029–1038 (2018).