CHESS

Summary of attributes

Attribute	Possible Values	Explanation
transcript_id	alphanumeric string	Unique identifier for each transcript.
gene_id	alphanumeric string	Unique identifier for each gene.
gene_name	alphanumeric string	Symbolic name for the gene.
gene_type	transcribed_pseudogene, miRNA, lncRNA, pseudogene, protein_coding, snRNA, snoRNA, antisense_RNA, ncRNA, tRNA, TEC, ncRNA_pseudogene, misc_RNA, V_segment, rRNA, other, C_region, J_segment, V_segment_pseudogene, telomerase_RNA, vault_RNA, D_segment, J_segment_pseudogene, Y_RNA, RNase_MRP_RNA, scRNA, RNase_P_RNA, C_region_pseudogene	Gene biotype as defined by external sources.
db_xref	alphanumeric string	Other known identifiers for the transcript.
assembly_id	alphanumeric string	Identifier used to track transcripts across assembly levels. This identifier allows matching against other files we provide, such as the TieBrush filtered set of ~1 million transcripts or the full set of ~26 million assembled isoforms.
original_source	BestRefSeq, HAVANA, Curated Genomic, havana, Gnomon, ensembl_havana, cmsearch, StringTie, FANTOM, tRNAscan-SE, ensembl, ENSEMBL, RefSeq	The annotation source that the assembled transcript matched, or from which the transcript was taken.
max_TPM	numeric value	Maximum value of Transcripts Per Million (TPM) in assembled transcripts.
sample_count	numeric value	Number of samples where this transcript was expressed.
tag	MANE_Select, partial, duplicated_transcript	Additional information about the transcript.
description	alphanumeric string	Extended description of the gene based on the RefSeq annotation. Only assigned to gene records.
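
As a minimal sketch of how these attributes might be read programmatically, the Python snippet below parses a GTF-style attribute column into a dictionary. It assumes attributes are stored as `key "value";` pairs with quoted values, which is not stated above; the function name `parse_attributes` and the example record are hypothetical and made up for illustration.

```python
import re

# Assumption: attributes follow the GTF convention  key "value";
_ATTR_RE = re.compile(r'\s*([^\s";]+)\s+"([^"]*)"\s*;')


def parse_attributes(column):
    """Parse a GTF-style attribute column into a dict.

    If a key appears more than once, the last value wins.
    """
    return {key: value for key, value in _ATTR_RE.findall(column)}


# Hypothetical record; identifiers and values are invented for illustration.
example = (
    'transcript_id "CHS.1.1"; gene_id "CHS.1"; gene_type "lncRNA"; '
    'max_TPM "12.5"; sample_count "42";'
)

attrs = parse_attributes(example)

# Numeric attributes arrive as strings and need explicit conversion.
max_tpm = float(attrs["max_TPM"])
sample_count = int(attrs["sample_count"])
```

All values come back as strings, so numeric attributes such as max_TPM and sample_count have to be converted before they are compared or sorted.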

Extended description of gene types

Gene Type	Count	Explanation
transcribed_pseudogene	1954	A gene that has sequence similarities with known pseudogenes, but has transcription evidence.
miRNA	5151	MicroRNA genes.
lncRNA	36356	A long non-coding RNA that does not encode a protein, but has various functions in gene regulation.
pseudogene	16521	A gene that has lost its protein-coding ability, but still has sequence similarities with known coding genes.
protein_coding	105328	Protein-coding genes.
snRNA	159	A gene that codes for a small nuclear RNA.
snoRNA	1251	A gene that codes for a small nucleolar RNA.
antisense_RNA	37	A gene that produces a non-coding RNA complementary to another RNA, resulting in gene regulation.
ncRNA	28	A gene that codes for a non-coding RNA with various functions, such as gene regulation and RNA processing.
tRNA	643	Transfer RNA genes
TEC	28	To be Experimentally Confirmed: loci with transcription evidence that could indicate protein-coding genes but still require experimental validation.
ncRNA_pseudogene	1	Non-coding RNA pseudogene genes
misc_RNA	83	Miscellaneous RNA genes
V_segment	342	Immunoglobulin variable gene segments
rRNA	40	Ribosomal RNA genes
other	24	Genes that do not fit into other categories
C_region	36	Immunoglobulin constant gene regions
J_segment	117	Immunoglobulin joining gene segments
V_segment_pseudogene	282	Immunoglobulin variable pseudogene genes
telomerase_RNA	1	Telomerase RNA genes
vault_RNA	4	Vault RNA genes
D_segment	61	Immunoglobulin diversity gene segments
J_segment_pseudogene	11	Immunoglobulin joining pseudogene genes
Y_RNA	4	Y RNA genes
RNase_MRP_RNA	1	Ribonuclease MRP RNA genes
scRNA	4	Small cytoplasmic RNA genes that are transcribed by RNA polymerase III and often involved in processing of other RNA molecules.
RNase_P_RNA	1	Ribonuclease P RNA genes
C_region_pseudogene	7	A pseudogene that is derived from an immunoglobulin constant region gene.
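
A tally like the Count column above could be reproduced along the following lines. This is only a sketch: it assumes each record has already been parsed into a dictionary of attributes, and the table does not state which records the published counts are tallied over. The helper name `count_gene_types` and the sample records are hypothetical.

```python
from collections import Counter


def count_gene_types(records):
    """Count gene_type values over an iterable of attribute dicts.

    Records without a gene_type attribute are skipped.
    """
    return Counter(r["gene_type"] for r in records if "gene_type" in r)


# Hypothetical records, invented for illustration.
records = [
    {"gene_id": "G1", "gene_type": "protein_coding"},
    {"gene_id": "G2", "gene_type": "lncRNA"},
    {"gene_id": "G3", "gene_type": "protein_coding"},
    {"gene_id": "G4"},  # no gene_type attribute, so it is ignored
]

counts = count_gene_types(records)

# List gene types from most to least frequent.
for gene_type, n in counts.most_common():
    print(f"{gene_type}\t{n}")
```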

Types of Sources used in generation of the CHESS 3 dataset

Source	Count	Description
BestRefSeq	72360	Annotations derived by NCBI from alignments of known RefSeq transcripts to the genome.
HAVANA	12184	A set of annotations generated by the HAVANA group at the Wellcome Trust Sanger Institute.
Curated Genomic	16335	Annotations curated by NCBI.
Gnomon	16555	Gene predictions generated by the NCBI's Gnomon pipeline.
ensembl_havana	14981	Annotations in which the Ensembl automatic annotation and the HAVANA manual annotation were merged.
cmsearch	1184	Annotations generated by searching for conserved RNA structures using the Infernal software package.
StringTie	33783	Gene predictions generated by the StringTie software.
FANTOM	318	CHESS 2 transcripts with corroborating evidence from the FANTOM project.
tRNAscan-SE	621	Predicted tRNA genes identified by the tRNAscan-SE software.
ENSEMBL	117	Gene predictions generated by the Ensembl project.
RefSeq	37	The NCBI RefSeq annotation set.
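
The original_source, max_TPM, and sample_count attributes can be combined to select a subset of transcripts, for example those from a given source with a minimum level of expression support. The sketch below assumes records are dictionaries of string-valued attributes; the function name `passes_filter`, the thresholds, and the sample records are hypothetical and are not recommendations from the CHESS authors.

```python
def passes_filter(record, sources, min_tpm=0.0, min_samples=0):
    """Return True if a record comes from an allowed source and meets
    both expression thresholds. Records with missing or malformed
    numeric attributes are rejected.
    """
    if record.get("original_source") not in sources:
        return False
    try:
        tpm = float(record["max_TPM"])
        n_samples = int(record["sample_count"])
    except (KeyError, ValueError):
        return False
    return tpm >= min_tpm and n_samples >= min_samples


# Hypothetical records, invented for illustration.
records = [
    {"transcript_id": "T1", "original_source": "StringTie", "max_TPM": "3.2", "sample_count": "15"},
    {"transcript_id": "T2", "original_source": "StringTie", "max_TPM": "0.4", "sample_count": "2"},
    {"transcript_id": "T3", "original_source": "BestRefSeq", "max_TPM": "50.0", "sample_count": "900"},
    {"transcript_id": "T4", "original_source": "StringTie"},  # no expression attributes
]

selected = [
    r["transcript_id"]
    for r in records
    if passes_filter(r, {"StringTie"}, min_tpm=1.0, min_samples=10)
]
```

Rejecting records that lack max_TPM or sample_count is a design choice of this sketch; a different policy may suit records that carry no expression attributes.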

Publications

Varabyou, A., Sommer, M. J., Erdogdu, B., Shinder, I., Minkin, I., ... & Pertea, M. (2023). CHESS 3: an improved, comprehensive catalog of human genes and transcripts based on large-scale expression data, phylogenetic analysis, and protein structure. Genome Biology, 24(1), 249.

Pertea, M., Shumate, A., Pertea, G., Varabyou, A., Breitwieser, F. P., Chang, Y. C., ... & Salzberg, S. L. (2018). CHESS: a new human gene catalog curated from thousands of large-scale RNA sequencing experiments reveals extensive transcriptional noise. Genome Biology, 19(1), 1-14.

Contact

The CHESS project is led by Mihaela Pertea, Steven Salzberg, and Ales Varabyou at Johns Hopkins University.

Issues

Please submit your comments and report any issues you encounter when using the CHESS annotation HERE

Previous Website

The previous version of this webpage can be accessed at: https://ccb.jhu.edu/chess_backup_08102022/index.shtml.