Summary of attributes

| Attribute | Possible Values | Explanation |
| --- | --- | --- |
| transcript_id | alphanumeric string | Unique identifier for each transcript. |
| gene_id | alphanumeric string | Unique identifier for each gene. |
| gene_name | alphanumeric string | Symbolic name for the gene. |
| gene_type | transcribed_pseudogene, miRNA, lncRNA, pseudogene, protein_coding, snRNA, snoRNA, antisense_RNA, ncRNA, tRNA, TEC, ncRNA_pseudogene, misc_RNA, V_segment, rRNA, other, C_region, J_segment, V_segment_pseudogene, telomerase_RNA, vault_RNA, D_segment, J_segment_pseudogene, Y_RNA, RNase_MRP_RNA, scRNA, RNase_P_RNA, C_region_pseudogene | Gene biotype as defined by external sources. |
| db_xref | alphanumeric string | Other known identifiers for the transcript. |
| assembly_id | alphanumeric string | Identifier used to track transcripts across assembly levels. It allows matching against other files we provide, such as the TieBrush filtered set of ~1 million transcripts or the full set of ~26 million assembled isoforms. |
| original_source | BestRefSeq, HAVANA, Curated Genomic, havana, Gnomon, ensembl_havana, cmsearch, StringTie, FANTOM, tRNAscan-SE, ensembl, ENSEMBL, RefSeq | The annotation source that the assembled transcript matched, or from which the transcript was taken. |
| max_TPM | numeric value | Maximum Transcripts Per Million (TPM) value observed among the assembled transcripts. |
| sample_count | numeric value | Number of samples in which this transcript was expressed. |
| tag | MANE_Select, partial, duplicated_transcript | Additional information about the transcript. |
| description | alphanumeric string | Extended description of the gene based on the RefSeq annotation. Only assigned to gene records. |
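
If these attributes are read from the ninth column of a GTF file (an assumption; the table itself does not fix a file format), they can be pulled apart with a small parser. The sketch below is illustrative only: the function name `parse_gtf_attributes` and the example record, including its identifiers and values, are hypothetical. Values are collected in lists because attributes such as tag or db_xref could plausibly appear more than once on a record.

```python
import re

# Matches GTF-style pairs such as: gene_id "G1"; max_TPM "12.5";
_ATTR_RE = re.compile(r'(\S+)\s+"([^"]*)"\s*;?')


def parse_gtf_attributes(field):
    """Return a dict mapping each attribute key to a list of its values."""
    attrs = {}
    for key, value in _ATTR_RE.findall(field):
        attrs.setdefault(key, []).append(value)
    return attrs


# Hypothetical attribute column; identifiers and numbers are made up.
example = (
    'transcript_id "T1.1"; gene_id "G1"; gene_type "lncRNA"; '
    'max_TPM "12.5"; sample_count "40"; tag "partial";'
)
attrs = parse_gtf_attributes(example)

# Numeric attributes arrive as strings and need explicit conversion.
max_tpm = float(attrs["max_TPM"][0])
sample_count = int(attrs["sample_count"][0])
```

With the values converted, a record can be filtered on expression, for example by keeping only transcripts whose `sample_count` exceeds a chosen threshold.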

Extended description of gene types

| Gene Type | Count | Explanation |
| --- | --- | --- |
| transcribed_pseudogene | 1954 | A gene that has sequence similarities with known pseudogenes, but has transcription evidence. |
| miRNA | 5151 | MicroRNA genes. |
| lncRNA | 36356 | A long non-coding RNA that does not encode a protein, but has various functions in gene regulation. |
| pseudogene | 16521 | A gene that has lost its protein-coding ability, but still has sequence similarities with known coding genes. |
| protein_coding | 105328 | Protein-coding genes. |
| snRNA | 159 | A gene that codes for a small nuclear RNA. |
| snoRNA | 1251 | A gene that codes for a small nucleolar RNA. |
| antisense_RNA | 37 | A gene that produces a non-coding RNA complementary to another RNA, resulting in gene regulation. |
| ncRNA | 28 | A gene that codes for a non-coding RNA with various functions, such as gene regulation and RNA processing. |
| tRNA | 643 | Transfer RNA genes. |
| TEC | 28 | To be Experimentally Confirmed: regions with transcription evidence that may represent protein-coding genes but still require experimental validation. |
| ncRNA_pseudogene | 1 | Pseudogenes of non-coding RNA genes. |
| misc_RNA | 83 | Miscellaneous RNA genes. |
| V_segment | 342 | Immunoglobulin variable gene segments. |
| rRNA | 40 | Ribosomal RNA genes. |
| other | 24 | Genes that do not fit into other categories. |
| C_region | 36 | Immunoglobulin constant gene regions. |
| J_segment | 117 | Immunoglobulin joining gene segments. |
| V_segment_pseudogene | 282 | Immunoglobulin variable segment pseudogenes. |
| telomerase_RNA | 1 | Telomerase RNA genes. |
| vault_RNA | 4 | Vault RNA genes. |
| D_segment | 61 | Immunoglobulin diversity gene segments. |
| J_segment_pseudogene | 11 | Immunoglobulin joining segment pseudogenes. |
| Y_RNA | 4 | Y RNA genes. |
| RNase_MRP_RNA | 1 | Ribonuclease MRP RNA genes. |
| scRNA | 4 | Small cytoplasmic RNA genes that are transcribed by RNA polymerase III and often involved in processing of other RNA molecules. |
| RNase_P_RNA | 1 | Ribonuclease P RNA genes. |
| C_region_pseudogene | 7 | A pseudogene that is derived from an immunoglobulin constant region gene. |
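
A tally like the Count column above can be recomputed from an annotation file. The following is a minimal sketch, not the procedure used to build the table. It assumes tab-separated GTF lines, that `gene_type` is present on `transcript` records, and that counts are taken per transcript. The function name `count_gene_types` and the sample records are hypothetical.

```python
from collections import Counter


def count_gene_types(gtf_lines):
    """Count transcript records per gene_type in GTF-formatted lines."""
    counts = Counter()
    for line in gtf_lines:
        # Skip comments and blank lines.
        if line.startswith("#") or not line.strip():
            continue
        cols = line.rstrip("\n").split("\t")
        # Only transcript records are counted (assumption, see lead-in).
        if len(cols) < 9 or cols[2] != "transcript":
            continue
        # Find the gene_type pair in the attribute column.
        for part in cols[8].split(";"):
            part = part.strip()
            if part.startswith("gene_type "):
                counts[part.split(" ", 1)[1].strip('"')] += 1
                break
    return counts


# Hypothetical records for illustration only.
sample = [
    "# made-up example\n",
    'chr1\tCHESS\ttranscript\t100\t900\t.\t+\t.\ttranscript_id "T1"; gene_id "G1"; gene_type "lncRNA";\n',
    'chr1\tCHESS\texon\t100\t900\t.\t+\t.\ttranscript_id "T1"; gene_id "G1";\n',
    'chr1\tCHESS\ttranscript\t2000\t5000\t.\t-\t.\ttranscript_id "T2"; gene_id "G2"; gene_type "protein_coding";\n',
    'chr2\tCHESS\ttranscript\t300\t700\t.\t+\t.\ttranscript_id "T3"; gene_id "G3"; gene_type "lncRNA";\n',
]
gene_type_counts = count_gene_types(sample)
```

Splitting the attribute column on `;` is adequate for `gene_type`, but a free-text attribute such as description may itself contain semicolons, so a quote-aware parser is safer for general use.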

Types of sources used to generate the CHESS 3 dataset

| Source | Count | Description |
| --- | --- | --- |
| BestRefSeq | 72360 | Known RefSeq transcripts placed on the genome by alignment in the NCBI annotation pipeline. |
| HAVANA | 12184 | Manual annotations generated by the HAVANA group at the Wellcome Trust Sanger Institute. |
| Curated Genomic | 16335 | Annotations derived from curated genomic sequence records at NCBI. |
| Gnomon | 16555 | Gene predictions generated by NCBI's Gnomon pipeline. |
| ensembl_havana | 14981 | Merged annotations on which the Ensembl automated pipeline and HAVANA manual curation agree. |
| cmsearch | 1184 | Annotations generated by searching for conserved RNA structures using the Infernal software package. |
| StringTie | 33783 | Transcripts assembled from RNA-seq data by the StringTie software. |
| FANTOM | 318 | CHESS 2 transcripts with corroborating evidence from the FANTOM project. |
| tRNAscan-SE | 621 | Predicted tRNA genes identified by the tRNAscan-SE software. |
| ENSEMBL | 117 | Gene predictions generated by the Ensembl project. |
| RefSeq | 37 | The NCBI RefSeq annotation set. |
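
As a consistency check, the Count columns of the gene-type table and the source table can be summed and compared. The sketch below copies the numbers from the two tables. Both columns add up to 168,475, which suggests, though the text does not state it, that each count is a number of transcripts rather than genes.

```python
# Counts copied from the gene-type table.
gene_type_counts = {
    "transcribed_pseudogene": 1954, "miRNA": 5151, "lncRNA": 36356,
    "pseudogene": 16521, "protein_coding": 105328, "snRNA": 159,
    "snoRNA": 1251, "antisense_RNA": 37, "ncRNA": 28, "tRNA": 643,
    "TEC": 28, "ncRNA_pseudogene": 1, "misc_RNA": 83, "V_segment": 342,
    "rRNA": 40, "other": 24, "C_region": 36, "J_segment": 117,
    "V_segment_pseudogene": 282, "telomerase_RNA": 1, "vault_RNA": 4,
    "D_segment": 61, "J_segment_pseudogene": 11, "Y_RNA": 4,
    "RNase_MRP_RNA": 1, "scRNA": 4, "RNase_P_RNA": 1,
    "C_region_pseudogene": 7,
}

# Counts copied from the source table.
source_counts = {
    "BestRefSeq": 72360, "HAVANA": 12184, "Curated Genomic": 16335,
    "Gnomon": 16555, "ensembl_havana": 14981, "cmsearch": 1184,
    "StringTie": 33783, "FANTOM": 318, "tRNAscan-SE": 621,
    "ENSEMBL": 117, "RefSeq": 37,
}

gene_type_total = sum(gene_type_counts.values())
source_total = sum(source_counts.values())

# The two breakdowns should describe the same set of records.
totals_match = gene_type_total == source_total
print(gene_type_total, source_total, totals_match)
```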


