Summary of attributes
Attribute | Possible Values | Explanation |
---|---|---|
transcript_id | alphanumeric string | Unique identifier for each transcript. |
gene_id | alphanumeric string | Unique identifier for each gene. |
gene_name | alphanumeric string | Symbolic name for the gene. |
gene_type | transcribed_pseudogene, miRNA, lncRNA, pseudogene, protein_coding, snRNA, snoRNA, antisense_RNA, ncRNA, tRNA, TEC, ncRNA_pseudogene, misc_RNA, V_segment, rRNA, other, C_region, J_segment, V_segment_pseudogene, telomerase_RNA, vault_RNA, D_segment, J_segment_pseudogene, Y_RNA, RNase_MRP_RNA, scRNA, RNase_P_RNA, C_region_pseudogene | Gene biotype as defined by external sources. |
db_xref | alphanumeric string | Other known identifiers for the transcript. |
assembly_id | alphanumeric string | Identifier used to track transcripts across assembly levels. This identifier allows matching against other files we provide, such as the TieBrush filtered set of ~1 million transcripts or the full set of ~26 million assembled isoforms. |
original_source | BestRefSeq, HAVANA, Curated Genomic, havana, Gnomon, ensembl_havana, cmsearch, StringTie, FANTOM, tRNAscan-SE, ensembl, ENSEMBL, RefSeq | The annotation source that the assembled transcript matched, or from which the transcript was taken. |
max_TPM | numeric value | Maximum Transcripts Per Million (TPM) value observed for this transcript among the assembled transcripts. |
sample_count | numeric value | Number of samples where this transcript was expressed. |
tag | MANE_Select, partial, duplicated_transcript | Additional information about the transcript. |
description | alphanumeric string | Extended description of the gene based on the RefSeq annotation. Only assigned to gene records. |
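As an illustration of how these attributes might be read programmatically, here is a minimal sketch that parses one attribute column into a dictionary. It assumes the GTF convention of `key "value";` pairs with quoted values; the function name `parse_attributes` and the example string are hypothetical and are not taken from the CHESS files, so adjust the pattern if your file uses GFF3-style `key=value` pairs.

```python
import re

# Matches GTF-style pairs such as: gene_id "X";
# Assumption: every value is wrapped in double quotes.
_ATTR_RE = re.compile(r'(\S+)\s+"([^"]*)"\s*;')


def parse_attributes(column9: str) -> dict:
    """Parse a GTF-style attribute column into a dict.

    Keys that occur more than once (for example several tag entries)
    are collected into a list, in order of appearance.
    """
    attrs = {}
    for key, value in _ATTR_RE.findall(column9):
        if key not in attrs:
            attrs[key] = value
        elif isinstance(attrs[key], list):
            attrs[key].append(value)
        else:
            attrs[key] = [attrs[key], value]
    return attrs


# Hypothetical attribute string, made up for illustration only.
example = (
    'transcript_id "T1.1"; gene_id "G1"; gene_type "lncRNA"; '
    'max_TPM "12.5"; sample_count "42";'
)
record = parse_attributes(example)

# Numeric attributes arrive as strings and need explicit conversion.
max_tpm = float(record["max_TPM"])
sample_count = int(record["sample_count"])
```

Values are kept as strings on purpose, so that identifiers such as `transcript_id` are never mangled by numeric conversion; only `max_TPM` and `sample_count` are converted by the caller.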
Extended description of gene types
Gene Type | Count | Explanation |
---|---|---|
transcribed_pseudogene | 1954 | A gene that has sequence similarities with known pseudogenes, but has transcription evidence. |
miRNA | 5151 | micro-RNA genes. |
lncRNA | 36356 | A gene that produces a long non-coding RNA, which does not encode a protein but has various functions in gene regulation. |
pseudogene | 16521 | A gene that has lost its protein-coding ability, but still has sequence similarities with known coding genes. |
protein_coding | 105328 | Protein-coding genes. |
snRNA | 159 | A gene that codes for a small nuclear RNA. |
snoRNA | 1251 | A gene that codes for a small nucleolar RNA. |
antisense_RNA | 37 | A gene that produces a non-coding RNA complementary to another RNA, resulting in gene regulation. |
ncRNA | 28 | A gene that codes for a non-coding RNA with various functions, such as gene regulation and RNA processing. |
tRNA | 643 | Transfer RNA genes |
TEC | 28 | To be Experimentally Confirmed: regions with evidence suggesting a possible protein-coding gene that still requires experimental validation. |
ncRNA_pseudogene | 1 | Pseudogenes of non-coding RNA genes. |
misc_RNA | 83 | Miscellaneous RNA genes |
V_segment | 342 | Immunoglobulin variable gene segments |
rRNA | 40 | Ribosomal RNA genes |
other | 24 | Genes that do not fit into other categories |
C_region | 36 | Immunoglobulin constant gene regions |
J_segment | 117 | Immunoglobulin joining gene segments |
V_segment_pseudogene | 282 | Pseudogenes of immunoglobulin variable gene segments. |
telomerase_RNA | 1 | Telomerase RNA genes |
vault_RNA | 4 | Vault RNA genes |
D_segment | 61 | Immunoglobulin diversity gene segments |
J_segment_pseudogene | 11 | Pseudogenes of immunoglobulin joining gene segments. |
Y_RNA | 4 | Y RNA genes |
RNase_MRP_RNA | 1 | Ribonuclease MRP RNA genes |
scRNA | 4 | Small cytoplasmic RNA genes that are transcribed by RNA polymerase III and are often involved in processing of other RNA molecules. |
RNase_P_RNA | 1 | Ribonuclease P RNA genes |
C_region_pseudogene | 7 | A pseudogene that is derived from an immunoglobulin constant region gene. |
Types of Sources used in generation of the CHESS 3 dataset
Source | Count | Description |
---|---|---|
BestRefSeq | 72360 | Known RefSeq transcripts placed on the genome by NCBI through alignment of the best-matching RefSeq sequence. |
HAVANA | 12184 | A set of annotations generated by the HAVANA group at the Wellcome Trust Sanger Institute. |
Curated Genomic | 16335 | Annotations from manual NCBI curation of genomic sequence. |
Gnomon | 16555 | Gene predictions generated by the NCBI's Gnomon pipeline. |
ensembl_havana | 14981 | Transcripts supported by both the Ensembl automated pipeline and HAVANA manual annotation. |
cmsearch | 1184 | Annotations generated by searching for conserved RNA structures using the Infernal software package. |
StringTie | 33783 | Gene predictions generated by the StringTie software. |
FANTOM | 318 | CHESS 2 transcripts with corroborating evidence from the FANTOM project. |
tRNAscan-SE | 621 | Predicted tRNA genes identified by the tRNAscan-SE software. |
ENSEMBL | 117 | Gene predictions generated by the Ensembl project. |
RefSeq | 37 | The NCBI RefSeq annotation set. |
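Per-category counts like those in the Count columns can be tallied from parsed records. The sketch below assumes each record has already been turned into a dictionary of attributes; the helper `count_by_attribute` and the sample records are hypothetical and only illustrate the idea, not how the published counts were produced.

```python
from collections import Counter


def count_by_attribute(records, key):
    """Count how many records carry each value of the given attribute.

    Records that lack the attribute are skipped.
    """
    counts = Counter()
    for attrs in records:
        value = attrs.get(key)
        if value is not None:
            counts[value] += 1
    return counts


# Hypothetical records, made up for illustration only.
records = [
    {"transcript_id": "T1", "original_source": "StringTie"},
    {"transcript_id": "T2", "original_source": "BestRefSeq"},
    {"transcript_id": "T3", "original_source": "StringTie"},
    {"transcript_id": "T4"},  # no original_source attribute
]

source_counts = count_by_attribute(records, "original_source")

# List sources from most to least frequent.
for source, n in source_counts.most_common():
    print(f"{source}\t{n}")
```

The same helper works for `gene_type` or `tag`. Note that values are compared exactly, so `HAVANA` and `havana` are counted as separate sources unless the caller normalizes case first.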