Summary of attributes
| Attribute | Possible Values | Explanation |
|---|---|---|
| transcript_id | alphanumeric string | Unique identifier for each transcript. |
| gene_id | alphanumeric string | Unique identifier for each gene. |
| gene_name | alphanumeric string | Symbolic name for the gene. |
| gene_type | transcribed_pseudogene, miRNA, lncRNA, pseudogene, protein_coding, snRNA, snoRNA, antisense_RNA, ncRNA, tRNA, TEC, ncRNA_pseudogene, misc_RNA, V_segment, rRNA, other, C_region, J_segment, V_segment_pseudogene, telomerase_RNA, vault_RNA, D_segment, J_segment_pseudogene, Y_RNA, RNase_MRP_RNA, scRNA, RNase_P_RNA, C_region_pseudogene | Gene biotype as defined by external sources. |
| db_xref | alphanumeric string | Other known identifiers for the transcript. |
| assembly_id | alphanumeric string | Identifier used to track transcripts across assembly levels. It allows matching against other files we provide, such as the TieBrush-filtered set of ~1 million transcripts or the full set of ~26 million assembled isoforms. |
| original_source | BestRefSeq, HAVANA, Curated Genomic, havana, Gnomon, ensembl_havana, cmsearch, StringTie, FANTOM, tRNAscan-SE, ensembl, ENSEMBL, RefSeq | The annotation source that the assembled transcript matched, or from which the transcript was taken. |
| max_TPM | numeric value | Maximum Transcripts Per Million (TPM) value observed for this transcript among the assembled transcripts. |
| sample_count | numeric value | Number of samples where this transcript was expressed. |
| tag | MANE_Select, partial, duplicated_transcript | Additional information about the transcript. |
| description | alphanumeric string | Extended description of the gene based on the RefSeq annotation. Only assigned to gene records. |
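
As a minimal sketch of how these attributes might be read, the snippet below parses one attribute column into a dictionary. It assumes the annotation is distributed as GTF, with `key "value";` pairs in the ninth column; that format is an assumption here, and the example record is invented for illustration rather than taken from the dataset.

```python
import re

# Assumption: attributes are GTF-style `key "value";` pairs (column 9).
_ATTR_RE = re.compile(r'(\S+)\s+"([^"]*)"\s*;')


def parse_attributes(column9):
    """Return the key/value pairs found in a GTF attribute column.

    If a key occurs more than once (for example several `tag` entries),
    only the last value is kept in this simplified sketch.
    """
    return {key: value for key, value in _ATTR_RE.findall(column9)}


# Hypothetical record: identifiers and values are made up for illustration.
example = (
    'transcript_id "CHS.1.1"; gene_id "CHS.1"; gene_type "lncRNA"; '
    'original_source "Curated Genomic"; max_TPM "12.5"; sample_count "40";'
)

attrs = parse_attributes(example)
max_tpm = float(attrs["max_TPM"])            # numeric attributes arrive as strings
sample_count = int(attrs["sample_count"])
```

Numeric attributes such as `max_TPM` and `sample_count` come back as strings and need explicit conversion, as shown in the last two lines.
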

Extended description of gene types

| Gene Type | Count | Explanation |
|---|---|---|
| transcribed_pseudogene | 1954 | A gene that has sequence similarities with known pseudogenes, but has transcription evidence. |
| miRNA | 5151 | micro-RNA genes. |
| lncRNA | 36356 | A long non-coding RNA that does not encode a protein, but has various functions in gene regulation. |
| pseudogene | 16521 | A gene that has lost its protein-coding ability, but still has sequence similarities with known coding genes. |
| protein_coding | 105328 | Protein-coding genes. |
| snRNA | 159 | A gene that codes for a small nuclear RNA. |
| snoRNA | 1251 | A gene that codes for a small nucleolar RNA. |
| antisense_RNA | 37 | A gene that produces a non-coding RNA complementary to another RNA, resulting in gene regulation. |
| ncRNA | 28 | A gene that codes for a non-coding RNA with various functions, such as gene regulation and RNA processing. |
| tRNA | 643 | Transfer RNA genes |
| TEC | 28 | To be Experimentally Confirmed: regions with transcription evidence that may indicate protein-coding genes but still require experimental validation |
| ncRNA_pseudogene | 1 | Pseudogenes of non-coding RNA genes |
| misc_RNA | 83 | Miscellaneous RNA genes |
| V_segment | 342 | Immunoglobulin variable gene segments |
| rRNA | 40 | Ribosomal RNA genes |
| other | 24 | Genes that do not fit into other categories |
| C_region | 36 | Immunoglobulin constant gene regions |
| J_segment | 117 | Immunoglobulin joining gene segments |
| V_segment_pseudogene | 282 | Pseudogenes of immunoglobulin variable gene segments |
| telomerase_RNA | 1 | Telomerase RNA genes |
| vault_RNA | 4 | Vault RNA genes |
| D_segment | 61 | Immunoglobulin diversity gene segments |
| J_segment_pseudogene | 11 | Pseudogenes of immunoglobulin joining gene segments |
| Y_RNA | 4 | Y RNA genes |
| RNase_MRP_RNA | 1 | Ribonuclease MRP RNA genes |
| scRNA | 4 | Small cytoplasmic RNA genes that are transcribed by RNA polymerase III and often involved in processing of other RNA molecules. |
| RNase_P_RNA | 1 | Ribonuclease P RNA genes |
| C_region_pseudogene | 7 | A pseudogene that is derived from an immunoglobulin constant region gene. |

Types of sources used in generating the CHESS 3 dataset

| Source | Count | Description |
|---|---|---|
| BestRefSeq | 72360 | Known RefSeq transcripts placed on the genome by NCBI through alignment, as opposed to computational prediction. |
| HAVANA | 12184 | A set of annotations generated by the HAVANA group at the Wellcome Trust Sanger Institute. |
| Curated Genomic | 16335 | Annotations derived from curated genomic sequence records at NCBI. |
| Gnomon | 16555 | Gene predictions generated by the NCBI's Gnomon pipeline. |
| ensembl_havana | 14981 | Annotations merged from the Ensembl automated pipeline and HAVANA manual curation. |
| cmsearch | 1184 | Annotations generated by searching for conserved RNA structures using the Infernal software package. |
| StringTie | 33783 | Gene predictions generated by the StringTie software. |
| FANTOM | 318 | CHESS 2 transcripts with corroborating evidence from the FANTOM project. |
| tRNAscan-SE | 621 | Predicted tRNA genes identified by the tRNAscan-SE software. |
| ENSEMBL | 117 | Gene predictions generated by the Ensembl project. |
| RefSeq | 37 | The NCBI RefSeq annotation set. |
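
A per-source tally like the one above could be approximated with a short script. The sketch below counts `original_source` values over `transcript` records of a GTF-style file. Both the file format and the choice to count transcript records (rather than genes) are assumptions, and the sample lines, including the `CHESS` value in the second column, are hypothetical.

```python
import re
from collections import Counter

# Assumption: GTF-style attributes, e.g. original_source "StringTie";
_SOURCE_RE = re.compile(r'original_source\s+"([^"]*)"')


def count_sources(lines):
    """Count original_source values over transcript records."""
    counts = Counter()
    for line in lines:
        if line.startswith("#"):          # skip header/comment lines
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 9 or fields[2] != "transcript":
            continue                      # only transcript-level records
        match = _SOURCE_RE.search(fields[8])
        if match:
            counts[match.group(1)] += 1
    return counts


# Hypothetical records, invented for illustration only.
sample = [
    "# hypothetical header line\n",
    'chr1\tCHESS\ttranscript\t100\t900\t.\t+\t.\ttranscript_id "T1"; original_source "StringTie";\n',
    'chr1\tCHESS\texon\t100\t300\t.\t+\t.\ttranscript_id "T1";\n',
    'chr1\tCHESS\ttranscript\t2000\t2900\t.\t-\t.\ttranscript_id "T2"; original_source "Curated Genomic";\n',
    'chr2\tCHESS\ttranscript\t50\t700\t.\t+\t.\ttranscript_id "T3"; original_source "StringTie";\n',
]

source_counts = count_sources(sample)
```

For a real file, the same function can be fed an open file handle, since it only iterates over lines. Note that the attribute table lists both `HAVANA` and `havana` (and `ENSEMBL` and `ensembl`) as possible values, so a tally keyed on the raw string treats them as distinct sources.
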