CHESS contains virtually all genes from RefSeq (as of mid-2017) and GENCODE. It adds 224 protein-coding genes and 2,671 lncRNAs based on strong experimental and alignment evidence, as described in our 2018 Genome Biology paper (see reference below).
CHESS 2.2 data
Content | Description | Download |
---|---|---|
CHESS gene annotation | This file contains the primary gene set described in the CHESS paper, in GFF format. All genes and transcripts are mapped onto human genome release GRCh38.p8. Included in this file are genes on the reference chromosomes, unmapped scaffolds, assembly patches, and alternate loci. This data set can be visualized in the UCSC Genome Browser. | chess2.2.gff.gz chess2.2.gtf.gz (35 MB download, >1GB uncompressed) |
CHESS gene list | This file is a table showing all 42,611 genes in CHESS release 2.2, in a tab-delimited text file with one gene per line. For each gene it provides features such as gene ID, type, gene name, source of the annotation, location(s), GFF ID(s), and a free text description of the gene. | chess2.2.genes |
CHESS proteins | This FASTA file contains the non-redundant sequences of all the proteins translated from the CHESS protein-coding genes (94,188 sequences). | chess2.2.protein.all.fa.gz |
CHESS proteins (longest per locus) | Subset of all non-redundant protein sequences. Contains the longest protein sequence for each gene locus that has more than one protein (e.g., splice variants). For a given gene locus, all non-redundant longest proteins are included (21,265 sequences). | chess2.2.protein.longest.fa.gz |
Gene annotation for transcriptome assembly | This is a subset of the gene annotation GFF file (chess2.2.gff), containing annotations only on the reference chromosomes and the mitochondrion. It also includes the tRNA and rRNA gene annotations from RefSeq. We recommend using this file with transcriptome assemblers such as StringTie or Cufflinks. | chess2.2_assembly.gff.gz chess2.2_assembly.gtf.gz (34 MB download) |
CHESS plus RefSeq gene annotations | This is a superset of chess2.2.gff. It adds multiple other gene types annotated in RefSeq that are not included in CHESS, such as pseudogenes, V/J/D/C segments, snoRNAs, snRNAs, telomerase RNAs, guide RNAs, etc. Note that many of these elements (e.g., pseudogenes) are not actually genes, but they are included here for users who want everything in RefSeq plus the additional genes in CHESS. | chess2.2_and_refseq.gff.gz chess2.2_and_refseq.gtf.gz (37 MB download) |
GRCh38 patch 8 | This is the version of the reference assembly of the human genome to which coordinates in CHESS annotation correspond. | hg38_p8.fa.gz (947 MB download) |
Mapfile | Provides easy conversion between RefSeq, GENCODE and CHESS IDs | chess2.2.mapfile.txt (5.4 MB download) |
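The GFF files listed above can be read with a few lines of standard-library Python. The sketch below assumes the files follow the usual GFF3 layout (nine tab-separated columns, with `key=value` pairs separated by `;` in the last column) and that special characters in the attributes are URL-escaped, as the News section notes. The function names and the example line are illustrative assumptions and are not part of CHESS.

```python
import gzip
from collections import Counter
from urllib.parse import unquote


def parse_attributes(column9):
    """Split a GFF3 attributes column into a dict, decoding URL escapes."""
    attributes = {}
    for field in column9.strip().split(";"):
        if not field:
            continue  # skip the empty piece left by a trailing ';'
        key, _, value = field.partition("=")
        # Multi-valued attributes (comma-separated) are kept as one string here.
        attributes[unquote(key)] = unquote(value)
    return attributes


def iter_features(lines):
    """Yield (seqid, type, start, end, strand, attributes) per feature line."""
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue  # blank lines, comments, and directives
        cols = line.rstrip("\n").split("\t")
        if len(cols) != 9:
            continue  # not a nine-column feature line
        yield (cols[0], cols[2], int(cols[3]), int(cols[4]), cols[6],
               parse_attributes(cols[8]))


def count_feature_types(path):
    """Count features by type (column 3) in a gzipped GFF3 file."""
    with gzip.open(path, "rt") as handle:
        return Counter(feature[1] for feature in iter_features(handle))


# Made-up example line for illustration; it is not taken from the CHESS files.
example = ("chr1\tEXAMPLE\tgene\t100\t200\t.\t+\t.\t"
           "ID=gene1;Name=A%3BB;description=x%3Dy%26z\n")
for feature in iter_features([example]):
    print(feature)
```

With a local copy of the annotation, `count_feature_types("chess2.2.gff.gz")` would tally the feature types in the file; the path is an assumption about where the download was saved.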
A note about fusion genes
Fusion genes are genes created by a mutation that brings together two genes from two different chromosomes, or from normally non-adjacent regions of the same chromosome. These abnormal genes can be found in specialized fusion gene databases, but because they do not occur on a normal (non-mutated) genome, we have not included them in CHESS.
News
▸ 11/11/2021 - Web database interface added
▹ Available at chess2.2.shtml
▹ Search genes/transcripts by CHESS ID or Gene ID, Type, Name, Location, Description, Database
▸ 10/14/2019 - Protein redundancy fix
▹ Redundant protein sequences removed from the .proteins.fa file
▹ Two files generated: one containing only the longest proteins per gene locus (21,265) and the other containing all non-redundant proteins (94,188)
▹ Added "|"-separated list of supporting transcripts for each sequence in the protein.*.fasta files.
▹ Added "gene_name" attribute for each non-novel entry in the protein.*.fasta files.
▸ 04/19/2019 - Release 2.2
▹ Incorrect parent ID assignments were detected and corrected for several RefSeq transcripts
▹ CDS annotations introduced for novel isoforms in known protein-coding genes, where the isoform is start-codon and intron-chain compatible with at least one known open reading frame from GENCODE or RefSeq
▹ Coordinate overlaps in CDS entries that are parented by the same isoform have been removed by enforcing the same coordinates as the corresponding exons
▹ Missing RefSeq CDS assignments have been re-introduced
▹ Missing RefSeq and GENCODE attributes have been re-introduced
▸ 09/06/2018 - Release 2.1
▹ 551 genes removed based on more conservative criteria
▹ 551 transcripts removed
▹ 403 potentially protein-coding genes re-labeled as lncRNA
▹ 578 potentially protein-coding transcripts re-labeled as lncRNA
▹ Erroneous multi-exon transcripts removed from the annotation of the mitochondrial scaffold
▹ Standardized use of special characters in the attributes field
▹ Extended 3' and 5' ends of novel terminal exons by <=100 bp to accommodate known open reading frames
▹ Removed erroneous CDS annotations for non-coding genes
▸ 06/19/2018 - Escape Character Fix
▹ Special characters (= & ;) in the attributes field replaced with the corresponding URL escape codes in accordance with the GFF3 specification.
▸ 05/29/2018 - Bug fixes
▹ Decimal point removed from start and end coordinates of tRNA and rRNA features in chess2.1_assembly.gff.
▹ Corrections to version numbers in GFF files.
▸ 05/28/2018 - bioRxiv preprint
▹ A preprint of the CHESS manuscript is available on bioRxiv.
CHESS 1.0
The initial version of the CHESS annotation is available on our temporary GitHub page.
Citing CHESS
Mihaela Pertea, Alaina Shumate, Geo Pertea, Ales Varabyou, Florian P. Breitwieser, Yu-Chi Chang, Anil K. Madugundu, Akhilesh Pandey, and Steven L. Salzberg. "CHESS: a new human gene catalog curated from thousands of large-scale RNA sequencing experiments reveals extensive transcriptional noise." Genome Biology 2018, 19:208