CHESS 2.0 is a comprehensive set of human genes based on nearly 10,000 RNA sequencing experiments produced by the GTEx project. It includes a total of 21,306 protein-coding genes and 18,484 lncRNA genes. Adding antisense and other RNA genes, release 2.0 of the database contains 43,162 genes and 323,827 transcripts. Of these transcripts, 267,478 represent protein-coding gene isoforms and the remaining 56,349 are noncoding RNAs.

CHESS contains virtually all genes from RefSeq (as of mid-2017) and GENCODE. It adds 1,178 protein-coding genes and 3,819 lncRNAs based on strong experimental and alignment evidence, as described in a forthcoming paper (see below).

CHESS 2.0 data

Content: CHESS gene annotation
Description: This file contains the primary gene set described in the CHESS paper, in GFF format. All genes and transcripts are mapped onto human genome release GRCh38.p8. Included in this file are genes on the reference chromosomes, unmapped scaffolds, assembly patches, and alternate loci.
Download: chess2.0.gff.gz (35 MB download, >1 GB uncompressed)

Content: CHESS gene list
Description: This file is a tab-delimited text table of all 43,162 genes in CHESS release 2.0, with one gene per line. For each gene it provides features such as gene ID, type, gene name, source of the annotation, location(s), GFF ID(s), and a free-text description of the gene.
Download: chess2.0.genes

Content: CHESS proteins
Description: This FASTA file contains the sequences of all the proteins translated from the CHESS protein-coding genes. For each gene locus that has more than one protein (e.g., splice variants), the longest protein sequence is provided.
Download: chess2.0.protein.fa.gz

Content: Gene annotation for transcriptome assembly
Description: This is a subset of the gene annotation GFF file (chess2.0.gff), containing annotations only on the reference chromosomes and the mitochondrion. It also includes the tRNA and rRNA gene annotations from RefSeq. We recommend using this file with transcriptome assemblers such as StringTie or Cufflinks.
Download: chess2.0_assembly.gff.gz (33 MB download)

Content: CHESS plus RefSeq gene annotations
Description: This is a superset of chess2.0.gff. It adds multiple other gene types annotated in RefSeq that are not included in CHESS, such as pseudogenes, V|J|D|C segments, snoRNAs, snRNAs, telomerase RNAs, guide RNAs, etc. Note that many of these elements (e.g., pseudogenes) are not actually genes, but they are included here for users who want everything in RefSeq plus the additional genes in CHESS.
Download: chess2.0_and_refseq.gff.gz (47 MB download)

Content: GRCh38 patch 8
Description: This is the version of the human reference genome assembly to which the coordinates in the CHESS annotation correspond.
Download: hg38_p8.fa.gz (947 MB download)
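
The GFF downloads listed above can be inspected without unpacking them fully. The following is a minimal Python sketch, assuming the files follow the standard nine-column, tab-separated GFF3 layout. The function names (iter_gff_records, count_feature_types, count_feature_types_in_file) are hypothetical, and the records and sequence names in the demo (chr1, scaffold_x) are made up for illustration; the actual sequence identifiers used in the CHESS files should be checked before filtering on them.

```python
import gzip
import io
from collections import Counter


def iter_gff_records(lines):
    """Yield the nine tab-separated columns of each GFF3 feature line."""
    for line in lines:
        # Skip blank lines and '#' directive/comment lines.
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) == 9:
            yield fields


def count_feature_types(lines, seqids=None):
    """Count features by type (column 3), optionally restricted to given sequences."""
    counts = Counter()
    for fields in iter_gff_records(lines):
        if seqids is None or fields[0] in seqids:
            counts[fields[2]] += 1
    return counts


def count_feature_types_in_file(path, seqids=None):
    """Stream a gzip-compressed GFF file, e.g. a local copy of chess2.0.gff.gz."""
    with gzip.open(path, "rt") as handle:
        return count_feature_types(handle, seqids)


# Made-up records for illustration only; these are not real CHESS entries.
demo = io.StringIO(
    "##gff-version 3\n"
    "chr1\tDEMO\tgene\t100\t900\t.\t+\t.\tID=g1\n"
    "chr1\tDEMO\ttranscript\t100\t900\t.\t+\t.\tID=t1;Parent=g1\n"
    "chr1\tDEMO\texon\t100\t300\t.\t+\t.\tParent=t1\n"
    "scaffold_x\tDEMO\tgene\t5\t50\t.\t-\t.\tID=g2\n"
)
demo_lines = demo.readlines()
all_counts = count_feature_types(demo_lines)
chr1_counts = count_feature_types(demo_lines, seqids={"chr1"})
print(dict(all_counts))
print(dict(chr1_counts))
```

For a downloaded file, a call such as count_feature_types_in_file("chess2.0.gff.gz") would read the archive line by line rather than loading it into memory, which is useful given that the uncompressed annotation exceeds 1 GB.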

A note about fusion genes

Fusion genes are genes created by a mutation that brings together two genes from two different chromosomes, or from normally non-adjacent regions of the same chromosome. These abnormal genes can be found in specialized fusion gene databases, but because they do not occur on a normal (non-mutated) genome, we have not included them in CHESS.


06/19/2018 - Escape Character Fix

▹ Special characters (= & ;) in the attributes field replaced with the corresponding URL escape codes in accordance with the GFF3 specification.
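
As an illustration of the escaping described in this entry, here is a minimal Python sketch. It is not the code used to produce the CHESS files; the helper names (escape_attribute_value, format_attributes, parse_attributes) and the sample attribute values are hypothetical. It handles only the three characters named above; the GFF3 specification also reserves a few others (such as the comma and the percent sign), which this sketch leaves untouched.

```python
from urllib.parse import unquote

# URL escape codes for the three characters named in the changelog entry.
GFF3_ESCAPES = {";": "%3B", "=": "%3D", "&": "%26"}


def escape_attribute_value(value):
    """Replace ; = & inside an attribute value with their URL escape codes."""
    return "".join(GFF3_ESCAPES.get(ch, ch) for ch in value)


def format_attributes(pairs):
    """Build a GFF3 column-9 string from (key, value) pairs."""
    return ";".join(f"{key}={escape_attribute_value(val)}" for key, val in pairs)


def parse_attributes(column9):
    """Split column 9 into a dict, decoding URL escapes in the values."""
    result = {}
    for item in column9.split(";"):
        if not item:
            continue
        key, _, val = item.partition("=")
        result[key] = unquote(val)
    return result


# Hypothetical attribute values, chosen only to exercise the escaping.
pairs = [("ID", "gene1"), ("description", "A=B; C&D")]
column9 = format_attributes(pairs)
print(column9)
print(parse_attributes(column9))
```

Escaping before joining keeps the ";" and "=" delimiters unambiguous, so a parser can split on them first and decode the values afterwards.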

05/29/2018 - Bug fixes

▹ Decimal point removed from start and end coordinates of tRNA and rRNA features in chess2.0_assembly.gff.

▹ Corrections to version numbers in GFF files.

05/28/2018 - bioRxiv preprint

▹ A preprint of the CHESS manuscript is available on bioRxiv.

CHESS 1.0

The initial version of the CHESS annotation is available on our temporary GitHub page.

Citing CHESS

A paper on CHESS is under review. This site will be updated when the paper is available.


The CHESS project is led by Mihaela Pertea and Steven Salzberg at Johns Hopkins University.