
HISAT-genotype and its website are an online collaborative hub where researchers around the globe can work together to develop a practical and accurate platform that will eventually be capable of analyzing an individual human genome, with its more than 20,000 genes, within a few hours on a personal computer. The major hurdle in developing such a platform is the lack of a centralized database for the numerous genomic variants found in human populations. Instead, each database has its own data format and naming conventions. Thus, we need a common base for representing such diverse data types, upon which we can develop algorithms. There are two main parts of the HISAT-genotype platform where researchers with domain knowledge can contribute: (i) parsing external databases for human genes or genomic regions and (ii) customizing and generating output in formats relevant to researchers and clinicians.

Platform Overall1.1.png

Platform Overall2.png


Parsing External Databases

The arrows labeled 1 in the figure above mark the places where scripts are needed to parse and translate data from external database sources into our own internal database, hisatgenotype_db. For example, scripts are already available for parsing the 13 DNA fingerprinting loci and the CYP gene family. Please refer to the Genome Analysis section for setup instructions.

For the 13 DNA fingerprinting loci,

 genome-analysis$ hisat-genotype-top/hisatgenotype_modules/

For the CYP gene family,

 genome-analysis$ hisat-genotype-top/hisatgenotype_modules/
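
To illustrate what "translating into a common base" can look like, here is a minimal Python sketch that parses two differently formatted variant records into one shared representation. Both input formats, the `Variant` class, and the function names are hypothetical and chosen only for illustration; they are not the actual layout of hisatgenotype_db or of any particular external database.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Variant:
    """Hypothetical common record; not the real hisatgenotype_db schema."""
    locus: str
    position: int  # 0-based offset within the locus
    ref: str
    alt: str


def parse_colon_style(line):
    """Parse a made-up 1-based format such as 'CYP2D6:100:C>T'."""
    locus, pos, change = line.strip().split(":")
    ref, alt = change.split(">")
    return Variant(locus, int(pos) - 1, ref, alt)


def parse_tab_style(line):
    """Parse a made-up 0-based format such as 'CYP2D6<TAB>99<TAB>C<TAB>T'."""
    locus, pos, ref, alt = line.rstrip("\n").split("\t")
    return Variant(locus, int(pos), ref, alt)


if __name__ == "__main__":
    a = parse_colon_style("CYP2D6:100:C>T")
    b = parse_tab_style("CYP2D6\t99\tC\tT")
    # Once normalized, records from both sources compare equal.
    print(a == b)
```

The point of the sketch is that naming and coordinate conventions are reconciled once, in the parser, so that downstream algorithms only ever see one record type.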

Extracting SNPs, Haplotypes, etc.

These extractions are represented by the arrows labeled 2 in the above figure.

For the HLA gene family,

 genome-analysis$ --base hla --min-var-freq 0.1

For the 13 DNA fingerprinting loci,

 genome-analysis$ --base codis --whole-haplotype --leftshift

For the CYP gene family,

 genome-analysis$ --base cyp
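
The options shown above are not explained on this page. Assuming that `--min-var-freq` keeps only variants at or above a given population frequency and that `--leftshift` refers to the conventional left-alignment of indels, the two ideas can be sketched in Python as follows. The function names and the dictionary layout are hypothetical; this is not the code the HISAT-genotype scripts run.

```python
def filter_by_frequency(variants, min_freq):
    """Keep variants whose 'freq' field is at least min_freq (assumed meaning)."""
    return [v for v in variants if v["freq"] >= min_freq]


def left_shift_deletion(seq, pos, length):
    """Move a deletion of `length` bases starting at 0-based `pos` as far
    left as possible without changing the sequence left after deletion."""
    # Sliding left by one base is allowed when the base entering the deleted
    # window on the left equals the base leaving it on the right.
    while pos > 0 and seq[pos - 1] == seq[pos + length - 1]:
        pos -= 1
    return pos


if __name__ == "__main__":
    variants = [{"id": "v1", "freq": 0.25}, {"id": "v2", "freq": 0.02}]
    print(filter_by_frequency(variants, 0.1))
    # Deleting any single T in the run TTTT gives the same result,
    # so the deletion is reported at the leftmost T.
    print(left_shift_deletion("ACGTTTTA", 6, 1))
```

Left-shifting matters in repetitive regions such as the DNA fingerprinting loci, where the same indel can otherwise be written at several equivalent positions.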

Building and Indexing a Genotype Genome

Arrow 3 refers to building and indexing a Genotype genome. A Genotype genome is a graph genome that is specifically designed to aid in carrying out genotyping. In addition to variants and haplotypes, the genotype genome includes some additional sequences inside the backbone sequence shown in yellow, resulting in substantial differences in coordinates with respect to the human reference genome. Thus, please be aware that the genotype genome should not be used for purposes other than genotyping analysis.

 genome-analysis$ -p 4 --base genotype_genome --database-list hla,codis,cyp --commonvar

Please also refer to Building a graph reference.
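
The coordinate shift mentioned above can be made concrete with a small Python sketch. It assumes a hypothetical list of `(reference_position, inserted_length)` pairs describing where extra sequence was added to the backbone; the actual genotype genome index stores this information in its own way, which this page does not describe.

```python
from bisect import bisect_right
from itertools import accumulate


def build_offset_table(insertions):
    """Prepare sorted insertion positions and their cumulative lengths.

    `insertions` is a list of (ref_pos, length) pairs, meaning that `length`
    extra bases were placed immediately before reference base `ref_pos`.
    """
    ordered = sorted(insertions)
    positions = [pos for pos, _ in ordered]
    cumulative = list(accumulate(length for _, length in ordered))
    return positions, cumulative


def ref_to_genotype(pos, table):
    """Map a 0-based reference coordinate to the expanded coordinate."""
    positions, cumulative = table
    # Count insertions located at or before `pos`; each one shifts it right.
    i = bisect_right(positions, pos)
    return pos + (cumulative[i - 1] if i else 0)


if __name__ == "__main__":
    table = build_offset_table([(100, 50), (1000, 200)])
    for ref_pos in (99, 100, 999, 1000):
        print(ref_pos, "->", ref_to_genotype(ref_pos, table))
```

Because every position downstream of an inserted sequence is shifted, coordinates reported against the genotype genome cannot be compared directly with coordinates on the standard human reference.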

Performing a Genome-wide Analysis

HISAT-genotype is based on HISAT2 (indicated by arrow 4), a novel method for representing and searching a significantly expanded model of the human reference genome as a graph, in which a comprehensive catalog of known genomic variants and haplotypes is incorporated into the data structure used for searching and alignment. This new way of representing a population of genomes, along with a very fast and memory-efficient search capability, enables more detailed and accurate variant analyses than previous methods.

 genome-analysis$ -p 4 --base genotype_genome -1 1.fq.gz -2 2.fq.gz

Customizing Output

Starting from the intermediate output, we, and especially contributors with domain knowledge, want to write Python scripts that generate output in formats (arrows labeled 5) relevant to researchers and clinicians.
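
As a rough idea of what such an output script might look like, the Python sketch below turns a set of allele calls into a tab-separated report. The structure of `calls` (gene name mapped to a list of allele and abundance pairs), the `min_abundance` cutoff, and the column names are all assumptions made for illustration; the real intermediate output of HISAT-genotype is not described on this page.

```python
import csv
import io


def write_report(calls, min_abundance=0.1):
    """Render hypothetical allele calls as a tab-separated report string."""
    buf = io.StringIO()
    writer = csv.writer(buf, delimiter="\t", lineterminator="\n")
    writer.writerow(["gene", "allele", "abundance"])
    for gene in sorted(calls):
        # List the most abundant alleles first and drop low-abundance ones.
        ranked = sorted(calls[gene], key=lambda item: -item[1])
        for allele, abundance in ranked:
            if abundance >= min_abundance:
                writer.writerow([gene, allele, f"{abundance:.2f}"])
    return buf.getvalue()


if __name__ == "__main__":
    example = {
        "HLA-A": [("A*02:01", 0.45), ("A*01:01", 0.50), ("A*03:01", 0.05)],
    }
    print(write_report(example), end="")
```

A clinician-facing format would likely need different columns and thresholds; keeping the formatting in one small function makes such variants easy to add.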