The reference human genome currently used by most researchers is rather biased, since it was assembled using only a few individuals, with over 70% of the reference genome sequence coming from only one person. Thus the reference does not include existing genomic variants of the human population. The alignment protocols based on this single reference genome are often not able to align reads or produce incorrect alignments, especially when reads from regions of a source genome are quite different from the reference genome. These alignment challenges can introduce significant biases during downstream analyses such as during gene expression analysis, variant calling, and screening for genetic disorders.
To address this issue, HISAT2 has been developed for representing and searching a population of human genomes. Basically, given the reference genome as a backbone, known small variants have been incorporated into the reference genome using a graph. In the near future, we plan to add structural variants such as long insertions, deletions, and inversions. We also plan to add sequences not in the reference but from other individuals.
The figure above illustrates how variants are incorporated using a small example in which we have a 6 base long reference, G, A, G, C, T, and G. In the graph representation, bases are represented as nodes and their relationships are represented as edges. We have three variants, one with a single nucleotide polymorphism T instead of the second base A, another with a deletion of one base T between C and G, and one with a base insertion of A between T and G.
Here a path in the graph defines a string. For example, G -> A -> G -> C defines a string, GAGC. Strings are ordered lexicographically, like a standard English dictionary. For example, AGC comes before GTG, which comes before TGZ. A special symbol, Z, is used to indicate the end of the graph, to properly sort strings, and for some other purposes. A fast and memory efficient way is not available for searching the original graph representation. To overcome this challenge, the graph is converted into the one below, called a prefix-sorted graph using a method developed by Sirén et al.
This prefix-sorted graph is more appropriate for search and storage. The prefix-sorted graph is equivalent to the original one in the sense that they define the same set of strings. In a prefix-sorted graph, nodes are sorted in a way such that any strings from a node with a higher rank come before any strings from a node with a lower rank in a lexicographical order. For example, any string from the node ranked first, A, here, such as AGCTAGZ, AGCTGZ, comes before any strings from any other nodes. For this prefix-sorted graph, we have an equivalent table shown below, where we store two types of information. First, for outgoing edges, given node rankings 1 to 11, we store the label of each node according to the number of outgoing edges it has. Here I often refer to node rankings as node IDs. For example, node 1 has one outgoing edge, from A to G, so we store this node’s label A, once. Node 3 has three outgoing edges, so we store this node’s label C 3 times. Second, for incoming edges, given the node rankings, this time we store the labels of the preceding nodes. For example, node 1 has one incoming edge from the node labeled G, so we store G once. Node 5 has two incoming edges from nodes labeled A and T, so we store A and T accordingly.
Notice that although we do not directly store edges using node IDs, from the above table we can easily construct the edge information using a very important property of the table representation, called Last-First (LF) mapping. Last-First mapping is, given the first and last columns, the ith occurrence of a certain label in the last column corresponds to the ith occurrence of that label in the first column. For example, the edge between node 5 and node 3 is implicitly represented in the table using LF mapping. Node 3 has an incoming edge from the node labeled G. Notice this is the second occurrence of G in the last column, which corresponds to node 5 in the first column. All the edges of the graph are implicitly defined using the LF property, which leads to a substantial reduction in storing the table. We also use this LF property to align sequences as described in the Graph Alignment section.
The table representation can be further compacted and stored in a memory efficient way.
Graph FM index (GFM)
One thing to note is that in order to perform LF mapping, we need to know where the ith occurrence of a label is, which involves counting that label from the top of the table. This counting could be time consuming for a large reference such as the human genome. We can speed up the operations by breaking up the table into small blocks of only hundreds of rows. In addition, we store some numbers such as how many As appear up to a particular block, how many Cs appear up to a particular block, and so on. This way, instead of counting all of the labels from the beginning of the table, we just simply use the numbers stored in a block and need to count the number of labels within the block. We also optimized this local counting process. This indexing scheme is called a Graph FM index.
Hierarchical Graph FM index (HGFM)
To further improve both speed and accuracy, we employed the Hierarchical Indexing approach that we originally designed for HISAT to create a Hierarchical Graph FM index (HGFM). In addition to the global index for representing the general human population, we built numerous small indexes for genomic regions that collectively cover the reference genome and variants using this graph-based FM indexing scheme. This approach provides two main advantages: (1) it allows search on a local genomic region (~56,000 bp), which is particularly useful for aligning RNA-seq reads spanning multiple exons, and (2) it provides a much faster lookup compared to a search using the much bigger global index, due to the local index’s small size. In particular, these local indexes are so small that they can fit within the cache memory inside a CPU; cache is significantly faster than standard RAM. Our preliminary implementation of this new scheme uses just 6.2 GB for the index that represent the whole human genome and ~12 million common small variants, requires only 30~80% additional CPU time compared to HISAT, and obtains greater alignment accuracy for reads containing SNPs.