Lecture Title: Indexing Genomic Databases
Presented By: Travis Gagie, Universidad Diego Portales, Santiago, Chile
Location: Roseburgh Building, 164 College Street, Room 208
Abstract: Fifteen years after the first human genome was sequenced, the British government is building a database of a hundred thousand human genomes. Uncompressed, this database will take hundreds of terabytes, so only data structures that take advantage of the genetic similarity between humans will be able to handle it. There has been a sustained effort to scale up indexes based on the Burrows-Wheeler Transform (BWT), for two main reasons: such indexes play several important roles in bioinformatics, and run-length compressing the BWTs of genomic databases often reduces their sizes by orders of magnitude. In order for an index to report not just how many time a pattern occurs in the database but also where, however, it needs a suffix array (SA) sample in addition to a BWT. Until very recently, it was not known how to make an SA sample very small without also making it very slow. In this talk we will review what BWT-based indexes do, how they work, how they use standard SA samples, and why run-length compression works so well on them, and then see how to build a fast SA sample whose size is bounded in terms of the number of runs in the BWT. This is joint work with Gonzalo Navarro and Nicola Prezza, presented at SODA 2018.
For more information, visit the Theory Group Events Page.