The Genome Query Language and its Implementation
Speaker: Christos Kozanitis, University of California, San Diego
Date: Tuesday, February 26, 2013
Time: 4:00PM to 5:00PM
Refreshments: 3:45PM
Location: 32-G449 (Patil/Kiva)
Host: Samuel Madden, CSAIL
Contact: Sheila Marian, x3-1996, sheila@csail.mit.edu
Abstract: With high-throughput DNA sequencing costs dropping below $1,000 for human genomes, data analysis is becoming a major bottleneck in biological studies. My work advocates a clean separation between evidence collection and inference in variant calling.
I will start by describing the Genome Query Language (GQL), which allows inference layers to efficiently gather only the relevant evidence from the raw data. I will give examples to demonstrate how GQL can replace the programming effort required for complex evidence collection. For example, one can use GQL to query for large structural variations using only 5-10 lines of high-level code that takes less than 10 minutes to execute on an input BAM file of 75 GB. I will demonstrate that popular variant callers, such as BreakDancer, can achieve a speedup of up to 8x by using evidence gathered by GQL. Further, I will show how GQL query results can be visualized using the UCSC browser, allowing what I call *semantic* browsing, as opposed to the syntactic browsing of genomes by location that is the standard today.
I will also describe the implementation of GQL and five optimizations (including cached parsing, lazy joins, and materialized views) that we used to speed up query processing by a factor of 1000. Our results were obtained using a cheap desktop computer, suggesting that simple parallelization can allow GQL queries to be served in the cloud within a matter of seconds.
Bio: Christos Kozanitis is a PhD candidate at the University of California, San Diego, working with Vineet Bafna and George Varghese. Prior to GQL, he designed and implemented the SlimGene system for compressing genomes; a variant of the ideas introduced by SlimGene is being incorporated into Illumina's pipeline. He also designed and implemented the Kangaroo system for flexible header parsing, aspects of which have influenced current networking chipsets from Cisco and TI.