A language for bioinformatics


With the vast growth of next-generation sequencing data, it’s easy to forget that in 1869 Friedrich Miescher isolated DNA for the first time using cells from nearby hospital bandages. Computational genomics has since ushered in a new era of precision medicine, helping find clinically relevant mutations and potential diagnostics for asthma, and enabling personalized treatment.

However, as sequencing technologies evolve and our understanding of biological phenomena expands, new types of sequencing data call for updated analysis techniques. That means software that is not only computationally efficient, but also quick to develop and easy to maintain.

Scientists from MIT’s Computer Science and Artificial Intelligence Laboratory (CSAIL) recently came up with “Seq” (short for sequence), a high-performance, Python-based, compiled programming language for bioinformatics and genomics. Seq uses the syntax of Python, so users don’t need to learn an entirely new language or be expert programmers or software engineers.
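For a sense of what Python-style genomics code looks like, consider the minimal sketch below. It is plain Python, not Seq itself, and the function names are illustrative assumptions rather than anything taken from the Seq paper or its libraries. It computes a reverse complement and counts k-mers (length-k substrings) in a DNA read, two operations common in sequence analysis.

```python
from collections import Counter

# Translation table mapping each DNA base to its complement.
# (Hypothetical helper for illustration; plain Python, not a Seq API.)
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def count_kmers(seq: str, k: int) -> Counter:
    """Count every length-k substring (k-mer) of seq."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


if __name__ == "__main__":
    read = "ACGTACGTTG"
    print(reverse_complement(read))
    print(count_kmers(read, 3).most_common(2))
```

Code of this shape is easy to write and read, but in standard interpreted Python it tends to be slow on large datasets, which is the gap a compiled, Python-syntax language aims to close.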

“We implemented a number of bioinformatics algorithms and tools in Seq to show that, with just a fraction of the code, we are able to substantially outperform widely-used, hand-optimized versions, usually written in low-level languages like C or C++,” says MIT CSAIL PhD student Ariya Shajii, the lead author on a paper about Seq. “Seq uses a number of genomics-specific language features, libraries, and compiler optimizations to make all of this possible.”

High-performance genomics software is typically application-specific and difficult to understand and maintain. Writing it requires a delicate balance of performance engineering, computational modeling, and the ability to translate biological assumptions into algorithmic and software optimizations.

Usually, a user has to choose between a software ecosystem that allows for rapid development at the expense of performance and scalability, and low-level languages that deliver higher performance but are harder to develop in and maintain.

To demonstrate Seq’s versatility, the team reimplemented eight popular genomics tools in Seq, spanning key tasks in the genomics analysis pipeline. They found significant runtime and memory improvements over the original, hand-optimized C or C++ implementations, all while using less code and matching the originals’ accuracy exactly.

“Seq enables users to write high-level, Python-type code without worrying about low-level implementation details or optimizations, which the compiler handles internally,” says Shajii. “By offering biologists, bioinformaticians and other researchers a scalable way to prototype, experiment and analyze large biological datasets through a familiar, high-performance language, we hope that Seq will act as a catalyst for scientific discovery and innovation.”

Shajii developed Seq and wrote the paper, featured in Nature Biotechnology this month, alongside University of Victoria professor and former CSAIL postdoc Ibrahim Numanagić, Harvard Medical School research fellow Alexander T. Leighton, University of Victoria graduate student Haley Greenyer, and MIT professors Saman Amarasinghe and Bonnie Berger.