June 10 '15

CSAIL team proves 40-year-old algorithm can't be improved upon

Researchers have determined that the long-established edit distance algorithm - celebrating its 40th birthday this year - cannot be improved.

Comparing the genomes of different species — or different members of the same species — is the basis of a great deal of modern biology. DNA sequences that are conserved across species are likely to be functionally important, while variations between members of the same species can indicate different susceptibilities to disease.

The basic algorithm for determining how much two sequences of symbols have in common — the “edit distance” between them — is now more than 40 years old. And for more than 40 years, computer science researchers have been trying to improve upon it, without much success.

At a conference next week CSAIL researchers will report that, in all likelihood, that’s because the algorithm is as good as it gets. If a widely held assumption about computational complexity is correct, then the problem of measuring the difference between two genomes — or texts, or speech samples, or anything else that can be represented as a string of symbols — can’t be solved more efficiently.

In a sense, that’s disappointing, since a computer running the existing algorithm would take 1,000 years to exhaustively compare two human genomes. But it also means that computer scientists can stop agonizing about whether they can do better.

“This edit distance is something that I’ve been trying to get better algorithms for since I was a graduate student, in the mid-’90s,” says Piotr Indyk, a professor of computer science and engineering at MIT and a co-author of the new paper being presented at the ACM Symposium on Theory of Computing (STOC). “I certainly spent lots of late nights on that — without any progress whatsoever. So at least now there’s a feeling of closure. The problem can be put to sleep.”