A Theory of Unsupervised Translation Motivated by Understanding Whale Communication


Adam Tauman Kalai
Microsoft Research
Recent years have seen breakthroughs in language models that capture nuances of language, culture, and knowledge. Neural networks can translate between languages, in some cases even between two languages for which there is little or no access to parallel translations, a setting known as Unsupervised Machine Translation (UMT). Given this progress, it is intriguing to ask whether machine learning tools can ultimately enable us to understand animal communication, particularly that of highly intelligent animals. Our work is motivated by an ambitious interdisciplinary initiative, Project CETI, which is collecting a large corpus of sperm whale communications for machine analysis.

We propose a theoretical framework for analyzing UMT when no parallel data are available and when it cannot be assumed that the source and target corpora address related subject domains or possess similar linguistic structure. Our analysis suggests that *more* complex languages may in fact be translated with *greater* accuracy using UMT, provided that there is common ground in what is being communicated. This is especially relevant for the possibly complex communication of sperm whales, mammals that, we argue, share common ground with humans. We also prove upper bounds on the amount of source-language data required in the unsupervised setting as a function of the amount required in a hypothetical supervised setting. These bounds suggest that the amount of source data required for unsupervised translation is comparable to that required in the supervised setting. Our analysis is purely information-theoretic and raises interesting questions regarding efficient algorithms.

Joint work with Shafi Goldwasser, David F. Gruber, and Orr Paradise