Predicting where you're from by how you write - and what that means for linguistics

CSAIL researchers recently developed an algorithm that can probabilistically estimate the native language of someone writing in English - and, with it, potentially help categorize the structures of foreign languages without having to read a single word from them.

The system, which combed through more than 1,000 English-language essays written by native speakers of 14 different languages, first analyzes the parts of speech of the words and the relationships between them, and then looks for patterns that correlate with the writers’ native languages.

The result is that the algorithm can “read” an essay and create a profile of likely native languages (e.g., a “55 percent chance of being Russian”).

Unexpectedly, these “probability estimates” also provide a quantitative measure of how closely related any two languages are, in terms of syntactic patterns like subject-verb order and negation formation.
In other words, if the algorithm predicts that an essay has a 51 percent chance of having been written by a native Russian speaker, a 33 percent chance of having been written by a native Polish speaker, and a 16 percent chance of having been written by a native Japanese speaker, it can also conclude that Russian speakers’ syntactic patterns are more similar to those of Polish speakers than to those of Japanese speakers.
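
As a rough illustration of that reasoning, the sketch below turns probability profiles into a similarity score. It is not the researchers' method: only the Russian row reuses the numbers from the example above, the other rows are invented, and the names `profiles`, `similarity` and `most_similar` are hypothetical. It assumes that averaging the two cross-probabilities is a reasonable way to make the score symmetric.

```python
# Hypothetical average probability profiles. For essays written by native
# speakers of the row language, each entry is the mean probability the
# classifier assigns to a candidate language. Only the Russian row comes
# from the article's example; the Polish and Japanese rows are made up.
profiles = {
    "Russian":  {"Russian": 0.51, "Polish": 0.33, "Japanese": 0.16},
    "Polish":   {"Russian": 0.30, "Polish": 0.55, "Japanese": 0.15},
    "Japanese": {"Russian": 0.12, "Polish": 0.13, "Japanese": 0.75},
}


def similarity(a, b):
    """Symmetric score: the average of the two cross-probabilities."""
    return (profiles[a][b] + profiles[b][a]) / 2


def most_similar(lang):
    """Return the other language whose profile is closest to `lang`."""
    others = [name for name in profiles if name != lang]
    return max(others, key=lambda other: similarity(lang, other))


print(most_similar("Russian"))
```

With these made-up numbers, Russian scores higher against Polish than against Japanese, which mirrors the conclusion drawn in the example.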

When researchers used the measure to create a family tree of the 14 languages, it was almost identical to a family tree generated from data manually amassed by linguists. The nine languages that are in the Indo-European family, for instance, were clearly distinct from the five that aren’t.
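
One common way to build a tree from pairwise similarity scores is agglomerative clustering: repeatedly merge the two most similar groups. The sketch below shows that idea only. The article does not say which clustering procedure the team used, the scores in `sim` are invented for illustration, and `build_tree`, `leaves` and `avg_similarity` are hypothetical names.

```python
from itertools import combinations

# Invented pairwise similarity scores (higher means more alike).
# These are illustrative values, not results from the study.
sim = {
    frozenset(("Russian", "Polish")): 0.32,
    frozenset(("Russian", "Japanese")): 0.14,
    frozenset(("Polish", "Japanese")): 0.14,
    frozenset(("Russian", "Korean")): 0.10,
    frozenset(("Polish", "Korean")): 0.11,
    frozenset(("Japanese", "Korean")): 0.30,
}


def leaves(tree):
    """List the languages under a node (a string or a nested pair)."""
    if isinstance(tree, str):
        return [tree]
    return leaves(tree[0]) + leaves(tree[1])


def avg_similarity(a, b):
    """Average-linkage score: mean similarity over all cross pairs."""
    pairs = [(x, y) for x in leaves(a) for y in leaves(b)]
    return sum(sim[frozenset(p)] for p in pairs) / len(pairs)


def build_tree(languages):
    """Merge the two most similar clusters until one tree remains."""
    clusters = list(languages)
    while len(clusters) > 1:
        a, b = max(combinations(clusters, 2),
                   key=lambda pair: avg_similarity(*pair))
        clusters.remove(a)
        clusters.remove(b)
        clusters.append((a, b))
    return clusters[0]


tree = build_tree(["Russian", "Polish", "Japanese", "Korean"])
print(tree)
```

A tree built this way can then be compared, branch by branch, against one drawn from linguists' manually collected data.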

Such findings may allow linguists to predict typological features of languages about which little or nothing is known. That would be a huge boon for filling in the gaps in databases like the World Atlas of Language Structures (WALS), which has missing entries for dozens of languages.

The team - which includes CSAIL principal research scientist Boris Katz, graduate student Yevgeni Berzak, and former postdoc Roi Reichart (now at Technion) - recently demonstrated 72 percent accuracy in predicting a language’s typological features based solely on the similarity scores produced by the algorithm.
