Machine Learning Translates Lost Languages

Machine Learning Translates Lost Languages

MIT CSAIL developed a system. Thus, it helps linguists to decipher languages that have been lost to history.

Recent research suggested that most language that has ever existed are no longer living. People do not talk any more about those dead languages. Thus, we do not know enough about their syntax, grammar, or vocabulary. So, we are not able to understand their texts.

Lost languages are more than academic curiosity. It is because without them we miss an entire body of knowledge about the people who were speaking them. Most of them, unfortunately, have minimal records. Thus, scientists can not decipher them by using machine-translation algorithms like Google Translate. Some of those languages often lack traditional dividers like punctuation and white space, and they do not have a well-researched ‘relative’ language to be compared to.

However, recently, researchers of MIT’s CSAIL (Computer Science and Artificial Intelligence Laboratory) made significant development in that area. A new system can automatically decipher a lost language, without the need for advance knowledge of its relation to other languages. Also, the system is able to determine relationships between languages. They used the system to corroborate recent scholarship and suggested that the language of Iberian is not related to Basque.

Dead Languages

MIT Professor Regina Barzilay spearheads the system. The system relies on several principles.

Those principles are insights from historical linguistics. Such wisdom is the fact that languages generally evolve in specific, predictable ways. For example, certain sound substitutions are likely to occur, while a given language rarely adds or deletes an entire sound. A with a ‘p’ in the parent language may become a ‘b’ in the descendant language. Nevertheless, transforming to a ‘k’ is less likely because of the significant pronunciation gap.

MIT Ph.D. student Jiaming Luo and Barzilay, by incorporating those and other linguistic constraints, developed a decipherment algorithm. That algorithm can handle the scarcity of a guiding signal in the input and the vast space of possible transformations. The algorithm learns to implant language sounds into a multidimensional space. In that space differences in pronunciation are in the distance between corresponding sectors. That design enables them to capture pertinent patterns of language.

Moreover, it enables them to express them as computational constraints. The model can segment words in an ancient language. Furthermore, it is able to map them to counterparts in a related language.

It is a pretty interesting invention. If the system will work appropriately it will help the scientists that are studying dead languages.