Facebook's TransCoder Converts Between Programming Languages

Facebook researchers say they have developed a neural transcompiler called TransCoder, a system that translates code from one high-level programming language, such as Java, Python, or C++, into another. The system is unsupervised, meaning it searches for previously undetected patterns in unlabeled data sets with minimal human supervision. Reportedly, it outperforms rule-based baselines by a significant margin.

Migrating an existing codebase to a modern or more efficient language such as Java or C++ requires expertise in both the source and target languages, and it is often very costly. The Commonwealth Bank of Australia, for instance, spent around $750 million over five years to convert its platform from COBOL to Java. In theory, transcompilers can help: they eliminate the need to rewrite code from scratch.

In practice, however, transcompilers are challenging to build, because different languages have different syntax and rely on distinct standard-library functions, variable types, and platform APIs.

Facebook’s new system tackles the challenge with an unsupervised learning approach. TransCoder can translate between Python, C++, and Java. It is first initialized with cross-lingual language model pretraining, which maps pieces of code expressing the same instructions to the same representation regardless of the original language. A process called denoising auto-encoding then trains the system to generate valid sequences even when fed noisy input data, and back-translation lets TransCoder generate parallel data that can be used for training.
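
To make the denoising objective concrete, here is a minimal Python sketch of the kind of corruption such an objective applies before asking the model to reconstruct the original code. The noise types (dropping, masking, light shuffling) follow common practice for denoising auto-encoding; the probabilities and the "<MASK>" symbol are illustrative assumptions, not TransCoder's published settings.

```python
import random

def corrupt(tokens, drop_prob=0.1, mask_prob=0.1, shuffle_dist=3):
    """Apply illustrative denoising-auto-encoding noise to a token list:
    randomly drop tokens, mask others, and locally shuffle the rest."""
    # Randomly drop some tokens.
    kept = [t for t in tokens if random.random() > drop_prob]
    # Randomly replace some surviving tokens with a mask symbol.
    noised = [t if random.random() > mask_prob else "<MASK>" for t in kept]
    # Lightly shuffle: sort by position plus a small random offset.
    keys = [i + random.uniform(0, shuffle_dist) for i in range(len(noised))]
    return [t for _, t in sorted(zip(keys, noised), key=lambda p: p[0])]

# The denoising objective trains the model to recover `src` from
# corrupt(src), teaching it to emit valid code despite noisy input.
src = ["def", "f", "(", "x", ")", ":", "return", "x", "+", "1"]
print(corrupt(src))
```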

Facebook’s TransCoder

TransCoder’s cross-lingual nature arises from the many tokens that are common across programming languages: keywords such as “try,” “if,” “while,” and “for,” along with mathematical operators, digits, and English strings that appear in source code. Back-translation improves the system’s translation quality by coupling a source-to-target model with a backward target-to-source model trained in parallel. The target-to-source model translates target sequences into the source language, producing noisy source sequences, while the source-to-target model learns to reconstruct the target sequences from those noisy sources, and the two models are trained until they converge.
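
The back-translation loop described above can be sketched as follows. The two models and their translate()/train_step() interfaces are hypothetical stand-ins for illustration (TransCoder actually shares one model across all three languages); the sketch only shows how each direction generates synthetic parallel data for the other.

```python
def back_translation_round(src_to_tgt, tgt_to_src, src_batch, tgt_batch):
    """One illustrative round of back-translation with two hypothetical
    seq2seq models exposing translate() and train_step()."""
    # The backward model translates target code into the source language,
    # producing noisy source sequences.
    noisy_src = tgt_to_src.translate(tgt_batch)
    # The forward model learns to reconstruct the original targets from
    # those noisy sources: (noisy_src, tgt_batch) is a synthetic pair.
    src_to_tgt.train_step(inputs=noisy_src, targets=tgt_batch)

    # The same trick in the opposite direction keeps both models
    # improving each other until they converge.
    noisy_tgt = src_to_tgt.translate(src_batch)
    tgt_to_src.train_step(inputs=noisy_tgt, targets=src_batch)
```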

Researchers from Facebook trained TransCoder on a public GitHub corpus containing over 2.8 million open source repositories, targeting translation at the function level. They trained the cross-lingual language model on all of the available source code, while the denoising auto-encoding and back-translation components were trained only on functions. Training alternated between the components, with batches of around 6,000 tokens.
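
As a rough illustration of batching by token count, the sketch below groups tokenized functions into batches of roughly 6,000 tokens; the exact batching logic is an assumption, not taken from the paper.

```python
def token_batches(functions, max_tokens=6000):
    """Group tokenized functions into batches of about max_tokens tokens
    (an assumed batching scheme, for illustration only)."""
    batch, size = [], 0
    for fn in functions:
        # Start a new batch once adding this function would overflow.
        if batch and size + len(fn) > max_tokens:
            yield batch
            batch, size = [], 0
        batch.append(fn)
        size += len(fn)
    if batch:
        yield batch

# Training would then alternate objectives across successive batches,
# e.g. denoising auto-encoding on one batch, back-translation on the next.
```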

To evaluate TransCoder’s performance, the researchers extracted 852 parallel functions in Python, C++, and Java from GeeksforGeeks, an online platform that gathers coding problems and presents solutions in several programming languages. Using these, they developed a new metric, computational accuracy, which tests whether hypothesis functions generate the same outputs as a reference when given the same inputs.
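
The idea behind the metric can be sketched in a few lines of Python: a translated function counts as correct if it produces the same outputs as the reference on a shared set of test inputs. The helper name and the callable-based interface here are illustrative assumptions, not the paper's harness.

```python
def computationally_equivalent(reference, hypothesis, test_inputs):
    """Return True if both callables produce the same output on every input."""
    for args in test_inputs:
        try:
            if reference(*args) != hypothesis(*args):
                return False
        except Exception:
            return False  # a crash in the translation counts as a failure
    return True

# Example: two syntactically different functions that are computationally
# equivalent and would therefore count as a correct translation.
reference = lambda x: x * x
hypothesis = lambda x: x ** 2
print(computationally_equivalent(reference, hypothesis, [(0,), (3,), (-5,)]))  # True
```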

Nevertheless, the best-performing version of TransCoder did not generate many functions strictly identical to the references, yet its translations still achieved high computational accuracy. The researchers attribute this to the incorporation of beam search, a decoding method that maintains a set of partially decoded sequences.
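
A minimal, generic beam-search sketch is shown below; the step_fn interface and the hyperparameters are assumptions for illustration, not TransCoder's actual decoder.

```python
import heapq

def beam_search(step_fn, start_token, end_token, beam_width=5, max_len=50):
    """Keep the `beam_width` best-scoring partial sequences at each step.
    `step_fn(seq)` is assumed to return (next_token, log_prob) candidates."""
    beams = [(0.0, [start_token])]  # (cumulative log-probability, tokens)
    finished = []
    for _ in range(max_len):
        candidates = []
        for score, seq in beams:
            if seq[-1] == end_token:        # completed hypotheses are set aside
                finished.append((score, seq))
                continue
            for tok, logp in step_fn(seq):  # extend each partial sequence
                candidates.append((score + logp, seq + [tok]))
        if not candidates:                  # every beam has finished
            break
        beams = heapq.nlargest(beam_width, candidates, key=lambda c: c[0])
    finished.extend(b for b in beams if b[1][-1] == end_token)
    return max(finished or beams, key=lambda c: c[0])[1]
```

Keeping several candidate decodings alive at once lets the system prefer a translation that is correct but worded differently from the reference, which is consistent with the gap between strict identity and computational accuracy noted above.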