Guillaume Lample, Myle Ott, Alexis Conneau, Ludovic Denoyer, Marc’Aurelio Ranzato
Data scarcity is among the main challenges for training a usable Neural Machine Translation (NMT) model. Despite the progress made for high-resource language pairs (such as English–German), most languages lack the parallel data needed to train an NMT system. As the first in my series of paper reviews, this post summarizes “Phrase-Based & Neural Unsupervised Machine Translation”, to be presented at EMNLP 2018. The authors propose two model variants: i) phrase-based (PBSMT) and ii) neural (NMT).
Both the phrase-based and neural variants rely on three core principles (#P1, #P2, and #P3), and consequently outperform state-of-the-art approaches on unsupervised translation.
The illustration aims to visualize the idea behind the three principles:
– A) shows the distributions of the two monolingual datasets (see the legend).
– B) Initialization: the two distributions are roughly aligned, e.g., with a mechanism like word-by-word translation.
– C) Language Modeling (LM): a language model is learned for each domain; the LMs are then used to denoise examples.
– D) Back-translation: a source→target inference step is followeded by a target→source inference to reconstruct the examples in the original language. The same procedure is applied in the reverse/dual translation direction, providing the feedback signals for optimizing both the target→source and source→target models. A minimal sketch of these two ingredients follows the list.
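To make principles C and D more concrete, here is a minimal, illustrative Python sketch (not the authors' code) of the two ingredients: a noise function of the kind used for the denoising objective, and the construction of synthetic parallel pairs for back-translation. The function names, hyper-parameters, and the toy word-by-word translator are assumptions for illustration only.

```python
import random

def add_noise(tokens, drop_prob=0.1, max_shuffle_dist=3):
    """Corrupt a sentence for the denoising objective (principle C):
    randomly drop words and locally shuffle the rest. Hyper-parameters
    here are illustrative, not the paper's exact settings."""
    # Word dropout
    kept = [t for t in tokens if random.random() > drop_prob]
    if not kept:
        kept = tokens[:1]
    # Local shuffle: each word moves by at most ~max_shuffle_dist positions
    keys = [i + random.uniform(0, max_shuffle_dist) for i in range(len(kept))]
    return [t for _, t in sorted(zip(keys, kept))]

def back_translation_pairs(monolingual_tgt, translate_tgt_to_src):
    """Principle D: translate target-side monolingual sentences into the
    source language with the current t->s model; the resulting
    (synthetic source, original target) pairs supervise the s->t model."""
    return [(translate_tgt_to_src(t), t) for t in monolingual_tgt]

if __name__ == "__main__":
    sent = "the cat sat on the mat".split()
    print(add_noise(sent))
    # Toy word-by-word translator standing in for the current t->s model.
    toy_dict = {"le": "the", "chat": "cat", "dort": "sleeps"}
    translate = lambda toks: [toy_dict.get(w, w) for w in toks]
    print(back_translation_pairs([["le", "chat", "dort"]], translate))
```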
The aim of initialization is to learn first-level (e.g., word-by-word) translations.
In the neural case: i) jointly learn a BPE model, ii) apply BPE to both corpora, and iii) learn token embeddings to initialize the encoder-decoder lookup tables (a sketch of these sub-steps follows the note below). For the phrase-based variant, the initial phrase tables are populated using a bilingual dictionary built from the monolingual data.
Note: if the languages are distant, learning a bilingual dictionary might also be required in the neural scenario.
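As a hedged sketch of the three neural initialization sub-steps, the pipeline could look roughly like the following, using the sentencepiece and fasttext Python packages as stand-ins for the tooling (not necessarily what the authors used); file names and hyper-parameters are placeholders.

```python
import sentencepiece as spm
import fasttext

# i) Learn a single BPE model jointly on the concatenated source and
#    target monolingual corpora so both languages share one vocabulary.
spm.SentencePieceTrainer.train(
    input="mono.src.txt,mono.tgt.txt",   # hypothetical corpus files
    model_prefix="joint_bpe",
    vocab_size=32000,
    model_type="bpe",
)

# ii) Apply the joint BPE to both corpora.
sp = spm.SentencePieceProcessor(model_file="joint_bpe.model")
with open("mono.all.bpe.txt", "w") as out:
    for path in ("mono.src.txt", "mono.tgt.txt"):
        with open(path) as f:
            for line in f:
                out.write(" ".join(sp.encode(line.strip(), out_type=str)) + "\n")

# iii) Learn token embeddings on the BPE-segmented text; these vectors
#      initialize the shared encoder/decoder lookup tables.
emb = fasttext.train_unsupervised("mono.all.bpe.txt", model="skipgram", dim=512)
emb.save_model("joint_bpe.vec.bin")
```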
Integrating the above three principles, the neural and phrase-based algorithms are given, where S and T represent the source and target monolingual examples, language models are trained on the source and target monolingual data, and the source→target and target→source translation models are denoted Ps→t and Pt→s.
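The overall structure of the neural algorithm, as described above, could be sketched in Python as follows. The shared `model` object and its `train_step`/`translate` methods are hypothetical placeholders, not a real library API; Ps→t corresponds to translating with `out_lang="tgt"` and Pt→s to `out_lang="src"`.

```python
def unsupervised_nmt(src_mono, tgt_mono, model, add_noise, n_iters=10):
    """Sketch of the training loop combining the three principles,
    assuming initialization (#P1) has already been applied to `model`."""
    for _ in range(n_iters):
        # #P2 Language modeling via denoising auto-encoding in each language.
        for s in src_mono:
            model.train_step(inp=add_noise(s), out=s, out_lang="src")
        for t in tgt_mono:
            model.train_step(inp=add_noise(t), out=t, out_lang="tgt")
        # #P3 Back-translation: generate synthetic pairs with the current
        # model in one direction, train the opposite direction on them.
        for t in tgt_mono:
            synthetic_src = model.translate(t, out_lang="src")    # t -> s
            model.train_step(inp=synthetic_src, out=t, out_lang="tgt")
        for s in src_mono:
            synthetic_tgt = model.translate(s, out_lang="tgt")    # s -> t
            model.train_step(inp=synthetic_tgt, out=s, out_lang="src")
    return model
```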
Experiments are run on the well-known WMT'16 En↔De and WMT'14 En↔Fr benchmarks. The combination of PBSMT and NMT gives the best results (see the last row).
If you want to explore more unsupervised MT approaches, check out the following works: