A Comparison of Transformer and Recurrent Neural Networks on Multilingual Neural Machine Translation
TL;DR — Multilingual NMT works, but what is it actually good and bad at? We compare Transformer and recurrent (RNN) architectures across bilingual, multilingual, and zero-shot settings — grounding the analysis in professional post-edits and human error categories, not BLEU alone.
The Problem
By 2018, multilingual NMT was clearly promising — one model handling many directions, strong in low-resource settings, and even capable of zero-shot translation between pairs never seen in training. But the field lacked a careful answer to a basic question: what is a multilingual model actually capable of, and where does it break? Aggregate BLEU scores hide which kinds of errors each system makes.
Approach
We ran a controlled, comparative study with three axes:
- System type — bilingual vs. multilingual vs. zero-shot.
- Architecture — recurrent (RNN) vs. Transformer, the two dominant designs at the time.
- Language closeness — how the relatedness of source and target languages affects zero-shot quality.
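To make the zero-shot setting concrete, here is a minimal sketch that lists the directions a multilingual model never saw in training. The language codes and the English-centric training pairs are chosen for illustration only, and `zero_shot_directions` is a hypothetical helper, not code from the study.

```python
from itertools import permutations


def zero_shot_directions(languages, trained_pairs):
    """Return the (source, target) directions absent from the training data."""
    trained = set(trained_pairs)
    return sorted(
        pair for pair in permutations(languages, 2) if pair not in trained
    )


# Hypothetical setup: every language is paired with English only.
languages = ["en", "de", "fr"]
trained_pairs = [("en", "de"), ("de", "en"), ("en", "fr"), ("fr", "en")]

# The two non-English directions are the zero-shot ones.
print(zero_shot_directions(languages, trained_pairs))
```

A bilingual system would be trained on exactly one of these pairs, while the multilingual system covers all trained pairs in a single model.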
Crucially, the analysis goes beyond automatic metrics. We use multiple professional post-edits of the system outputs and classify errors into three human-interpretable categories (lexical, morphological, and word-order), reported alongside BLEU and TER (translation edit rate).
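As a rough illustration of scoring an output against several post-edits, the sketch below computes a simplified word-level edit rate and keeps the score against the closest post-edit. It is not the study's evaluation code, and it is not full TER: real TER also counts block shifts as single edits and normalizes by the average reference length. The function names and the example sentences are assumptions made for this sketch.

```python
def word_edit_distance(hyp, ref):
    """Levenshtein distance over word tokens (insert, delete, substitute)."""
    prev = list(range(len(ref) + 1))
    for i, h in enumerate(hyp, start=1):
        curr = [i]
        for j, r in enumerate(ref, start=1):
            cost = 0 if h == r else 1
            curr.append(min(prev[j] + 1,          # delete a hypothesis word
                            curr[j - 1] + 1,      # insert a reference word
                            prev[j - 1] + cost))  # match or substitute
        prev = curr
    return prev[-1]


def multi_ref_edit_rate(hypothesis, post_edits):
    """Lowest edits-per-reference-word over all post-edits (simplified, no shifts)."""
    if not post_edits:
        raise ValueError("at least one post-edit is required")
    hyp = hypothesis.split()
    rates = []
    for post_edit in post_edits:
        ref = post_edit.split()
        rates.append(word_edit_distance(hyp, ref) / max(len(ref), 1))
    return min(rates)


# Invented example: one system output and two post-edits of it.
rate = multi_ref_edit_rate(
    "the cat sat on mat",
    ["the cat sat on the mat", "a cat sat on the mat"],
)
print(round(rate, 3))
```

Scoring against the closest post-edit rewards an output that matches any one of the acceptable corrections, which is the motivation for collecting more than one post-edit per sentence.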
Key Findings
- The Transformer consistently produced higher-quality translations than the recurrent architecture, and the gap was most visible in the zero-shot directions.
- Language closeness matters for zero-shot: the more related the source and target languages, the better the zero-shot output.
- The error-category lens (lexical / morphology / word order) revealed differences that BLEU alone masked — i.e., two systems with similar BLEU can fail in qualitatively different ways.
For the full breakdown of BLEU/TER and per-category error counts, see the paper.
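The error-category lens can be sketched as a per-system error profile. The annotations below are toy data invented for illustration and are not results from the paper; `error_profile` and the category identifiers are hypothetical names for this sketch.

```python
from collections import Counter

CATEGORIES = ("lexical", "morphology", "word_order")


def error_profile(labels):
    """Return each category's share of the annotated errors."""
    counts = Counter(labels)
    unknown = set(counts) - set(CATEGORIES)
    if unknown:
        raise ValueError(f"unknown error categories: {sorted(unknown)}")
    total = sum(counts.values())
    return {c: (counts[c] / total if total else 0.0) for c in CATEGORIES}


# Toy annotations (invented): both systems make ten errors in total.
system_a = ["lexical"] * 6 + ["morphology"] * 2 + ["word_order"] * 2
system_b = ["lexical"] * 2 + ["morphology"] * 2 + ["word_order"] * 6

print(error_profile(system_a))
print(error_profile(system_b))
```

Both toy systems have the same error total, yet one fails mostly on word choice and the other on word order, which is the kind of difference a single aggregate score cannot show.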
Why It Matters
This was one of the early in-depth audits of what multilingual NMT learns rather than just how high it scores. The methodology — pairing automatic metrics with professional post-edits and error typologies — is a template for evaluating any multilingual or zero-shot system, and the architecture/closeness findings informed later multilingual MT design.
Details & Resources
- Paper: COLING 2018 (ACL Anthology) · arXiv
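- Citation: S. M. Lakew, M. Cettolo, M. Federico. "A Comparison of Transformer and Recurrent Neural Networks on Multilingual Neural Machine Translation." In Proceedings of the 27th International Conference on Computational Linguistics (COLING), Santa Fe, New Mexico, 2018.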