TL;DR — Multilingual NMT works, but what is it actually good and bad at? We compare Transformer and recurrent (RNN) architectures across bilingual, multilingual, and zero-shot settings — grounding the analysis in professional post-edits and human error categories, not BLEU alone.

The Problem

By 2018, multilingual NMT was clearly promising — one model handling many directions, strong in low-resource settings, and even capable of zero-shot translation between pairs never seen in training. But the field lacked a careful answer to a basic question: what is a multilingual model actually capable of, and where does it break? Aggregate BLEU scores hide which kinds of errors each system makes.

Approach

We ran a controlled, comparative study with three axes:

  1. System type — bilingual vs. multilingual vs. zero-shot.
  2. Architecture — recurrent (RNN) vs. Transformer, the two dominant designs at the time.
  3. Language closeness — how the relatedness of source and target languages affects zero-shot quality.

Crucially, the analysis goes beyond automatic metrics. We collect multiple professional post-edits of the outputs and classify errors into human-interpretable categories (lexical, morphological, and word-order), reported alongside BLEU and TER.
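To make the post-edit-based evaluation concrete, here is a minimal sketch of scoring one system output against several post-edits. It is not the paper's evaluation code. It assumes whitespace tokenization and uses plain word-level edit distance, so it omits the block shifts that real TER allows. Taking the minimum rate over post-edits is also an assumption made for illustration, and the function names are hypothetical.

```python
from typing import List, Sequence


def edit_distance(hyp: Sequence[str], ref: Sequence[str]) -> int:
    """Word-level Levenshtein distance (insert, delete, substitute).

    Simplification: real TER also counts block shifts as single edits.
    """
    prev = list(range(len(ref) + 1))
    for i, h in enumerate(hyp, start=1):
        curr = [i]
        for j, r in enumerate(ref, start=1):
            cost = 0 if h == r else 1
            curr.append(min(prev[j] + 1,          # delete a hypothesis word
                            curr[j - 1] + 1,      # insert a reference word
                            prev[j - 1] + cost))  # match or substitute
        prev = curr
    return prev[-1]


def multi_ref_edit_rate(hypothesis: str, post_edits: List[str]) -> float:
    """Lowest edits-per-reference-word rate over all non-empty post-edits.

    Assumption: the closest post-edit is the one that counts.
    """
    hyp = hypothesis.split()
    rates = []
    for post_edit in post_edits:
        ref = post_edit.split()
        if ref:
            rates.append(edit_distance(hyp, ref) / len(ref))
    if not rates:
        raise ValueError("need at least one non-empty post-edit")
    return min(rates)


if __name__ == "__main__":
    output = "the cat sat on mat"
    edits = ["the cat sat on the mat", "a cat sat on the mat"]
    print(round(multi_ref_edit_rate(output, edits), 3))  # 0.167
```

Having several post-edits means an output is not penalized for differing from one editor's stylistic choices: the first post-edit above needs only one inserted word, while the second needs two edits.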

Key Findings

For the full breakdown of BLEU/TER and per-category error counts, see the paper.

Why It Matters

This was one of the early in-depth audits of what multilingual NMT learns rather than just how high it scores. The methodology — pairing automatic metrics with professional post-edits and error typologies — is a template for evaluating any multilingual or zero-shot system, and the architecture/closeness findings informed later multilingual MT design.

Details & Resources