Isometric MT: Neural Machine Translation for Automatic Dubbing

TL;DR: Automatic dubbing needs translations whose length matches the source so the dubbed speech stays in sync. Prior methods over-generate an N-best list and then re-rank it. Isometric MT teaches a single transformer to produce length-matched output directly, which is simpler and, in the evaluations reported below, better than those more complex pipelines.

The Problem

In automatic dubbing, the translated speech has to fit the time the original speaker was talking. That makes length a first-class constraint: the translation should land within roughly ±10% of the source character count, without sacrificing quality. The two pull against each other — squeezing output to a target length usually degrades translation. The common fix is a two-step pipeline: generate an N-best list of hypotheses, then re-rank them by a length-and-quality function. It works, but it’s heavy: multiple decodes plus an auxiliary ranker.
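To make the constraint and the baseline concrete, here is a minimal sketch of a ±10% character-length check and a toy N-best re-ranker. The names (`Hypothesis`, `rerank`, `length_weight`) and the linear scoring function are illustrative assumptions, not the ranking function used in any particular prior system.

```python
from dataclasses import dataclass
from typing import List


def length_ratio(source: str, translation: str) -> float:
    """Character-length ratio of the translation to the source."""
    if not source:
        raise ValueError("source must be non-empty")
    return len(translation) / len(source)


def is_length_compliant(source: str, translation: str, tolerance: float = 0.10) -> bool:
    """True if the translation length is within +/- tolerance of the source length."""
    if not source:
        raise ValueError("source must be non-empty")
    return abs(len(translation) - len(source)) <= tolerance * len(source)


@dataclass
class Hypothesis:
    text: str
    model_score: float  # assumed: a quality score where higher is better


def rerank(source: str, hypotheses: List[Hypothesis], length_weight: float = 1.0) -> Hypothesis:
    """Pick the hypothesis with the best quality score after a length penalty.

    The linear combination is an assumption made for illustration only.
    """
    def combined(h: Hypothesis) -> float:
        penalty = abs(length_ratio(source, h.text) - 1.0)
        return h.model_score - length_weight * penalty

    return max(hypotheses, key=combined)
```

Setting `length_weight` to 0 recovers plain best-score selection, which shows the tension directly: the hypothesis the model likes best is not necessarily the one that fits the time slot.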

Approach

We replace that pipeline with a self-learning approach: the transformer learns to generate length-compliant translations directly, in a single pass. No N-best list, no separate ranking function — the length-matching behavior is baked into the model itself.

Figure: Conventional N-best generation and re-ranking versus the Isometric MT self-learning approach. Left: the conventional generate-N-best-then-re-rank pipeline. Right: Isometric MT learns to produce length-matched output directly, in one decode.
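This post does not spell out the training recipe. One plausible reading of "self-learning", assumed here rather than taken from the post, is that the model translates its own training sources, each output is bucketed by its length ratio, and a matching tag is prepended to the source so the model learns to associate tags with output length. The tag strings, the ±10% bucket boundaries, and the `translate` callable below are all hypothetical.

```python
from typing import Callable, Iterable, List, Tuple


def length_class(source: str, translation: str, tolerance: float = 0.10) -> str:
    """Bucket a translation by its character-length ratio to the source.

    The tag names are hypothetical placeholders.
    """
    if not source:
        raise ValueError("source must be non-empty")
    ratio = len(translation) / len(source)
    if ratio < 1.0 - tolerance:
        return "<short>"
    if ratio > 1.0 + tolerance:
        return "<long>"
    return "<normal>"


def tag_pair(source: str, translation: str) -> Tuple[str, str]:
    """Prepend the length tag to the source side of a training pair."""
    return f"{length_class(source, translation)} {source}", translation


def build_self_training_data(
    sources: Iterable[str],
    translate: Callable[[str], str],  # assumed: the current model's decode function
) -> List[Tuple[str, str]]:
    """Translate each source with the model itself and tag the resulting pair."""
    return [tag_pair(src, translate(src)) for src in sources]
```

Under this assumption, inference would prepend `<normal>` to every input to request a length-matched translation in a single decode, with no N-best list or ranker involved.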

Key Results

  • Evaluated on four language pairs — English → French, Italian, German, Spanish — on a publicly available benchmark.
  • Both automatic and manual evaluation show Isometric MT outperforms the more complex N-best + re-ranking approaches from the literature — while being a single model with a single decoding pass.

Why It Matters

Isometric MT makes length control a property of the model rather than a bolt-on pipeline, which is cheaper to run and easier to deploy in a real dubbing system. It’s a cornerstone of the broader automatic-dubbing line of work — from verbosity control and isochrony-aware translation to jointly optimizing translation and speech timing — and it underpinned the Isometric Spoken Language Translation shared task at IWSLT 2022.
