Published Jun 02, 2026

MOS (Mean Opinion Score)

MOS is the standard metric for evaluating TTS voice quality. Human listeners rate generated speech samples on a 1-5 scale, and the scores are averaged across all raters.
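As a minimal sketch of the averaging step, the snippet below computes a MOS from a flat list of ratings and attaches a 95% confidence interval. The function name `mean_opinion_score`, the sample ratings, and the normal-approximation interval are illustrative assumptions, not part of any standard toolkit.

```python
import math
import statistics


def mean_opinion_score(ratings):
    """Return (mos, half_width) for a list of 1-5 listener ratings.

    half_width is the 95% confidence interval half-width under a
    normal approximation (an assumption; small panels may warrant
    a t-distribution instead).
    """
    if len(ratings) < 2:
        raise ValueError("need at least two ratings")
    if any(r < 1 or r > 5 for r in ratings):
        raise ValueError("ratings must be on the 1-5 scale")
    mos = statistics.mean(ratings)
    # Sample standard deviation divided by sqrt(n) gives the standard error.
    half_width = 1.96 * statistics.stdev(ratings) / math.sqrt(len(ratings))
    return mos, half_width


# Hypothetical ratings from eight listeners for one system.
ratings = [4, 5, 3, 4, 4, 5, 4, 3]
mos, ci = mean_opinion_score(ratings)
print(f"MOS = {mos:.2f} +/- {ci:.2f}")  # MOS = 4.00 +/- 0.52
```

Reporting the interval alongside the mean makes it clearer whether a gap between two systems is larger than the rater noise.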

The Scale

Score  Quality    Description
5      Excellent  Completely natural, indistinguishable from human
4      Good       Slightly synthetic but fully intelligible and pleasant
3      Fair       Clearly synthetic but understandable with minor effort
2      Poor       Difficult to understand, requires significant effort
1      Bad        Unintelligible

A MOS of 4.0 or above is generally considered production-quality. Top commercial TTS services score in the 4.2-4.5 range. Local models in 2026 typically score between 3.5 and 4.3, depending on the model, voice, and test conditions.

Limitations

MOS measures perceived naturalness in isolation, not real-world fitness. A model that scores 4.3 on individual sentences may still fail on long-form narration due to prosody drift or listener fatigue — dimensions MOS does not capture.

MOS is also sensitive to the test methodology. The choice of raters, the language being tested, the content of the test sentences, and the playback system all affect scores. Comparing MOS scores across different papers or products is rarely meaningful unless the methodology is identical.
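One way to guard against mismatched comparisons is to store the test conditions next to each score and compare only results that share them. The sketch below is a hypothetical record layout; the class names, field names, and sample values are assumptions for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Methodology:
    """Test conditions that affect MOS (hypothetical field set)."""
    rater_pool: str
    language: str
    sentence_set: str
    playback: str


@dataclass(frozen=True)
class MosResult:
    system: str
    mos: float
    methodology: Methodology


def comparable(a, b):
    """Two results are comparable only if every test condition matches."""
    return a.methodology == b.methodology


# Hypothetical results: a and b share conditions, c used different playback.
shared = Methodology("20 native speakers", "en-US", "news-50", "headphones")
other = Methodology("20 native speakers", "en-US", "news-50", "laptop speakers")
a = MosResult("model-a", 4.3, shared)
b = MosResult("model-b", 4.1, shared)
c = MosResult("model-c", 4.4, other)

print(comparable(a, b))  # True
print(comparable(a, c))  # False
```

Exact equality is deliberately strict here: a single differing condition, such as the playback system, is enough to block the comparison.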

What to Use Instead

For practical evaluation, supplement MOS with:

  • Task-specific tests — Does the model handle your domain vocabulary? Does it maintain consistency across 10 minutes of narration?
  • A/B preference tests — Which of two models do listeners prefer for your specific use case?
  • Error analysis — What types of errors (mispronunciations, prosody, artifacts) does the model make, and how often?
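For the A/B preference test, a simple way to check whether a preference is more than chance is an exact two-sided sign test on the vote counts. The sketch below assumes ties are dropped before counting; the function name `sign_test_p_value` and the vote counts are hypothetical.

```python
from math import comb


def sign_test_p_value(prefers_a, prefers_b):
    """Exact two-sided binomial test of the votes against a 50/50 split.

    Assumes each vote is independent and ties were excluded beforehand.
    """
    n = prefers_a + prefers_b
    if n == 0:
        raise ValueError("need at least one non-tied vote")
    k = max(prefers_a, prefers_b)
    # Probability of a split at least this lopsided in one direction.
    tail = sum(comb(n, i) for i in range(k, n + 1)) / 2 ** n
    # Double for the two-sided test, capped at 1.
    return min(1.0, 2 * tail)


# Hypothetical outcome: 14 listeners prefer model A, 6 prefer model B.
p = sign_test_p_value(14, 6)
print(f"p = {p:.3f}")  # p = 0.115
```

In this made-up example, a 14 to 6 split among 20 listeners is not significant at the conventional 0.05 level, which suggests that a panel of this size can only detect fairly strong preferences.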