Mean Opinion Score (MOS) is the standard subjective metric for evaluating TTS voice quality. Human listeners rate generated speech samples on a 1-5 scale, and the ratings are averaged across all raters.
| Score | Quality |
|---|---|
| 5 | Excellent — completely natural, indistinguishable from human |
| 4 | Good — slightly synthetic but fully intelligible and pleasant |
| 3 | Fair — clearly synthetic but understandable with minor effort |
| 2 | Poor — difficult to understand, requires significant effort |
| 1 | Bad — unintelligible |
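The averaging step can be sketched in a few lines of Python. This is a minimal illustration, not a standardized procedure: the function name `mean_opinion_score` and the sample ratings are hypothetical, and the 95% interval uses a simple normal approximation, which is an assumption that gets rough with few raters.

```python
import math
import statistics


def mean_opinion_score(ratings):
    """Return (mos, half_width) for a list of ratings on the 1-5 scale.

    half_width is an approximate 95% confidence half-width based on a
    normal approximation (an assumption made for this sketch).
    """
    if len(ratings) < 2:
        raise ValueError("need at least two ratings")
    if any(r < 1 or r > 5 for r in ratings):
        raise ValueError("ratings must be on the 1-5 scale")
    mos = statistics.mean(ratings)
    # Standard error of the mean, scaled by 1.96 for a ~95% interval.
    half_width = 1.96 * statistics.stdev(ratings) / math.sqrt(len(ratings))
    return mos, half_width


# Made-up ratings from eight listeners for one voice.
ratings = [4, 5, 4, 3, 4, 4, 5, 4]
mos, half_width = mean_opinion_score(ratings)
print(f"MOS = {mos:.2f} +/- {half_width:.2f}")
```

Reporting the interval alongside the mean makes it easier to see whether a small MOS difference is larger than the rater noise.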
A MOS of 4.0 or above is generally considered production-quality. Top commercial TTS services score in the 4.2-4.5 range. Local models in 2026 typically score between 3.5 and 4.3, depending on the model, voice, and test conditions.
MOS measures perceived naturalness of samples in isolation, not real-world fitness. A model that scores 4.3 on individual sentences may still fail on long-form narration because of prosody drift or listener fatigue, neither of which MOS captures.
MOS is also sensitive to the test methodology. The choice of raters, the language being tested, the content of the test sentences, and the playback system all affect scores. Comparing MOS scores across different papers or products is rarely meaningful unless the methodology is identical.
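When two systems are rated under identical conditions (same listeners, same sentences, same playback setup), one way to compare them is to look at per-sentence differences rather than the two headline numbers. The sketch below assumes exactly that setup. The function name `paired_mos_difference` and the scores are made up for illustration, and the interval is again a rough normal approximation.

```python
import math
import statistics


def paired_mos_difference(scores_a, scores_b):
    """Return (mean_diff, half_width) for per-sentence scores of A and B.

    Both lists must hold scores for the same sentences in the same order,
    collected with the same raters and playback setup (an assumption).
    """
    if len(scores_a) != len(scores_b):
        raise ValueError("score lists must have the same length")
    if len(scores_a) < 2:
        raise ValueError("need at least two paired scores")
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    mean_diff = statistics.mean(diffs)
    # Approximate 95% half-width of the mean difference.
    half_width = 1.96 * statistics.stdev(diffs) / math.sqrt(len(diffs))
    return mean_diff, half_width


# Hypothetical per-sentence mean ratings for two systems.
system_a = [4.2, 3.9, 4.5, 4.0, 4.1]
system_b = [4.0, 3.8, 4.1, 4.0, 3.9]
mean_diff, half_width = paired_mos_difference(system_a, system_b)
print(f"A - B = {mean_diff:.2f} +/- {half_width:.2f}")
```

If the interval around the difference includes zero, the test does not support a preference for either system. Scores taken from separate studies cannot be paired this way, which is why cross-paper comparisons carry little weight.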
For practical evaluation, supplement MOS with: