Duration prediction determines how long each acoustic unit (phoneme, syllable, or word) lasts in the generated audio. It is one of the most perceptually important components of a TTS system: wrong durations produce rushed, dragging, or robotic pacing even when every other component performs well.
Human speech has natural timing variation. Stressed vowels are longer. Unstressed vowels are shorter. Words at the end of a phrase are often lengthened. Pauses between sentences vary by context.
When duration prediction is flat — every phoneme getting roughly equal time — the output sounds monotonous and mechanical. When it is erratic, the speech sounds disfluent. Good duration prediction is what makes generated speech sound like a human reading, not a machine reciting.
Rule-based duration uses hand-written rules: stressed vowels get X ms, unstressed vowels get Y ms, final lengthening adds Z ms. This approach is simple and predictable, but it rarely sounds fully natural.
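The rule scheme above can be sketched in a few lines. This is a minimal illustration, not a production rule set: the vowel inventory, the millisecond values, the fixed consonant duration, and the choice to apply final lengthening only to the last phoneme are all assumptions made for the example.

```python
# Hypothetical vowel inventory (ARPAbet-style symbols), for illustration only.
VOWELS = {"AA", "AE", "AH", "EH", "IH", "IY", "OW", "UW"}

def rule_based_durations(phonemes, stressed,
                         stressed_vowel_ms=120,     # "X ms" (assumed value)
                         unstressed_vowel_ms=70,    # "Y ms" (assumed value)
                         consonant_ms=60,           # assumed; not given in the text
                         final_lengthening_ms=30):  # "Z ms" (assumed value)
    """Return one duration in milliseconds per phoneme."""
    durations = []
    last = len(phonemes) - 1
    for i, (phoneme, is_stressed) in enumerate(zip(phonemes, stressed)):
        if phoneme in VOWELS:
            ms = stressed_vowel_ms if is_stressed else unstressed_vowel_ms
        else:
            ms = consonant_ms
        if i == last:
            # Simplification: lengthen only the phrase-final phoneme.
            ms += final_lengthening_ms
        durations.append(ms)
    return durations

durations = rule_based_durations(["HH", "AH", "L", "OW"],
                                 [False, False, False, True])
```

Because every value comes from a fixed table, the same phoneme in the same position always gets the same duration, which is where the predictable but mechanical character of rule-based timing comes from.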
Statistical duration models learn from data. A model trained on forced-aligned speech learns the typical duration of each phoneme in different contexts. The result is more natural than rules, but it requires alignment data.
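As a rough sketch of the statistical approach, the simplest possible model is a lookup table of mean durations per context. The example below assumes that context is just (phoneme, stress) and that the aligned data arrives as a list of (phoneme, stressed, duration_ms) tuples. Both are simplifications; real systems use much richer context and more capable models.

```python
from collections import defaultdict
from statistics import mean

def fit_duration_table(aligned):
    """aligned: list of (phoneme, stressed, duration_ms) from forced alignment."""
    by_context = defaultdict(list)
    by_phoneme = defaultdict(list)
    for phoneme, stressed, ms in aligned:
        by_context[(phoneme, stressed)].append(ms)
        by_phoneme[phoneme].append(ms)
    return {
        "context": {key: mean(vals) for key, vals in by_context.items()},
        "phoneme": {key: mean(vals) for key, vals in by_phoneme.items()},
        "global": mean(ms for _, _, ms in aligned),
    }

def predict_duration(table, phoneme, stressed):
    """Back off from the full context to the phoneme, then to the global mean."""
    if (phoneme, stressed) in table["context"]:
        return table["context"][(phoneme, stressed)]
    if phoneme in table["phoneme"]:
        return table["phoneme"][phoneme]
    return table["global"]

# Toy alignment data, invented for the example.
table = fit_duration_table([
    ("AH", True, 120), ("AH", True, 100), ("AH", False, 60), ("T", False, 50),
])
```

The backoff step matters in practice: many contexts never appear in the training data, so the model needs a sensible fallback instead of failing on unseen combinations.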
Neural duration prediction is integrated into the TTS model itself. The model learns to predict durations as part of its own training process. This approach is used in modern non-autoregressive models such as FastSpeech.
In non-autoregressive (NAR) models, a separate duration predictor module estimates how many acoustic frames each input token should occupy. These durations are used to expand the token sequence, repeating each token's representation for its predicted number of frames, before the acoustic model generates the spectrogram.
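The expansion step can be sketched as follows. Plain strings stand in for token representations, and the helper names, the rounding of real-valued predictions, and the default hop length and sample rate are assumptions chosen for illustration rather than any particular model's settings.

```python
def length_regulate(tokens, durations):
    """Repeat each token by its frame count (rounded, clamped at zero)."""
    if len(tokens) != len(durations):
        raise ValueError("tokens and durations must have the same length")
    expanded = []
    for token, frames in zip(tokens, durations):
        expanded.extend([token] * max(0, round(frames)))
    return expanded

def frames_to_seconds(num_frames, hop_length=256, sample_rate=22050):
    """Convert a frame count to seconds (hop length and rate are assumed)."""
    return num_frames * hop_length / sample_rate

expanded = length_regulate(["k", "ae", "t"], [2, 4, 1])
seconds = frames_to_seconds(len(expanded))
```

Here three tokens become seven frames, or about 0.08 seconds of audio under the assumed settings. The length of the expanded sequence fixes the length of the output, so any error in the predicted durations carries straight through to the pacing of the speech.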
If the duration predictor underestimates, speech sounds rushed. If it overestimates, speech sounds slow. Errors are especially noticeable on content words and at phrase boundaries.
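Duration prediction is closely related to alignment, the problem of matching input text tokens to output audio frames. In autoregressive models, alignment emerges implicitly from the attention mechanism as the model generates one frame after another. In non-autoregressive models, alignment must be supplied or learned explicitly during training, for example from a teacher model's attention or from an external forced aligner, so that the duration predictor has targets to learn from.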
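One common way to turn an alignment into duration targets is to assign each output frame to the input token it attends to most strongly, then count frames per token. The sketch below assumes the alignment is available as a nested list attn[t][i], the weight that frame t places on token i. The function name and the toy matrix are invented for the example.

```python
def durations_from_attention(attn, num_tokens):
    """Count, for each token, the frames whose strongest weight lands on it."""
    counts = [0] * num_tokens
    for frame_weights in attn:
        best_token = max(range(num_tokens), key=lambda i: frame_weights[i])
        counts[best_token] += 1
    return counts

# Toy alignment: 5 output frames over 3 input tokens.
attn = [
    [0.9, 0.1, 0.0],
    [0.8, 0.2, 0.0],
    [0.1, 0.7, 0.2],
    [0.0, 0.3, 0.7],
    [0.0, 0.1, 0.9],
]
targets = durations_from_attention(attn, num_tokens=3)
```

The resulting counts always sum to the number of output frames, which makes them directly usable as per-token frame durations. A noisy or non-monotonic alignment will produce poor targets, so the quality of the alignment limits the quality of the duration predictor.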
Duration prediction errors are one of the main failure modes of non-autoregressive TTS. A model with otherwise excellent voice quality can sound unnatural if duration prediction is poor. Listen for pacing issues — words that feel truncated or stretched — as a sign of duration prediction problems.
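Listening remains the primary check, but when reference durations are available (for instance from forced alignment of a held-out recording), a simple ratio test can point to where to listen. The sketch below is only a heuristic: the function name and the thresholds of 0.6 and 1.6 are arbitrary assumptions, not established values.

```python
def flag_pacing_issues(tokens, predicted, reference, low=0.6, high=1.6):
    """Flag tokens whose predicted/reference duration ratio is out of range."""
    issues = []
    for token, pred, ref in zip(tokens, predicted, reference):
        if ref <= 0:
            continue  # no usable reference duration for this token
        ratio = pred / ref
        if ratio < low:
            issues.append((token, "truncated", ratio))
        elif ratio > high:
            issues.append((token, "stretched", ratio))
    return issues

# Durations in frames, invented for the example.
issues = flag_pacing_issues(["the", "cat", "sat"], [5, 4, 20], [5, 10, 10])
```

In this toy case "cat" is flagged as truncated and "sat" as stretched, while "the" passes. Flagged tokens are candidates for closer listening rather than confirmed errors, since natural speech tolerates a fair amount of timing variation.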