Published Jun 02, 2026

Duration Prediction

Duration prediction determines how long each speech unit (a phoneme, syllable, or word) lasts in the generated audio. It is one of the most perceptually important components of a text-to-speech (TTS) system: wrong durations produce rushed, dragging, or robotic pacing even when every other component works well.

Why Duration Matters

Human speech has natural timing variation. Stressed vowels are longer. Unstressed vowels are shorter. Words at the end of a phrase are often lengthened. Pauses between sentences vary by context.

When duration prediction is flat — every phoneme getting roughly equal time — the output sounds monotonous and mechanical. When it is erratic, the speech sounds disfluent. Good duration prediction is what makes generated speech sound like a human reading, not a machine reciting.

Approaches

Rule-based duration uses hand-written rules: stressed vowels get X ms, unstressed vowels get Y ms, and phrase-final lengthening adds Z ms. This approach is simple and predictable, but it rarely sounds fully natural.
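
As a minimal sketch of what such a rule set might look like, the Python below assigns a base duration by phoneme class, adds a bonus for stressed vowels, and lengthens the last phoneme of a phrase. Every number, the vowel inventory, and the function name `rule_based_durations` are illustrative assumptions, not values from any real system.

```python
# Hypothetical rule table: all values are illustrative, not measured.
BASE_MS = {"vowel": 80, "consonant": 60}
STRESS_BONUS_MS = 40       # extra time for stressed vowels
FINAL_LENGTHENING = 1.3    # multiplier for the last phoneme of a phrase

# A small, assumed vowel inventory (ARPAbet-style symbols).
VOWELS = {"AA", "AE", "AH", "EH", "IH", "IY", "OW", "UW"}


def rule_based_durations(phonemes):
    """Return one duration in ms per (symbol, stressed) pair of a phrase."""
    durations = []
    last = len(phonemes) - 1
    for i, (symbol, stressed) in enumerate(phonemes):
        kind = "vowel" if symbol in VOWELS else "consonant"
        ms = BASE_MS[kind]
        if kind == "vowel" and stressed:
            ms += STRESS_BONUS_MS
        if i == last:
            ms *= FINAL_LENGTHENING  # phrase-final lengthening
        durations.append(round(ms))
    return durations


phrase = [("HH", False), ("AH", False), ("L", False), ("OW", True)]
print(rule_based_durations(phrase))
```

The rules are easy to read and to tune by hand, which is the appeal of the approach; the cost is that every context the rules do not mention gets the same duration.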

Statistical duration models learn from data. A model trained on forced-aligned speech learns the typical duration of each phoneme in different contexts, such as stress and position in the phrase. The result is more natural than rules, but it requires aligned training data.
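
The simplest statistical model is a lookup table of average durations. The sketch below assumes forced-aligned data arrives as `(phoneme, stressed, duration_ms)` tuples and backs off from the full context to the phoneme average, then to the global average, when a context was never seen. The data layout and the function names are assumptions made for illustration; real systems typically use richer contexts and models such as decision trees.

```python
from collections import defaultdict
from statistics import mean


def fit_duration_table(aligned):
    """Build average-duration tables from (phoneme, stressed, duration_ms) tuples."""
    by_context = defaultdict(list)
    by_phoneme = defaultdict(list)
    everything = []
    for phoneme, stressed, ms in aligned:
        by_context[(phoneme, stressed)].append(ms)
        by_phoneme[phoneme].append(ms)
        everything.append(ms)
    return {
        "context": {key: mean(values) for key, values in by_context.items()},
        "phoneme": {key: mean(values) for key, values in by_phoneme.items()},
        "global": mean(everything),
    }


def predict_duration(table, phoneme, stressed):
    """Look up a duration, backing off from specific to general statistics."""
    if (phoneme, stressed) in table["context"]:
        return table["context"][(phoneme, stressed)]
    if phoneme in table["phoneme"]:
        return table["phoneme"][phoneme]
    return table["global"]


# Toy aligned data, invented for the example.
data = [("AH", True, 120), ("AH", True, 100), ("AH", False, 60), ("T", False, 40)]
table = fit_duration_table(data)
print(predict_duration(table, "AH", True))   # seen context
print(predict_duration(table, "T", True))    # backs off to the phoneme average
print(predict_duration(table, "Z", False))   # backs off to the global average
```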

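Neural duration prediction is integrated into the TTS model itself: a duration predictor is trained jointly with the rest of the network, typically against target durations extracted from an aligner or a teacher model. This is the approach used in modern non-autoregressive models such as FastSpeech.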
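
To make the training signal concrete, here is a hedged sketch of a duration loss. It assumes the predictor outputs one value per token in the log domain and is trained with mean squared error against `log(target_frames + 1)`, a common choice in FastSpeech-style models; the exact loss and offset vary between implementations, and the function names are hypothetical.

```python
import math


def duration_loss(predicted_log, target_frames):
    """Mean squared error between predicted log-durations and log(target + 1)."""
    targets = [math.log(frames + 1) for frames in target_frames]
    squared = [(p - t) ** 2 for p, t in zip(predicted_log, targets)]
    return sum(squared) / len(squared)


def to_frames(predicted_log):
    """Invert the log transform and round to whole, non-negative frame counts."""
    return [max(0, round(math.exp(p) - 1)) for p in predicted_log]


target = [3, 0, 7]                               # frames per token from an aligner
perfect = [math.log(d + 1) for d in target]      # a predictor that is exactly right
print(duration_loss(perfect, target))            # zero loss for a perfect prediction
print(to_frames(perfect))                        # recovers the target frame counts
```

Working in the log domain keeps very long units, such as pauses, from dominating the loss.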

The Duration Predictor in Non-Autoregressive TTS

In non-autoregressive (NAR) models, a separate duration predictor module estimates how many acoustic frames each input token should occupy. These durations are used to expand the token sequence to the length of the output before the acoustic model generates the spectrogram.
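
The expansion step can be sketched in a few lines: each token's hidden state is repeated once per predicted frame. In the Python below, strings stand in for hidden-state vectors, and the `speed` parameter is a hypothetical control that scales all durations; neither is taken from a specific implementation.

```python
def length_regulate(token_states, durations, speed=1.0):
    """Repeat each token state for its predicted number of frames.

    speed > 1.0 shortens every duration (faster speech); speed < 1.0 lengthens it.
    """
    if len(token_states) != len(durations):
        raise ValueError("need exactly one duration per token")
    expanded = []
    for state, frames in zip(token_states, durations):
        count = max(0, round(frames / speed))
        expanded.extend([state] * count)
    return expanded


states = ["h", "e", "y"]            # placeholders for per-token hidden vectors
print(length_regulate(states, [2, 1, 3]))
print(length_regulate(states, [2, 4, 6], speed=2.0))
```

Because the output length is the sum of the durations, any systematic bias in the predictor changes the speaking rate of the whole utterance.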

If the duration predictor underestimates, speech sounds rushed. If it overestimates, speech sounds slow. Errors are especially noticeable on content words and at phrase boundaries.

Alignment and Duration

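Duration prediction is closely related to alignment: the problem of matching input text tokens to output audio frames. In autoregressive (AR) models, alignment is learned implicitly, usually through the attention between the decoder and the encoded text. In NAR models there is no such step at synthesis time, so alignment must be supplied explicitly as durations, often extracted during training from the attention of a teacher model or from a forced aligner.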
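
One way to turn an alignment into durations is to count, for each token, how many audio frames attend to it most strongly. The sketch below assumes the alignment is given as a list of rows, one per frame, with one attention weight per token; the matrix values are invented for illustration.

```python
def durations_from_attention(attention):
    """Count, for each token, the frames whose strongest weight points at it."""
    num_tokens = len(attention[0])
    durations = [0] * num_tokens
    for row in attention:
        # Index of the token this frame attends to most.
        best = max(range(num_tokens), key=row.__getitem__)
        durations[best] += 1
    return durations


# Five audio frames aligned to three text tokens.
attention = [
    [0.9, 0.1, 0.0],
    [0.8, 0.2, 0.0],
    [0.1, 0.7, 0.2],
    [0.0, 0.2, 0.8],
    [0.0, 0.1, 0.9],
]
print(durations_from_attention(attention))
```

The resulting durations always sum to the number of frames, so they can serve directly as training targets for a duration predictor.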

In Practice

Duration prediction errors are one of the main failure modes of non-autoregressive TTS. A model with otherwise excellent voice quality can sound unnatural if duration prediction is poor. Listen for pacing issues — words that feel truncated or stretched — as a sign of duration prediction problems.
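
When reference durations are available, for example from a forced alignment of a real recording of the same text, listening can be complemented by a simple check. The sketch below flags tokens whose predicted duration is far from the reference; the thresholds `low` and `high` and the function name are arbitrary assumptions and would need tuning on real data.

```python
def flag_pacing_issues(tokens, predicted, reference, low=0.6, high=1.6):
    """Return (token, label, ratio) for tokens that look truncated or stretched."""
    flagged = []
    for token, pred, ref in zip(tokens, predicted, reference):
        if ref == 0:
            continue  # no reference duration to compare against
        ratio = pred / ref
        if ratio < low:
            flagged.append((token, "truncated", ratio))
        elif ratio > high:
            flagged.append((token, "stretched", ratio))
    return flagged


tokens = ["DH", "AH", "K", "AE", "T"]
predicted = [5, 3, 6, 20, 4]     # frames from the duration predictor
reference = [5, 6, 6, 10, 4]     # frames from an aligned reference recording
for token, label, ratio in flag_pacing_issues(tokens, predicted, reference):
    print(f"{token}: {label} (x{ratio:.2f})")
```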

Try Spokio for Mac.

Offline text-to-speech for Mac. Local voice cloning, batch export, and no cloud uploads for your text, audio, or voice samples.

macOS 15.6+ | Apple Silicon & Intel | English only

hi@spokio.pro
