Autoregressive (AR) and non-autoregressive (NAR) generation are two fundamentally different approaches to neural TTS. The choice between them strongly affects the speed, quality, and behavior of the generated speech.
AR models generate speech tokens one at a time, where each token depends on the previously generated tokens. This is the same approach used by GPT-style language models.
How it works: The model predicts the next output unit (an audio sample, a spectrogram frame, or a codec token, depending on the system) conditioned on the input text and all previously generated units. Because each step sees the full previous context, the output tends to have coherent long-range structure.
Strengths: More natural prosody, better handling of complex sentences, can produce more expressive speech.
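Weaknesses: Slow at inference, because each token requires its own forward pass and the steps cannot run in parallel across time (training can still be parallelized with teacher forcing). Susceptible to exposure bias: the model is trained on ground-truth history but generates from its own predictions, so early errors can accumulate and show up as skipped or repeated words.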
Examples: WaveNet (sample-level waveform model), Tacotron 2 (spectrogram decoder), VALL-E (codec language model with an AR first stage).
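The sequential dependency described above can be sketched in a few lines of Python. This is a toy illustration, not any real model: `toy_next_frame` is a hypothetical stand-in for one decoder forward pass, frames are plain floats, and the number of frames is fixed in advance (real systems usually predict when to stop).

```python
from typing import List, Tuple


def toy_next_frame(text_ids: List[int], history: List[float]) -> float:
    """Hypothetical stand-in for one decoder forward pass.

    The returned frame depends on the text and on every frame
    generated so far, which is the defining property of AR decoding.
    """
    text_bias = sum(text_ids) / len(text_ids)
    context = sum(history) / len(history) if history else 0.0
    return 0.5 * context + 0.1 * text_bias


def generate_autoregressive(text_ids: List[int], num_frames: int) -> Tuple[List[float], int]:
    """Generate frames one at a time and count the forward passes used."""
    frames: List[float] = []
    forward_passes = 0
    for _ in range(num_frames):
        # Step t needs frames[0:t], so it cannot start before step t-1 ends.
        frames.append(toy_next_frame(text_ids, frames))
        forward_passes += 1
    return frames, forward_passes


frames, passes = generate_autoregressive([3, 1, 4], num_frames=5)
print(len(frames), passes)  # one forward pass per generated frame
```

The loop body is the whole story: the cost grows linearly with output length, and an error in an early frame feeds into every later call.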
NAR models generate all speech tokens in parallel, or in a fixed number of steps, without depending on previously generated outputs.
How it works: The model takes the input text and predicts the full spectrogram or audio in one forward pass. Duration prediction — how long each phoneme should last — is handled by a separate module or learned jointly.
Strengths: Fast — generates audio significantly faster than real time. Stable — no accumulating errors. Consistent — same input always produces similar output.
Weaknesses: Can produce flatter prosody. Less expressive than AR models for complex or emotional content. Duration prediction errors can sound unnatural.
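Examples: FastSpeech, FastPitch, VITS (parallel at inference, with a stochastic duration predictor). WhisperSpeech is better classified as autoregressive, since it decodes its token sequences step by step.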
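As a rough sketch of the NAR pipeline described above, the snippet below predicts a duration per phoneme, expands the phoneme sequence to frame length (the "length regulator" idea used by FastSpeech-style models), and then decodes every frame independently. The duration rule and the decoder are invented toy functions, assumed purely for illustration.

```python
from typing import List


def predict_durations(phoneme_ids: List[int]) -> List[int]:
    """Hypothetical duration predictor: number of frames per phoneme (toy rule)."""
    return [2 + (p % 3) for p in phoneme_ids]


def length_regulate(phoneme_ids: List[int], durations: List[int]) -> List[int]:
    """Repeat each phoneme by its duration: one entry per output frame."""
    expanded: List[int] = []
    for phoneme, duration in zip(phoneme_ids, durations):
        expanded.extend([phoneme] * duration)
    return expanded


def decode_parallel(expanded: List[int]) -> List[float]:
    """Toy decoder: each frame depends only on the expanded input,
    never on other output frames, so all frames could be computed at once."""
    return [0.1 * p for p in expanded]


phonemes = [3, 1, 4]
durations = predict_durations(phonemes)
expanded = length_regulate(phonemes, durations)
frames = decode_parallel(expanded)
print(durations, len(frames))
```

Because the output length is fixed by the predicted durations before decoding starts, a bad duration estimate cannot be corrected later, which is why duration errors are audible.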
| Aspect | Autoregressive | Non-Autoregressive |
|---|---|---|
| Generation speed | Slow (sequential) | Fast (parallel) |
| Prosody naturalness | Often higher | Good, but can be flatter |
| Stability | Lower (drift risk) | Higher |
| Consistency | Variable | Predictable |
| Inference cost | Higher per utterance (one forward pass per step) | Lower per utterance (one or a few passes) |
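Many modern TTS systems use hybrid architectures: an AR component for prosody and expressiveness, combined with a NAR component for fast and stable generation. VALL-E, for example, generates the first codebook of audio tokens autoregressively and fills in the remaining codebooks non-autoregressively. Hybrids aim to combine the strengths of both approaches, at the cost of more complex training and inference pipelines.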
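One way to picture a hybrid is a two-stage token generator, loosely modeled on the split used by codec language models such as VALL-E: a sequential stage produces coarse tokens, and a parallel stage fills in finer codebook layers. Everything below is a toy assumption (the token arithmetic, `NUM_CODEBOOKS`, `VOCAB`); it only illustrates how the number of sequential steps changes.

```python
from typing import List

NUM_CODEBOOKS = 4  # assumed for illustration
VOCAB = 16         # assumed toy vocabulary size


def ar_stage(text_ids: List[int], num_frames: int) -> List[int]:
    """Sequential stage: each coarse token depends on the previous one."""
    tokens: List[int] = []
    for t in range(num_frames):
        prev = tokens[-1] if tokens else 0
        tokens.append((prev + text_ids[t % len(text_ids)] + 1) % VOCAB)
    return tokens


def nar_stage(coarse: List[int], num_codebooks: int) -> List[List[int]]:
    """Parallel stage: layer k is computed from the layer below it,
    with every time step handled independently."""
    layers = [coarse]
    for k in range(1, num_codebooks):
        below = layers[-1]
        layers.append([(tok * 3 + k) % VOCAB for tok in below])
    return layers


coarse = ar_stage([3, 1, 4], num_frames=6)
codes = nar_stage(coarse, NUM_CODEBOOKS)

# Sequential steps: one per frame in the AR stage, one per extra layer after it.
hybrid_steps = len(coarse) + (NUM_CODEBOOKS - 1)
fully_ar_steps = len(coarse) * NUM_CODEBOOKS
print(hybrid_steps, fully_ar_steps)
```

In this toy setup the hybrid needs 9 sequential steps where generating every token one by one would need 24, and the gap widens as the number of frames or codebooks grows.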
For real-time applications like voice assistants, NAR models are often preferred for their speed and consistency. For offline content generation where quality matters more than latency, AR models or hybrids often produce more natural results. For batch voiceover production, where throughput matters most, NAR models are faster and usually deliver acceptable quality.
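A common way to check whether a model is fast enough for a given use case is the real-time factor (RTF): synthesis time divided by the duration of the audio produced, with values below 1 meaning faster than real time. The helper below is a minimal sketch; the timings in the example are made-up numbers, not measurements of any model, and the function names are illustrative.

```python
def real_time_factor(synthesis_seconds: float, audio_seconds: float) -> float:
    """RTF = time spent synthesizing / duration of the audio produced."""
    if audio_seconds <= 0:
        raise ValueError("audio_seconds must be positive")
    return synthesis_seconds / audio_seconds


def is_faster_than_real_time(rtf: float) -> bool:
    """An RTF below 1.0 means audio is produced faster than it plays back."""
    return rtf < 1.0


# Hypothetical timings for a 5-second utterance.
fast_rtf = real_time_factor(synthesis_seconds=0.5, audio_seconds=5.0)
slow_rtf = real_time_factor(synthesis_seconds=7.5, audio_seconds=5.0)
print(fast_rtf, is_faster_than_real_time(fast_rtf))
print(slow_rtf, is_faster_than_real_time(slow_rtf))
```

For interactive use, RTF is only part of the picture: time to first audio also matters, so it is worth measuring both on the target hardware before choosing a model.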