Attention is the mechanism that allows a TTS model to determine which part of the input text corresponds to which part of the generated audio. Without correct alignment, the model would pronounce words in the wrong order or skip syllables.
When a TTS model generates a 5-second audio clip from a 10-word sentence, it must decide: which acoustic frame corresponds to which input token? For example, the first 200 ms might map to the first phoneme, the next 150 ms to the second phoneme, and so on. This mapping is called alignment.
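To make the mapping concrete, here is a minimal sketch that expands per-phoneme durations into a frame-to-token index map. The function name `frames_to_tokens`, the 50 ms frame length, and the third duration of 100 ms are assumptions chosen for illustration; real systems typically use much shorter frames.

```python
import numpy as np

def frames_to_tokens(durations_ms, frame_ms=50):
    """Expand per-token durations into a frame-to-token index map.

    frame_ms is a hypothetical frame length picked to keep the output short.
    """
    # Number of acoustic frames each token occupies (rounded to whole frames).
    frames_per_token = np.round(np.asarray(durations_ms) / frame_ms).astype(int)
    # Repeat each token index once for every frame it covers.
    return np.repeat(np.arange(len(frames_per_token)), frames_per_token)

# First phoneme lasts 200 ms, second 150 ms, third 100 ms.
alignment = frames_to_tokens([200, 150, 100])
print(alignment)  # [0 0 0 0 1 1 1 2 2]
```

Each entry of `alignment` is the index of the input token that the corresponding output frame belongs to, which is exactly the information an alignment encodes.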
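In autoregressive models, alignment is usually learned implicitly: the model generates audio left to right and, at each step, uses attention to decide which input tokens to read. Nothing guarantees that this learned alignment is correct, which is why attention failures occur. Non-autoregressive models generate all frames in parallel, so the alignment must be supplied or learned explicitly, for example as a predicted duration for each input token.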
Cross-attention computes a weighted sum of encoder outputs for each decoder position. The weights form the attention matrix, with one row per output frame and one column per input token, and show which input tokens the model is “looking at” when generating each output frame.
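The weighted sum can be sketched in a few lines of NumPy. This version assumes scaled dot-product scoring, which is one common choice; the text does not say which scoring function a given TTS model uses, and the array sizes and names below are illustrative only.

```python
import numpy as np

def cross_attention(queries, keys, values):
    """Scaled dot-product cross-attention (an assumed scoring function).

    queries: (T_out, d) decoder states, one per output frame
    keys:    (T_in, d)  encoder outputs, one per input token
    values:  (T_in, d_v)
    """
    d = queries.shape[-1]
    scores = queries @ keys.T / np.sqrt(d)           # (T_out, T_in)
    scores -= scores.max(axis=-1, keepdims=True)     # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)   # softmax: rows sum to 1
    context = weights @ values                       # weighted sum of encoder outputs
    return context, weights

rng = np.random.default_rng(0)
encoder_out = rng.normal(size=(4, 8))     # 4 input tokens
decoder_state = rng.normal(size=(6, 8))   # 6 output frames
context, attn = cross_attention(decoder_state, encoder_out, encoder_out)
print(attn.shape)  # (6, 4)
```

Row `i` of `attn` is the distribution over input tokens used to build the context vector for output frame `i`.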
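In a well-trained TTS model, the attention matrix is roughly diagonal. Because there are many more output frames than input tokens, the pattern is a staircase rather than a one-to-one line: several consecutive frames attend to token 1, the next several attend to token 2, and so on. Attention that strays far from this diagonal band indicates alignment errors, often audible as skipping, stuttering, or repeating sounds.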
Speech alignment has a crucial property: it is monotonic. As audio is generated, the position in the text only moves forward and never backtracks. You cannot go back and re-pronounce a previous word. Specialized attention mechanisms are designed to enforce this property.
When alignment fails, the output exhibits characteristic artifacts such as skipped words or syllables, stuttering, and repeated sounds.
These are among the most common failure modes in neural TTS and are almost always alignment problems.
Alignment quality is typically evaluated by visualizing the attention matrix. A clean diagonal line indicates good alignment. Scattered or chaotic attention indicates problems. For production TTS, models with robust monotonic attention mechanisms are preferred to minimize alignment failures.
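Visual inspection can be complemented by simple numeric checks. The sketch below is one possible heuristic, not a standard metric: it follows the most-attended token per frame and reports backward jumps (possible repeats) and tokens that are never attended to (possible skips). The function name `alignment_report` and the toy matrices are assumptions made for illustration.

```python
import numpy as np

def alignment_report(attn):
    """Summarize an attention matrix of shape (T_out, T_in)."""
    path = attn.argmax(axis=1)      # most-attended input token per output frame
    steps = np.diff(path)           # how the attended token moves frame to frame
    never_attended = sorted(set(range(attn.shape[1])) - set(path.tolist()))
    return {
        "monotonic": bool(np.all(steps >= 0)),       # path never moves backward
        "backward_steps": int(np.sum(steps < 0)),    # possible repeats or stutter
        "skipped_tokens": never_attended,            # possible skipped sounds
        "mean_max_weight": float(attn.max(axis=1).mean()),  # sharpness of focus
    }

# Clean staircase: 6 frames spread over 3 tokens.
good = np.eye(3)[[0, 0, 1, 1, 2, 2]]
# Faulty: jumps back to token 0 and never attends to token 1.
bad = np.eye(3)[[0, 0, 2, 0, 2, 2]]

print(alignment_report(good))
print(alignment_report(bad))
```

A check like this can flag suspicious utterances automatically, but a low `mean_max_weight` or a single backward step is only a hint; the attention plot remains the more reliable diagnostic.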