Published Jun 02, 2026

Attention and Alignment

Attention is the mechanism that allows a TTS model to determine which part of the input text corresponds to which part of the generated audio. Without correct alignment, the model would pronounce words in the wrong order or skip syllables.

The Alignment Problem

When a TTS model generates a 5-second audio clip from a 10-word sentence, it must decide which acoustic frames correspond to which input token. For example, the first 200 ms might map to the first phoneme, the next 150 ms to the second phoneme, and so on. This mapping is called alignment.
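The mapping can be made concrete with a minimal sketch that expands per-phoneme durations into one token index per acoustic frame. The function name `frame_alignment`, the 50 ms frame hop, and the third 100 ms phoneme are assumptions chosen for illustration, not values taken from any particular model.

```python
def frame_alignment(durations_ms, hop_ms):
    """Return one token index per acoustic frame.

    durations_ms: time each input token occupies, in milliseconds.
    hop_ms: time covered by one acoustic frame (illustrative value below).
    """
    alignment = []
    elapsed_ms = 0.0
    previous_boundary = 0
    for token_index, duration in enumerate(durations_ms):
        elapsed_ms += duration
        # Frame index at which this token ends, rounded to the frame grid.
        boundary = round(elapsed_ms / hop_ms)
        alignment.extend([token_index] * (boundary - previous_boundary))
        previous_boundary = boundary
    return alignment


# Three phonemes lasting 200 ms, 150 ms, and 100 ms, with 50 ms frames.
alignment = frame_alignment([200, 150, 100], hop_ms=50)
print(alignment)  # [0, 0, 0, 0, 1, 1, 1, 2, 2]
```

Each entry is the input token that the corresponding frame belongs to, so the list never decreases and every token covers a contiguous run of frames.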

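In attention-based autoregressive models, alignment is learned implicitly: the decoder generates audio frame by frame, left to right, and at each step attention selects which part of the input to read. Nothing guarantees that this implicit alignment is correct, which is why such models can skip or repeat text. Non-autoregressive models generate all frames in parallel, so alignment must be supplied explicitly, typically as a predicted duration for each input token.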

Cross-Attention

Cross-attention computes a weighted sum of encoder outputs for each decoder position. The weights for one decoder position are non-negative and sum to 1. Stacked over all positions, they form the attention matrix, which shows which input tokens the model is “looking at” when generating each output frame.
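As a rough illustration, the weighted sum can be written in a few lines of NumPy. This sketch assumes scaled dot-product scoring, which is only one option (TTS models often use additive or location-aware scoring instead). The names `cross_attention`, `decoder_states`, and `encoder_outputs` are hypothetical, and the inputs are random placeholders rather than real model activations.

```python
import numpy as np


def cross_attention(queries, keys, values):
    """Scaled dot-product cross-attention (one assumed scoring choice).

    queries: (num_frames, dim) decoder states
    keys:    (num_tokens, dim) encoder outputs
    values:  (num_tokens, value_dim) encoder outputs
    """
    scores = queries @ keys.T / np.sqrt(queries.shape[-1])
    # Subtract the row maximum before exponentiating for numerical stability.
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)  # softmax per frame
    context = weights @ values  # weighted sum of encoder outputs
    return context, weights


rng = np.random.default_rng(0)
decoder_states = rng.normal(size=(6, 8))   # 6 output frames
encoder_outputs = rng.normal(size=(4, 8))  # 4 input tokens
context, weights = cross_attention(decoder_states, encoder_outputs, encoder_outputs)
print(weights.shape)  # (6, 4): one row of attention weights per output frame
```

Row `i` of `weights` is the attention distribution for output frame `i`, and it is this matrix that is inspected when judging alignment.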

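In a well-trained TTS model, the attention matrix shows a sharp, roughly diagonal band: early output frames attend to the first input tokens, and later frames attend to later tokens. Because each token usually spans several frames, the band is a monotonic staircase rather than a strict one-to-one diagonal. Attention mass far from this band indicates alignment errors, often audible as skipping, stuttering, or repeated sounds.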

Monotonic Attention

Speech alignment has a crucial property: it is monotonic. The text progresses forward in time without backtracking, because a speaker does not go back and re-pronounce a previous word. Specialized attention mechanisms enforce or encourage this property:

  • Monotonic Attention forces the attention weights to move forward only.
  • Location-Sensitive Attention feeds the previous attention weights back into the scoring function, which encourages, but does not guarantee, forward movement.
  • Gaussian Attention assumes attention follows a Gaussian distribution around a center that moves forward.
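A minimal sketch of the Gaussian variant is given below, assuming a single Gaussian window with a fixed width. The center advances by a softplus-transformed step, so it can never move backward. The function name `gaussian_attention`, the `raw_steps` values, and the fixed `sigma` are illustrative assumptions. In a real model the steps (and often the width) would be predicted by the decoder at each frame.

```python
import numpy as np


def gaussian_attention(raw_steps, num_tokens, sigma=1.0):
    """Attention weights from a Gaussian window whose center only moves forward.

    raw_steps: one unconstrained real number per output frame.
    """
    # softplus(x) = log(1 + exp(x)) is always positive, so every step is forward.
    steps = np.logaddexp(0.0, raw_steps)
    centers = np.cumsum(steps)
    positions = np.arange(num_tokens)
    distance = positions[None, :] - centers[:, None]
    weights = np.exp(-0.5 * (distance / sigma) ** 2)
    weights = weights / weights.sum(axis=1, keepdims=True)
    return centers, weights


# Placeholder step predictions for 6 frames over 5 input tokens.
raw_steps = np.array([-1.0, 0.0, 0.5, -0.5, 1.0, 0.0])
centers, weights = gaussian_attention(raw_steps, num_tokens=5)
peak_tokens = weights.argmax(axis=1)
```

Even though some raw steps are negative, `centers` strictly increases, so the token with the highest weight never moves backward from one frame to the next. This sketch does not guard against the center running far past the last token, where all weights would underflow.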

Alignment Errors and Artifacts

When alignment fails, the output exhibits characteristic artifacts:

  • Skipping — The model jumps ahead in the text, omitting words or syllables
  • Stuttering — The model repeats the same phoneme or word
  • Babbling — The model loses alignment entirely and produces unintelligible sounds
  • Late start — The model generates silence or noise before finding the correct alignment

These are among the most common failure modes in attention-based neural TTS, and in most cases they trace back to alignment.
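Some of these artifacts leave simple signatures in the attention matrix, which the following sketch looks for. It is a heuristic under stated assumptions, not an established diagnostic. The function name `diagnose_alignment`, the `min_focus` threshold of 0.5, and the example path are hypothetical. Tokens that are never the most-attended one suggest skipping, backward jumps of the attention peak suggest repetition, and frames without a clear peak suggest lost alignment.

```python
import numpy as np


def diagnose_alignment(weights, min_focus=0.5):
    """Heuristic checks on a (num_frames, num_tokens) attention matrix."""
    weights = np.asarray(weights)
    path = weights.argmax(axis=1)  # most-attended token for each frame
    jumps = np.diff(path)
    all_tokens = set(range(weights.shape[1]))
    return {
        # Tokens that are never the attention peak: possible skipping.
        "never_attended_tokens": sorted(all_tokens - set(path.tolist())),
        # Peak moves to an earlier token: possible repetition or stuttering.
        "backward_jumps": int((jumps < 0).sum()),
        # No token receives most of the weight: possible babbling.
        "unfocused_frames": int((weights.max(axis=1) < min_focus).sum()),
    }


# Synthetic one-hot attention: token 2 is skipped and token 1 is revisited.
path = [0, 0, 1, 3, 3, 1, 4]
weights = np.eye(5)[path]
report = diagnose_alignment(weights)
print(report)
# {'never_attended_tokens': [2], 'backward_jumps': 1, 'unfocused_frames': 0}
```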

In Practice

Alignment quality is typically evaluated by plotting the attention matrix, with output frames on one axis and input tokens on the other. A sharp, monotonic, roughly diagonal band indicates good alignment. Scattered, blurred, or broken attention indicates problems. For production TTS, models with robust monotonic attention mechanisms are preferred to minimize alignment failures.
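For a quick look without a plotting library, the attention matrix can be rendered as text. The sketch below is one assumed way to do it: the function name `render_attention` and the shade characters are arbitrary choices, and the weights are a synthetic one-hot example rather than output from a real model.

```python
def render_attention(weights, shades=" .:*#"):
    """Render a (frames x tokens) attention matrix as text.

    Output frames run left to right and input tokens run bottom to top,
    so a good alignment appears as a band rising from the lower left.
    """
    num_tokens = len(weights[0])
    rows = []
    for token in reversed(range(num_tokens)):
        row = ""
        for frame_weights in weights:
            # Map a weight in [0, 1] to one of the shade characters.
            level = min(int(frame_weights[token] * len(shades)), len(shades) - 1)
            row += shades[level]
        rows.append(row)
    return "\n".join(rows)


# Synthetic alignment: 5 frames over 3 tokens, following the path 0, 0, 1, 1, 2.
path = [0, 0, 1, 1, 2]
weights = [[1.0 if token == p else 0.0 for token in range(3)] for p in path]
picture = render_attention(weights)
print(picture)
```

For this example the printed picture is a three-row staircase climbing from the bottom-left corner to the top-right corner. A failed alignment would show gaps, smeared columns, or a band that drops back down.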