A TTS system converts text into speech through a sequence of processing stages. Understanding the pipeline helps diagnose quality issues and choose the right approach for a given use case.
Text analysis normalizes raw text: it expands abbreviations (“Dr.” → “Doctor”), spells out numbers (“$5.50” → “five dollars and fifty cents”), and resolves homographs (present-tense “read”, pronounced like “reed”, vs. past-tense “read”, pronounced like “red”).
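As a rough illustration of this stage, the sketch below expands a few abbreviations and spells out small dollar amounts. The abbreviation table, the function names, and the under-$1000 limit are all assumptions made for the example; production normalizers are context-aware and cover far more cases (dates, ordinals, units, and so on).

```python
import re

# Hypothetical abbreviation table. A real system needs context:
# "Dr." can also mean "Drive", "St." can mean "Street".
ABBREVIATIONS = {"Dr.": "Doctor", "Mr.": "Mister"}

ONES = ["zero", "one", "two", "three", "four", "five", "six", "seven",
        "eight", "nine", "ten", "eleven", "twelve", "thirteen", "fourteen",
        "fifteen", "sixteen", "seventeen", "eighteen", "nineteen"]
TENS = ["", "", "twenty", "thirty", "forty", "fifty", "sixty", "seventy",
        "eighty", "ninety"]


def number_to_words(n: int) -> str:
    """Spell out an integer in the range 0..999."""
    if not 0 <= n <= 999:
        raise ValueError("this sketch only handles 0..999")
    if n < 20:
        return ONES[n]
    if n < 100:
        tens, ones = divmod(n, 10)
        return TENS[tens] + ("-" + ONES[ones] if ones else "")
    hundreds, rest = divmod(n, 100)
    words = ONES[hundreds] + " hundred"
    return words + (" " + number_to_words(rest) if rest else "")


def expand_currency(match: re.Match) -> str:
    """Turn a matched "$D" or "$D.CC" amount into words."""
    dollars = int(match.group(1))
    cents = int(match.group(2)) if match.group(2) else 0
    parts = []
    if dollars or not cents:
        unit = "dollar" if dollars == 1 else "dollars"
        parts.append(f"{number_to_words(dollars)} {unit}")
    if cents:
        unit = "cent" if cents == 1 else "cents"
        parts.append(f"{number_to_words(cents)} {unit}")
    return " and ".join(parts)


def normalize(text: str) -> str:
    """Expand known abbreviations, then spell out dollar amounts under $1000."""
    for abbr, full in ABBREVIATIONS.items():
        text = text.replace(abbr, full)
    # Amounts with four or more dollar digits are left unchanged in this sketch.
    return re.sub(r"\$(\d{1,3})(?:\.(\d{2}))?(?!\d)", expand_currency, text)


print(normalize("Dr. Smith paid $5.50"))
```

Homograph resolution is deliberately left out here: it usually requires part-of-speech or wider context, which simple string substitution cannot provide.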
G2P (Grapheme-to-Phoneme) converts the normalized text into a sequence of phonemes, the smallest units of sound that distinguish one word from another. Pronunciation accuracy is largely determined at this stage.
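One simple way to picture G2P is a pronunciation-dictionary lookup. The sketch below uses a tiny, hypothetical lexicon with ARPAbet-style symbols (the notation used by the CMU Pronouncing Dictionary) and reports out-of-vocabulary words separately. Real systems typically back the lexicon with a trained model for unknown words; that fallback is not shown here.

```python
# Tiny hypothetical lexicon; digits on vowels mark stress (1 = primary, 0 = none).
LEXICON = {
    "hello": ["HH", "AH0", "L", "OW1"],
    "world": ["W", "ER1", "L", "D"],
    "doctor": ["D", "AA1", "K", "T", "ER0"],
}


def g2p(text: str) -> tuple[list[str], list[str]]:
    """Return (phonemes, out_of_vocabulary_words) for already-normalized text."""
    phonemes: list[str] = []
    oov: list[str] = []
    for token in text.lower().split():
        word = token.strip(".,!?;:\"'")
        if not word:
            continue
        if word in LEXICON:
            phonemes.extend(LEXICON[word])
        else:
            # A real system would hand this word to a learned G2P model.
            oov.append(word)
    return phonemes, oov


phonemes, unknown = g2p("Hello, world!")
print(phonemes, unknown)
```

Tracking the out-of-vocabulary list is useful in practice: rare names and loanwords that miss the lexicon are common candidates for mispronunciation.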
Prosody generation predicts pitch contour, duration, and stress patterns for the phoneme sequence. Early systems used rules; modern systems infer prosody from context using neural networks.
The acoustic model converts the phoneme and prosody representation into acoustic features, typically mel-spectrograms. This is often one of the most computationally intensive stages.
The vocoder converts the acoustic features into a raw audio waveform. Its quality largely determines whether the final output sounds clean or contains audible artifacts.
Modern TTS systems increasingly combine multiple pipeline stages into a single neural network, reducing compounding errors and simplifying training. However, modular pipelines still offer advantages for debugging — when output quality degrades, you can isolate which stage is responsible.
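The debugging advantage of a modular design can be sketched as a runner that records every intermediate result. The stage functions below are placeholders invented for the example (they do not perform real G2P or prosody prediction); the point is only that each stage's output can be inspected on its own.

```python
from typing import Any, Callable

Stage = tuple[str, Callable[[Any], Any]]


def run_pipeline(stages: list[Stage], data: Any) -> tuple[Any, list[tuple[str, Any]]]:
    """Run stages in order and keep each stage's output for inspection."""
    trace = []
    for name, fn in stages:
        data = fn(data)
        trace.append((name, data))
    return data, trace


# Placeholder stages, assumed for illustration only.
stages: list[Stage] = [
    ("text_analysis", lambda text: text.replace("Dr.", "Doctor")),
    ("g2p", lambda text: text.lower().split()),             # stand-in: words, not phonemes
    ("prosody", lambda units: [(u, 1.0) for u in units]),   # stand-in: uniform duration
]

output, trace = run_pipeline(stages, "Dr. Smith")
for name, value in trace:
    print(f"{name}: {value!r}")
```

With an end-to-end model there is no equivalent trace to read, which is the trade-off the paragraph above describes.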
For most users, the pipeline is invisible — you input text and receive audio. But understanding the stages helps explain why certain errors occur: mispronunciations point to G2P issues, robotic cadence points to prosody problems, and buzziness or artifacts point to the vocoder.