Published Jun 02, 2026

Prosody

Prosody is the rhythm, stress, and intonation of speech — the music behind the words. It is what makes a sentence sound like a statement, question, or exclamation, and what conveys emotion, emphasis, and attitude.

Components

Pitch contour — the rise and fall of the voice across a sentence. In English, a rising pitch at the end typically signals a yes/no question, while a falling pitch signals finality. Flat pitch is the main giveaway of robotic TTS.

Duration and timing — how long each syllable and pause lasts. Natural speech has micro-variations in timing. Equal spacing sounds synthetic.

Stress and emphasis — which syllables and words are louder, longer, or higher in pitch. “I did NOT say that” means something different from “I did not say THAT.”
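The stress contrast above can be written down explicitly in SSML, which defines an <emphasis> element with a level attribute. The following is a minimal sketch, not a definitive implementation: the helper name emphasize and its word-list input are hypothetical, and how strongly a given TTS engine renders the emphasis (or whether it honors the element at all) is an assumption to check against that engine.

```python
from xml.sax.saxutils import escape

# Emphasis levels defined by the SSML specification.
VALID_LEVELS = {"strong", "moderate", "none", "reduced"}


def emphasize(words, index, level="strong"):
    """Return an SSML string with the word at `index` wrapped in <emphasis>.

    `emphasize` is a hypothetical helper for illustration only.
    """
    if level not in VALID_LEVELS:
        raise ValueError(f"unknown emphasis level: {level}")
    if not 0 <= index < len(words):
        raise IndexError("index is outside the word list")

    parts = []
    for i, word in enumerate(words):
        text = escape(word)  # escape &, <, > so the markup stays well-formed
        if i == index:
            text = f'<emphasis level="{level}">{text}</emphasis>'
        parts.append(text)
    return "<speak>" + " ".join(parts) + "</speak>"


sentence = ["I", "did", "not", "say", "that"]
print(emphasize(sentence, 2))  # stress on "not"
print(emphasize(sentence, 4))  # stress on "that"
```

The two calls produce the same words with different markup, which mirrors how the same sentence can carry two different meanings depending on where the stress falls.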

Why It Is Hard for TTS

Prosody depends on meaning, so getting it right requires understanding the text. A model must know whether “That’s great.” is sincere or sarcastic, whether a comma is a grammatical pause or a list separator, and which word in a sentence carries the primary emphasis.

Early TTS systems had only rudimentary, largely fixed prosody, so most sentences came out with much the same flat pattern. Modern neural models infer prosody from context, but it remains a major frontier for naturalness. Flat or misplaced prosody is what listeners perceive as “robotic.”

In Practice

SSML tags like <prosody rate="..." pitch="..."> give manual control over speaking rate and pitch, though the supported values vary by TTS engine. For long-form content, varying prosody per chapter or scene (slightly faster for action scenes, slower for reflective passages) helps keep narration listenable rather than fatiguing.
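As a rough illustration of per-passage variation, the sketch below wraps each passage in a <prosody> element chosen by mood. The preset names and the specific rate and pitch values are assumptions made up for this example, not recommended settings, and the function name to_ssml is hypothetical. The percentage and label values follow the forms the SSML specification allows, but a particular engine may accept only a subset.

```python
from xml.sax.saxutils import escape

# Hypothetical presets: the moods and numbers are illustrative assumptions.
PRESETS = {
    "action": {"rate": "110%", "pitch": "+5%"},      # slightly faster, higher
    "reflective": {"rate": "90%", "pitch": "-5%"},   # slightly slower, lower
    "neutral": {"rate": "medium", "pitch": "medium"},
}


def to_ssml(passages):
    """Build one SSML document from (mood, text) pairs.

    Unknown moods fall back to the "neutral" preset.
    """
    chunks = []
    for mood, text in passages:
        preset = PRESETS.get(mood, PRESETS["neutral"])
        rate = preset["rate"]
        pitch = preset["pitch"]
        # escape() protects &, <, > in the narration text
        chunks.append(
            f'<prosody rate="{rate}" pitch="{pitch}">{escape(text)}</prosody>'
        )
    return "<speak>" + "".join(chunks) + "</speak>"


chapter = [
    ("action", "Run & hide!"),
    ("reflective", "She waited."),
]
print(to_ssml(chapter))
```

Keeping the presets in one table makes it easy to tune the values by ear for a given voice, since the same percentage can sound quite different from one engine to another.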