Published Jun 02, 2026

SSML (Speech Synthesis Markup Language)

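SSML is an XML-based markup language that controls how a text-to-speech (TTS) engine renders text. It gives authors fine-grained control over pronunciation, pacing, emphasis, and the way numbers and abbreviations are read, none of which plain text can express. A complete SSML document wraps its content in a root <speak> element; the snippets below omit it for brevity.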
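Because SSML is XML, characters such as & and < in the spoken text have to be escaped before the text is wrapped in markup. The following is a minimal sketch in Python, assuming the standard-library escape helper is sufficient for the text being sent; the function name to_ssml is hypothetical and not part of any TTS SDK.

```python
from xml.sax.saxutils import escape


def to_ssml(text: str) -> str:
    """Wrap plain text in a <speak> root, escaping XML-reserved characters.

    to_ssml is a hypothetical helper name used for illustration only.
    """
    # escape() replaces &, <, and > with their XML entities.
    return f"<speak>{escape(text)}</speak>"


print(to_ssml("AT&T shares rose < 5%"))
# prints: <speak>AT&amp;T shares rose &lt; 5%</speak>
```

Escaping only the text, and never the tags, keeps hand-written markup intact while preventing stray characters from breaking the document.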

Common Tags

<break> — inserts a pause of a specified duration.

<break time="500ms"/>

<emphasis> — stresses a word or phrase.

That was <emphasis level="strong">not</emphasis> the plan.

<prosody> — adjusts rate, pitch, and volume.

<prosody rate="slow" pitch="+2st" volume="soft">She whispered softly.</prosody>

<say-as> — interprets how to read specific content.

<say-as interpret-as="cardinal">123</say-as>   <!-- one hundred twenty-three -->
<say-as interpret-as="ordinal">1</say-as>       <!-- first -->
<say-as interpret-as="characters">AI</say-as>  <!-- A-I, not "eye" -->

<phoneme> — forces a specific pronunciation using IPA.

<phoneme alphabet="ipa" ph="ˈnjuːkliər">nuclear</phoneme>
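The tags above are usually combined inside a single <speak> document. One way to do that without hand-concatenating strings is to build the tree with Python's standard xml.etree.ElementTree module, which guarantees well-formed output. This is only a sketch: the function name build_demo and the sentence content are made up for illustration, and a real engine may expect extra attributes (such as a version or namespace) on <speak>.

```python
import xml.etree.ElementTree as ET


def build_demo() -> str:
    """Assemble a small SSML document from several common tags.

    build_demo is a hypothetical name; the spoken text is illustrative.
    """
    speak = ET.Element("speak")
    speak.text = "The topic is "

    # Spell the abbreviation out letter by letter.
    say_as = ET.SubElement(speak, "say-as", {"interpret-as": "characters"})
    say_as.text = "AI"
    say_as.tail = "."

    # Half-second pause between the two sentences.
    pause = ET.SubElement(speak, "break", {"time": "500ms"})
    pause.tail = " That was "

    # Stress a single word.
    emphasis = ET.SubElement(speak, "emphasis", {"level": "strong"})
    emphasis.text = "not"
    emphasis.tail = " the plan."

    return ET.tostring(speak, encoding="unicode")


print(build_demo())
```

In ElementTree, text that follows a child element is stored in that child's tail attribute, which is why the sentence fragments are attached to the preceding tag rather than to <speak> itself.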

Why SSML Matters

Plain-text TTS has to guess pronunciation, pacing, and emphasis. SSML lets you override those guesses. For audiobooks, SSML can switch character voices and set chapter pacing. For voiceovers, it ensures brand names and technical terms are pronounced correctly. For accessibility, it controls reading speed and phrasing.

Limitations

Not all TTS engines support the same SSML tags. Some ignore <phoneme> or accept only a limited range of <prosody> values. Always test SSML output on the target engine rather than assuming it complies with the specification.
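Listening to the rendered audio is the real test, but a quick pre-flight check can catch tags an engine is known not to handle. The sketch below parses an SSML string and reports any element names missing from a support list. The set SUPPORTED_TAGS is a made-up example, not the capability list of any real engine, and the function name unsupported_tags is hypothetical; fill the set in from the target engine's documentation. It also assumes the document does not declare an XML namespace.

```python
import xml.etree.ElementTree as ET
from typing import Iterable, List

# Hypothetical support list for an imaginary engine; replace it with the
# tags your target engine actually documents.
SUPPORTED_TAGS = {"speak", "break", "emphasis", "prosody", "say-as"}


def unsupported_tags(ssml: str, supported: Iterable[str] = SUPPORTED_TAGS) -> List[str]:
    """Return the sorted element names in `ssml` that are not in `supported`.

    Raises xml.etree.ElementTree.ParseError if the markup is not well-formed.
    Assumes no XML namespace is declared on the document.
    """
    root = ET.fromstring(ssml)
    found = {element.tag for element in root.iter()}
    return sorted(found - set(supported))


sample = (
    '<speak><phoneme alphabet="ipa" ph="ˈnjuːkliər">nuclear</phoneme>'
    '<break time="500ms"/></speak>'
)
print(unsupported_tags(sample))
# prints: ['phoneme']
```

A check like this only confirms that the tags are recognized; it cannot tell whether an engine honors a particular attribute value, so it complements listening tests rather than replacing them.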