How TTS Actually Works — The Pipeline

Text-to-speech looks like magic from the outside — text goes in, speech comes out. But the transformation from a string of characters to an audible waveform is a sequence of distinct processing stages, each with its own engineering challenges and failure modes.

This article walks through every stage of the modern TTS pipeline, from raw text to final audio. It covers what each stage does, how different architectures implement it, and where the real tradeoffs are.

The Pipeline at a Glance

Every TTS system, regardless of architecture, performs these transformations, either as explicit modules or implicitly inside a learned model:

Raw text (Unicode string)
  → Stage 1: Text Normalization
  → Normalized text
  → Stage 2: Grapheme-to-Phoneme (G2P)
  → Phoneme sequence + linguistic features
  → Stage 3: Prosody Prediction
  → Duration, pitch, energy contours
  → Stage 4: Acoustic Generation
  → Acoustic features or audio tokens
  → Stage 5: Waveform Synthesis (Vocoding)
  → PCM audio (WAV / MP3 / raw samples)

Stages 1-3 are linguistic processing — they turn text into structured linguistic representations. Stages 4-5 are acoustic processing — they turn linguistic representations into sound.

Modern end-to-end models collapse some of these stages into neural networks that learn the transformations implicitly, but the underlying information flow is the same. Understanding each stage explains why TTS systems behave the way they do, and where the remaining failure modes live.

Stage 1: Text Normalization

Text normalization is the most underestimated stage in the pipeline. It takes raw input text and converts it into a form that downstream stages can process predictably.

What needs normalizing

Case and whitespace:

"  HELLO   WORLD  "  →  "hello world"

Unicode normalization: Different Unicode representations of the same character (e.g., composed vs. decomposed accented letters) must be collapsed into a canonical form, typically NFC or NFD.

# "é" can be U+00E9 (single codepoint) or U+0065 U+0301 (e + combining acute)
"e\u0301" → "\u00e9"  # NFC composes e + combining acute into the single codepoint
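As a minimal sketch of the canonical-form step, Python's standard unicodedata module can convert between the composed and decomposed spellings. The variable names are only illustrative.

```python
import unicodedata

composed = "\u00e9"      # "é" as a single code point
decomposed = "e\u0301"   # "e" followed by a combining acute accent

# The two strings render identically but compare unequal until normalized.
print(composed == decomposed)

nfc = unicodedata.normalize("NFC", decomposed)  # compose into U+00E9
nfd = unicodedata.normalize("NFD", composed)    # decompose into U+0065 U+0301

print(nfc == composed, len(nfc))
print(nfd == decomposed, len(nfd))
```

Whichever form a pipeline picks, the point is to apply it once at the front so that later lookups (lexicons, regex rules) see a single spelling.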

Number expansion: Numbers must be expanded into words appropriate to their context:

Input	Context	Expansion
42	“The answer is 42”	“forty-two”
2026	“In 2026 we launched”	“twenty twenty-six”
2026	“The total is 2026”	“two thousand twenty-six”
$49.99	“It costs $49.99”	“forty-nine dollars and ninety-nine cents”
3.14	“Pi is 3.14”	“three point one four”
+1 (555) 123-4567	“Call +1 555…”	“plus one five five five one two three four five six seven”

Context-dependent expansion requires some understanding of what the text means. “1.2.1” is a plausible version number, read as “one point two point one”, but it is not a valid decimal, so a rule that treats every dotted digit string as a number will mishandle it. Many TTS systems use regex cascades or finite-state transducers with language-specific rules; some add neural classifiers to disambiguate context.
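The cardinal and year readings from the table can be sketched in a few lines. This is a simplified illustration, not how any particular TTS front end does it: the function names are made up here, it covers only 0 to 999,999, and deciding whether a token is a year is left to the caller.

```python
ONES = ["zero", "one", "two", "three", "four", "five", "six", "seven", "eight",
        "nine", "ten", "eleven", "twelve", "thirteen", "fourteen", "fifteen",
        "sixteen", "seventeen", "eighteen", "nineteen"]
TENS = ["", "", "twenty", "thirty", "forty", "fifty", "sixty", "seventy",
        "eighty", "ninety"]


def cardinal(n: int) -> str:
    """Spell out 0 <= n < 1,000,000 as an English cardinal (US style, no "and")."""
    if not 0 <= n < 1_000_000:
        raise ValueError("out of range for this sketch")
    if n < 20:
        return ONES[n]
    if n < 100:
        tens, ones = divmod(n, 10)
        return TENS[tens] + ("-" + ONES[ones] if ones else "")
    if n < 1000:
        hundreds, rest = divmod(n, 100)
        return ONES[hundreds] + " hundred" + (" " + cardinal(rest) if rest else "")
    thousands, rest = divmod(n, 1000)
    return cardinal(thousands) + " thousand" + (" " + cardinal(rest) if rest else "")


def year(n: int) -> str:
    """Read a four-digit year as two two-digit groups where that is conventional."""
    if not 1000 <= n <= 9999:
        return cardinal(n)
    high, low = divmod(n, 100)
    if low == 0:
        # 2000 reads as a cardinal, 1900 as "nineteen hundred"
        return cardinal(n) if high % 10 == 0 else cardinal(high) + " hundred"
    if low < 10:
        # 2005 reads as a cardinal, 1905 as "nineteen oh five"
        return cardinal(n) if high % 10 == 0 else cardinal(high) + " oh " + ONES[low]
    return cardinal(high) + " " + cardinal(low)


print(cardinal(42))    # forty-two
print(year(2026))      # twenty twenty-six
print(cardinal(2026))  # two thousand twenty-six
```

The hard part in practice is the classifier that chooses between cardinal() and year() for a given token, which is exactly where rule cascades or neural disambiguation come in.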

Abbreviation and acronym expansion:

Abbreviation	Spoken form
“Dr.”	“doctor” or “drive”
“St.”	“saint” or “street”
“e.g.”	“for example”
“i.e.”	“that is”
“vs.”	“versus”
“NASA”	“nasa” (as a word)
“API”	“A P I” (letter by letter)
“UNICEF”	“you-ni-cef” (as a word)

The abbreviation/acronym disambiguation problem is genuinely hard. “NASA” is acronymic (spoken as a word), “FBI” is initialistic (spoken letter by letter), and “URL” is contested (some say “earl”, most say “U R L”).
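One common way to handle the split is a small exception list: all-caps tokens are spelled out unless they are known to be pronounced as words. The sketch below assumes a hypothetical SPOKEN_AS_WORD set; a production lexicon would be far larger and would also need entries for contested cases.

```python
# Hypothetical exception list: acronyms that are pronounced as ordinary words.
SPOKEN_AS_WORD = {"NASA", "UNICEF", "NATO"}


def expand_acronym(token: str) -> str:
    """Return a spoken form for an all-caps token, leaving other tokens alone."""
    if not (token.isalpha() and token.isupper() and len(token) > 1):
        return token                  # not an all-caps abbreviation
    if token in SPOKEN_AS_WORD:
        return token.lower()          # let G2P treat it as a normal word
    return " ".join(token)            # initialism: spell it letter by letter


for tok in ["NASA", "API", "FBI", "Hello"]:
    print(tok, "->", expand_acronym(tok))
```

Defaulting to letter-by-letter is the safer fallback, since a spelled-out acronym is usually still understandable while a mispronounced pseudo-word often is not.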

Date and time expansion:

"2024-03-15" → "March fifteenth, twenty twenty-four"  # US convention
"2024-03-15" → "the fifteenth of March, twenty twenty-four"  # UK convention
"3:45 PM"    → "three forty-five PM"
"3:45"       → "three forty-five" or "quarter to four"

URL and email handling:

"https://github.com/user/repo" → "github dot com slash user slash repo"
"user@example.com"             → "user at example dot com"
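A rough sketch of this kind of rule, assuming the scheme is dropped and each separator is replaced with its spoken name. The SYMBOLS table and the function name are illustrative choices, not taken from any specific system.

```python
import re

# Spoken names for separators; the wording is a choice made for this sketch.
SYMBOLS = {".": " dot ", "/": " slash ", "@": " at ", "-": " dash ", "_": " underscore "}


def speak_address(text: str) -> str:
    """Turn a URL or email address into a speakable phrase."""
    # Drop a leading scheme such as "https://".
    text = re.sub(r"^[a-z][a-z0-9+.-]*://", "", text.strip(), flags=re.IGNORECASE)
    text = text.rstrip("/")
    spoken = "".join(SYMBOLS.get(ch, ch) for ch in text)
    return " ".join(spoken.split())   # collapse the extra spaces


print(speak_address("https://github.com/user/repo"))
print(speak_address("user@example.com"))
```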

Special characters and symbols:

"&" → "and"
"%" → "percent"
"±" → "plus or minus"
"→" → "arrow" or "leads to"
"#" → "number" (before a numeral) or "hashtag" (social media)

How TTS systems handle it

Traditional: Rule-based normalization cascades (eSpeak, Festival, MaryTTS) with hundreds to thousands of regex patterns
Neural: Seq2seq models trained on (raw, normalized) pairs — explored in Google’s text normalization research and used in some modern systems
Hybrid: Rules for deterministic cases (dates, numbers) + neural for ambiguous cases
Kokoro: Delegates to the Misaki G2P engine which includes normalization
Orpheus, Qwen3-TTS, Chatterbox: Expect pre-normalized input; do minimal normalization internally

Failure mode: A TTS system that sounds great on clean sentences will break on “The API v2.1 costs $9.99/mo, saving you ~15%!” unless normalization handles abbreviations, version numbers, currency, per-month, and approximation symbols correctly.

Stage 2: Grapheme-to-Phoneme (G2P)

Grapheme-to-phoneme conversion maps written characters to the sounds they represent. This is the bridge between orthography and pronunciation.

Phoneme inventories

Different languages use different phoneme sets. English TTS commonly uses:

ARPABET — 39-40 phonemes, ASCII-only, common in US TTS research: K AE1 T for “cat”
IPA — International Phonetic Alphabet, precise but requires Unicode: /kæt/
Custom — Each system often builds its own reduced set tuned to the training data

Word	ARPABET	IPA
cat	K AE1 T	/kæt/
dog	D AO1 G	/dɔɡ/
thought	TH AO1 T	/θɔːt/
rhythm	R IH1 DH AH0 M	/ˈɹɪðəm/

The heteronym problem

Heteronyms are words spelled identically but pronounced differently based on meaning:

Word	Pronunciation	Context
read	/ɹɛd/ (past tense)	“I read that book yesterday”
read	/ɹid/ (present tense)	“I read books every day”
lead	/lɛd/ (the metal)	“pencil lead is toxic”
lead	/lid/ (to guide)	“she will lead the team”
bass	/bæs/ (fish)	“caught a large bass”
bass	/beɪs/ (low tone)	“turn up the bass”
wound	/wund/ (injury)	“the wound healed slowly”
wound	/waʊnd/ (twisted)	“she wound the clock”

Traditional G2P systems handle this with part-of-speech tagging and lookup tables. Neural G2P systems learn contextual disambiguation from training data — if they have enough examples of each reading.
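The lookup-table half of that approach can be sketched as a dictionary keyed by word and part-of-speech tag. The tag names below are invented for this example, and the tagger that produces them is assumed to run upstream; it is not shown.

```python
# (word, coarse tag) -> IPA. Tag names are hypothetical and would come from
# whatever part-of-speech tagger sits in front of this table.
HETERONYMS = {
    ("read", "VERB_PAST"): "ɹɛd",
    ("read", "VERB_PRESENT"): "ɹid",
    ("lead", "NOUN"): "lɛd",
    ("lead", "VERB"): "lid",
    ("wound", "NOUN"): "wund",
    ("wound", "VERB_PAST"): "waʊnd",
}


def pronounce(word: str, tag: str, default=None):
    """Return the pronunciation for a heteronym reading, or `default` if unknown."""
    return HETERONYMS.get((word.lower(), tag), default)


print(pronounce("read", "VERB_PAST"))
print(pronounce("Lead", "VERB"))
print(pronounce("cat", "NOUN"))   # not a heteronym, so the table has no entry
```

The table itself is trivial; accuracy depends almost entirely on the tagger, which is why tagging errors show up directly as mispronunciations in these systems.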

G2P approaches

Rule-based (eSpeak, Festival): Hand-written pronunciation rules with exception dictionaries. Predictable output, zero variance between runs. Brittle for names, loanwords, and neologisms.

# Conceptual eSpeak-style rule
"c" → /s/ before e, i, y
"c" → /k/ before a, o, u, consonants

Dictionary-based (CMUdict, Moby): Large lookup tables mapping full words to phoneme sequences. High accuracy for known words, zero coverage for unknown ones. CMU Pronouncing Dictionary covers ~134,000 English words.

# CMUdict entry for "technology"
T EH0 K N AA1 L AH0 JH IY0
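The coverage problem is easy to see in code. The sketch below uses a two-entry stand-in lexicon (the real CMUdict would be loaded from its data file) and simply reports which words fall outside it.

```python
# Tiny stand-in for a pronouncing dictionary; entries follow the examples above.
LEXICON = {
    "cat": ["K", "AE1", "T"],
    "technology": ["T", "EH0", "K", "N", "AA1", "L", "AH0", "JH", "IY0"],
}


def lookup(word: str):
    """Return ARPABET phonemes for a known word, or None when it is out of vocabulary."""
    return LEXICON.get(word.lower().strip(".,!?;:"))


def split_by_coverage(text: str):
    """Separate the words of `text` into (known pronunciations, unknown words)."""
    known, unknown = {}, []
    for word in text.split():
        phones = lookup(word)
        if phones is None:
            unknown.append(word)
        else:
            known[word] = phones
    return known, unknown


known, unknown = split_by_coverage("Kokoro technology")
print(known)
print(unknown)
```

Anything that lands in the unknown list needs a fallback, typically letter-to-sound rules or a neural model, which is why dictionary-only systems are rare in practice.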

Neural (phonemizer + backend, Misaki, Transformer G2P): Seq2seq transformer trained on (grapheme, phoneme) pairs. Generalizes to unseen words. Can learn context from surrounding words. Requires training data.

# Misaki (used by Kokoro)
from misaki import en

g2p = en.G2P()
phonemes, tokens = g2p("The technology works")
# phonemes is a string with stress markers; tokens holds per-token detail

Kokoro’s approach

Kokoro uses Misaki G2P with language-specific backends. The lang_code parameter selects the phoneme inventory and rule set:

"a" — American English
"b" — British English
"j" — Japanese
"k" — Korean
"z" — Mandarin Chinese

The G2P output is a phoneme string passed directly to the acoustic model. Misaki handles context-dependent rules, stress assignment, and basic text normalization in a single pass.

Orpheus’s approach

Orpheus skips explicit G2P entirely. The Llama-3.2-3B backbone is trained on interleaved text and SNAC audio tokens. The model learns grapheme-to-phoneme mapping implicitly through the next-token prediction objective. This is the key architectural difference: instead of a separate G2P module, the LLM internalizes the mapping.

This works because the model sees text tokens paired with corresponding audio tokens during training. It can learn that “read” in one context produces one audio token sequence and in another context produces a different one. This is why Orpheus can sometimes handle heteronyms without an explicit disambiguation module — but it also means pronunciation errors are harder to patch with a simple dictionary edit.

Stage 3: Prosody Prediction

Prosody is the rhythm, stress, and intonation of speech — the patterns that make it sound human rather than flat and robotic.

What prosody covers

Duration: How long each phoneme lasts. A sentence-final syllable is typically lengthened. Unstressed vowels are shorter than stressed ones.

Pitch contour: The fundamental frequency (F0) movement. Questions rise in pitch at the end. Statements fall. Excitement raises overall pitch range. Boredom compresses it.

Pauses: Silence between phrases, sentences, and paragraphs. Not all commas produce pauses, and not all pauses come from commas.

Stress: Emphasizing certain words or syllables to convey meaning. Compare (capitals mark the stressed word):

“I DIDN’T take the money” (denial)
“I didn’t take YOUR money” (specific denial: someone else’s, not yours)
“I didn’t take your MONEY” (took something else)

Traditional prosody prediction

Older TTS systems (diphone synthesis, HMM-based synthesis) used dedicated prosody models:

# Conceptual: duration prediction from linguistic features
phone_duration_ms = f(
    phone_identity,    # /æ/ vs /i/ 
    syllable_stress,   # primary, secondary, unstressed
    phrase_position,   # initial, medial, final
    sentence_type,     # statement, question, exclamation
    speaking_rate      # slow, normal, fast
)
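One simple way to realize such a function is a multiplicative model: a base duration per phone, scaled by stress, phrase position, and speaking rate. Every number below is a placeholder chosen for illustration, not a measured value, and real systems typically fit these factors (or a regression tree or network) from data.

```python
# Placeholder values for illustration only; they are not measurements.
BASE_MS = {"æ": 110.0, "i": 95.0, "t": 60.0}
STRESS_SCALE = {"primary": 1.2, "secondary": 1.0, "unstressed": 0.8}
POSITION_SCALE = {"initial": 1.0, "medial": 1.0, "final": 1.4}  # phrase-final lengthening


def phone_duration_ms(phone, stress, position, speaking_rate=1.0):
    """Estimate a phone's duration in milliseconds from a few linguistic features."""
    if speaking_rate <= 0:
        raise ValueError("speaking_rate must be positive")
    base = BASE_MS.get(phone, 80.0)   # fallback for phones missing from the table
    return base * STRESS_SCALE[stress] * POSITION_SCALE[position] / speaking_rate


long_vowel = phone_duration_ms("æ", "primary", "final")
short_vowel = phone_duration_ms("æ", "unstressed", "medial")
print(long_vowel > short_vowel)
print(phone_duration_ms("æ", "primary", "final", speaking_rate=2.0) < long_vowel)
```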

Pitch contours were predicted using superpositional models:

Phrase curve: Slow rise then fall over the utterance
Accent curve: Local pitch movements on stressed syllables
Micro-prosody: Perturbations caused by specific phonemes (voiceless consonants raise pitch)
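The first two components can be sketched as curves that are simply added together. This is a deliberately simplified illustration: the shapes (a half sine for the phrase, Gaussian bumps for accents), the parameter values, and the choice to add in Hz rather than in the log domain are all assumptions made for this example.

```python
import math


def phrase_curve(t, duration, base_hz=120.0, peak_hz=30.0):
    """Slow rise then fall across the utterance, modeled here as a half sine."""
    return base_hz + peak_hz * math.sin(math.pi * t / duration)


def accent_curve(t, accent_times, height_hz=25.0, width_s=0.08):
    """Local pitch bumps centered on stressed syllables, modeled as Gaussians."""
    return sum(height_hz * math.exp(-0.5 * ((t - c) / width_s) ** 2)
               for c in accent_times)


def f0_contour(duration, accent_times, frame_s=0.01):
    """Superpose phrase and accent components, one F0 value per frame."""
    n_frames = int(round(duration / frame_s)) + 1
    return [phrase_curve(i * frame_s, duration) + accent_curve(i * frame_s, accent_times)
            for i in range(n_frames)]


contour = f0_contour(duration=1.0, accent_times=[0.3, 0.7])
print(len(contour))
print(contour[30] > phrase_curve(0.3, 1.0))   # the accent lifts F0 above the phrase curve
```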

Modern prosody prediction

End-to-end models internalize prosody prediction into the acoustic generation stage. The model learns from data which prosodic patterns correspond to which text features. This produces more natural prosody than hand-crafted rules, but it also means the model’s prosody is only as good as the training data — and it may not respect explicit user intent.

Some models add explicit prosody control:

Chatterbox: expression or exaggeration conditioning can steer emotional delivery, depending on implementation
Orpheus: Emotive tags (<laugh>, <sigh>, <gasp>) inject explicit prosodic events
Qwen3-TTS: Instruction following for style (“speak excitedly”, “whisper”)

These work by conditioning the model on additional inputs during generation (embedding vectors, special tokens, or instruction text), steering the implicit prosody prediction toward the desired target.

Stage 4: Acoustic Generation

This is the core of the TTS pipeline — the stage that transforms linguistic features into acoustic representations. The approach used here defines the architecture of the entire system.

Paradigm 1: Autoregressive Token-Based

Predict discrete audio tokens one step at a time, using a language model.

How it works:

Token stream: [T1] [T2] [T3] ... [Tn]
Autoregressive LM predicts each token conditioned on all previous tokens:
  P(T2 | T1), P(T3 | T1, T2), ...
Audio codec decodes token sequence → waveform

Models: AudioLM, VALL-E, CosyVoice, Orpheus

Tokenizers: Audio codecs that discretize continuous audio into tokens:

Codec	Bitrate	Tokens/frame	Frame rate	Sample rate
SNAC	~0.98 kbps	7 (1 + 2 + 4 across three levels)	~12 Hz (L0)	24 kHz
EnCodec	1.5-24 kbps	2-32	75 Hz	24 kHz
DAC	~8 kbps	9	~86 Hz	44.1 kHz
S3 (Chatterbox)	~6.4 kbps	8	25 Hz	16 kHz
Qwen-TTS-12Hz	~12 kbps	16	12.5 Hz	24 kHz

Strengths:

Natural prosody — the language model captures long-range acoustic patterns
Streaming friendly — tokens produced one at a time
Simple training — standard next-token prediction loss

Weaknesses:

Autoregressive latency — must generate one token at a time
Exposure bias — training with teacher forcing, inference with sampled tokens
Repetition loops — the model can get stuck on repeating tokens
Error accumulation — a bad early token corrupts everything after it

Code example (Orpheus conceptual):

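# Conceptual autoregressive loop; a serving engine such as vLLM runs the
# equivalent internally. llm, sample, and the token ids are placeholders.
tokens = list(prompt_token_ids)           # start from the encoded text prompt
for step in range(max_tokens):
    logits = llm.forward(tokens)          # logits for every position so far
    next_token = sample(logits[-1], temperature=0.6, top_p=0.9)
    tokens.append(next_token)
    if next_token == stop_token_id:
        break
audio_tokens = tokens[len(prompt_token_ids):]
# audio_tokens → SNAC decoder → 24kHz audio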
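The sample(...) call in the loop is left abstract. A minimal pure-Python sketch of temperature scaling followed by top-p (nucleus) sampling looks like this; the function name and the rng parameter are choices made for this example, and real inference engines implement the same idea on GPU tensors.

```python
import math
import random


def sample_top_p(logits, temperature=0.6, top_p=0.9, rng=random):
    """Sample a token index using temperature scaling and nucleus (top-p) filtering."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")

    # Softmax over temperature-scaled logits (subtract the max for stability).
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]

    # Keep the smallest set of most-probable tokens whose mass reaches top_p.
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, mass = [], 0.0
    for i in order:
        kept.append(i)
        mass += probs[i]
        if mass >= top_p:
            break

    # Draw from the kept tokens, renormalized by their total mass.
    r = rng.random() * mass
    acc = 0.0
    for i in kept:
        acc += probs[i]
        if r < acc:
            return i
    return kept[-1]   # guard against floating-point round-off


rng = random.Random(0)
draws = [sample_top_p([2.0, 1.0, 0.5, -3.0], rng=rng) for _ in range(1000)]
print(sorted(set(draws)))
```

Lower temperature and lower top_p both narrow the distribution, which reduces looping and garbling at the cost of flatter, less varied delivery.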

Paradigm 2: Flow Matching

Learn a continuous vector field that transforms noise into the target acoustic distribution.

How it works:

Start: Gaussian noise z ~ N(0, I)
Learn: Vector field v(t, x) that pushes z toward the data distribution
Inference: Solve ODE from t=0 to t=1
  x_0 = z
  x_{t+dt} = x_t + v(t, x_t) * dt
  x_1 ≈ target mel-spectrogram

Models: Chatterbox (S3Token2Mel), Voicebox, Matcha-TTS, CosyVoice 2

Architecture details (Chatterbox):

# Chatterbox flow matching: 10-step Euler solver
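def solve_euler(self, x, t_span, mu, mask, spks, cond):
    # t_span holds n_timesteps + 1 time points running from 0 to 1
    for i in range(len(t_span) - 1):  # 10 steps
        dt = t_span[i+1] - t_span[i]
        # estimator is the learned vector field v(t, x) with its conditioning
        dxdt = self.estimator(x, t_span[i], mu, mask, spks, cond)
        x = x + dt * dxdt
    return x  # final mel-spectrogram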
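To see the Euler update in isolation, the sketch below swaps the learned estimator for a hand-written vector field whose answer is known in advance: the straight-line field that carries any point to a fixed target at t = 1. That substitution is purely illustrative; a trained model has no access to the target and predicts the velocity from its conditioning instead.

```python
def straight_line_field(t, x, target):
    """Velocity of the straight path from x (at time t) that reaches target at t = 1."""
    return [(g - xi) / (1.0 - t) for xi, g in zip(x, target)]


def solve_euler(x0, target, n_steps=10):
    """Integrate dx/dt = v(t, x) from t = 0 to t = 1 with fixed-size Euler steps."""
    x = list(x0)
    dt = 1.0 / n_steps
    for k in range(n_steps):
        t = k * dt                                   # t stays below 1, so no division by zero
        v = straight_line_field(t, x, target)
        x = [xi + dt * vi for xi, vi in zip(x, v)]   # x_{t+dt} = x_t + v(t, x_t) * dt
    return x


noise = [0.3, -1.2, 0.8]      # stands in for a sample z ~ N(0, I)
target = [1.0, 2.0, 3.0]      # stands in for a mel-spectrogram frame
result = solve_euler(noise, target)
print(all(abs(r - g) < 1e-9 for r, g in zip(result, target)))
```

With a learned field the trajectory is only approximately straight, which is why the step count matters: fewer steps mean larger integration error and lower fidelity.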

Strengths:

High quality — flow matching produces excellent fidelity
Controllable — conditioning vectors directly steer generation
Stable training — no adversarial learning, no mode collapse
Fixed step count — always N solver steps regardless of input length, though the cost of each step still grows with the utterance

Weaknesses:

Slow inference — 10-50 ODE steps needed for quality
Not streaming friendly — full spectrogram is generated before output
Computationally heavier — multiple forward passes per utterance

Distillation (Turbo trick): Chatterbox Turbo uses distillation-style optimization to reduce the cost of multi-step generation, trading some quality for faster inference.

Paradigm 3: Diffusion

Iteratively denoise Gaussian noise into the target distribution.

How it works:

Training: Add noise to mel-spectrograms, learn to predict noise
Inference: Start from pure noise, iteratively denoise:
  x_T ~ N(0, I)  # pure noise
  x_{T-1} = denoise(x_T)
  ...
  x_0 = target spectrogram

Models: Grad-TTS (mel-spectrogram diffusion); WaveGrad and DiffWave (the same idea applied directly to waveforms, as vocoders)

Strengths:

High quality — diffusion can produce strong generative fidelity
Good coverage — doesn’t collapse to average outputs

Weaknesses:

Slow — 50-1000 denoising steps at inference
Computationally expensive — massive memory and FLOPs
Largely replaced by flow matching in modern TTS

Paradigm 4: Parallel Acoustic Models (Non-Autoregressive)

Transform text directly to acoustic features in a single forward pass using alignment mechanisms.

How it works:

Text → [Encoder] → Hidden states → [Duration predictor] → Expanded states → [Decoder] → Mel-spectrogram

Duration predictor estimates how long each input unit lasts
The expanded states are then decoded to mel-spectrogram in parallel
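The expansion step is often called a length regulator. A minimal sketch, using plain lists in place of tensors and a made-up function name, repeats each input state for its predicted number of frames:

```python
def length_regulate(hidden_states, durations):
    """Repeat each hidden state for its predicted number of output frames."""
    if len(hidden_states) != len(durations):
        raise ValueError("need exactly one duration per hidden state")
    expanded = []
    for state, frames in zip(hidden_states, durations):
        if frames < 0:
            raise ValueError("durations must be non-negative")
        expanded.extend([state] * frames)
    return expanded


# Three phoneme states expanded to 2 + 5 + 3 = 10 frames.
frames = length_regulate(["k", "æ", "t"], [2, 5, 3])
print(frames)
print(len(frames))
```

Because the output length is fixed by the durations before decoding starts, every frame can be generated in parallel. It also means a wrong duration cannot be corrected later in the decoder.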

Models: FastSpeech 1/2, StyleTTS 2, Glow-TTS

StyleTTS 2 specific approach: StyleTTS 2 uses AdaIN (Adaptive Instance Normalization) to separate content and style. A style encoder extracts prosodic style from a reference utterance, while the content encoder processes the text. The two streams combine through AdaIN layers to produce the output.

# StyleTTS 2 high-level architecture
content = content_encoder(text)
style = style_encoder(reference_audio)
# AdaIN: normalize content statistics to match style statistics
output = adain(content, style)  # align mean and variance

Strengths:

Fast — single forward pass, no iterative sampling
Controllable style — explicit separation of content and style
Good for its size — Kokoro-82M uses this paradigm

Weaknesses:

Alignment is fragile — duration prediction errors cause robotic timing
Less natural than autoregressive for very expressive speech
Fixed output length — no natural streaming

Paradigm 5: LLM-Based (Unified Language Model)

Extend a large language model’s vocabulary with audio codec tokens and train on interleaved text+audio sequences.

How it works:

Vocabulary:
  Original: [the, cat, sat, ...] + system tokens
  Extended: [text_tokens ...] + [custom_token_1, custom_token_2, ..., custom_token_N]

Training example:
  [BOS] tara: Hello world [EOS] [AUDIO_TOKEN_284] [AUDIO_TOKEN_156] ... [EOS]

Inference:
  LLM generates interleaved text + audio tokens
  Post-processing strips text tokens, decodes audio tokens
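The vocabulary extension amounts to an ID offset: audio codes are appended after the text vocabulary, so separating the two in post-processing is a range check. The sizes and helper names below are assumptions for illustration; actual models reserve additional special tokens and may use a separate ID range per codebook position.

```python
TEXT_VOCAB_SIZE = 128_256     # assumed size of the base text vocabulary
AUDIO_CODEBOOK_SIZE = 4096    # assumed number of audio codes appended after it


def audio_token_id(code: int) -> int:
    """Map an audio codec index to its ID in the extended vocabulary."""
    if not 0 <= code < AUDIO_CODEBOOK_SIZE:
        raise ValueError("code outside the audio codebook")
    return TEXT_VOCAB_SIZE + code


def split_generated(token_ids):
    """Separate text token IDs from audio codes in a generated sequence."""
    text_ids, audio_codes = [], []
    for tid in token_ids:
        if tid >= TEXT_VOCAB_SIZE:
            audio_codes.append(tid - TEXT_VOCAB_SIZE)
        else:
            text_ids.append(tid)
    return text_ids, audio_codes


generated = [5, 17, audio_token_id(284), audio_token_id(156), 9]
print(split_generated(generated))
```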

Models: Orpheus, ChatTTS, CosyVoice 2 (hybrid)

Strengths:

Rich contextual understanding — leverages the full LLM’s linguistic capability
Zero-shot generalization — heteronyms, code-switching, few-shot conditioning
Streaming native — autoregressive by design
Full Llama ecosystem — vLLM, llama.cpp, LoRA, GGUF

Weaknesses:

Large models — often billions of parameters (Orpheus is 3B), so a GPU is usually required
No explicit prosody control — emotive tags are coarse approximations
Training cost — must train the entire LLM on speech data
Repetition/stability issues — need careful sampling parameters (repetition_penalty >= 1.1)

Stage 5: Waveform Synthesis (Vocoding)

The final stage converts acoustic representations into raw audio waveforms. The vocoder determines the fine-grained audio quality — sample-level fidelity, noise characteristics, and spectral detail.

GAN-Based Vocoders

HiFi-GAN (used by Chatterbox, CosyVoice 2):

HiFi-GAN is the most widely used vocoder in modern TTS. It consists of:

Generator: Transposed convolution blocks that upsample mel-spectrogram frames to waveform samples
Multi-Period Discriminator (MPD): Checks for periodicity at different time scales — catches repeating artifacts
Multi-Scale Discriminator (MSD): Checks at different audio resolutions — catches broadband noise

# HiFi-GAN generator upsampling path (conceptual; factors vary by config)
mel → Conv1D → transposed_conv1 (×4) → transposed_conv2 (×4) → 
      transposed_conv3 (×2) → transposed_conv4 (×2) → Conv1D → tanh → waveform
# Total: 4×4×2×2 = 64× upsampling from mel frame rate to audio sample rate
# (the published V1 config uses 8×8×2×2 = 256×, matching a 256-sample hop)

Training uses a combination of losses:

Adversarial loss: Generator tries to fool discriminators
Mel-spectrogram loss: L1 between generated and target mel
Feature matching loss: L1 between discriminator feature maps

MelGAN: Earlier GAN vocoder, simpler discriminators, lower quality than HiFi-GAN.

ISTFT-Based

ISTFTNet (used by Kokoro):

Inverse Short-Time Fourier Transform networks learn to predict both magnitude and phase from mel-spectrograms, then apply the ISTFT analytically:

# ISTFTNet approach (conceptual)
mel → [Conv stack] → magnitude_spectrum, phase_spectrum
complex_spectrum = magnitude * exp(i * phase)
waveform = istft(complex_spectrum)  # Analytical, no learned parameters
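The analytic step can be checked on a single frame without any libraries. The sketch below rebuilds a real signal from the magnitude and phase of its non-negative frequency bins. It uses a naive DFT and leaves out windowing and overlap-add, which a full ISTFT adds across frames; the function names are made up for this example.

```python
import cmath
import math


def rdft(frame):
    """Forward DFT of a real frame, keeping bins 0..N/2."""
    n = len(frame)
    return [sum(x * cmath.exp(-2j * math.pi * k * t / n) for t, x in enumerate(frame))
            for k in range(n // 2 + 1)]


def irdft(magnitude, phase, n):
    """Rebuild a real frame of even length n from magnitude and phase of bins 0..n/2."""
    half = [m * cmath.exp(1j * p) for m, p in zip(magnitude, phase)]   # magnitude * exp(i * phase)
    # Mirror the conjugate bins so that the inverse transform is real-valued.
    full = half + [half[k].conjugate() for k in range(n // 2 - 1, 0, -1)]
    return [sum(c * cmath.exp(2j * math.pi * k * t / n) for k, c in enumerate(full)).real / n
            for t in range(n)]


frame = [0.0, 1.0, 0.0, -1.0, 0.5, 0.25, -0.5, 0.0]
spectrum = rdft(frame)
rebuilt = irdft([abs(c) for c in spectrum], [cmath.phase(c) for c in spectrum], len(frame))
print(all(abs(a - b) < 1e-9 for a, b in zip(frame, rebuilt)))
```

The round trip is exact up to floating-point error, which is the point: if the network predicts magnitude and phase well, the final conversion to samples adds no error of its own.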

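Why it matters: ISTFTNet replaces the final learned upsampling layers of a HiFi-GAN-style generator with an analytic ISTFT. The ISTFT layer is deterministic and non-trainable, so the network only has to predict magnitude and phase at a reduced time resolution. The original ISTFTNet is still trained with HiFi-GAN-style adversarial losses; the gain is a smaller generator and very fast inference.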

Neural Codec Decoders

SNAC decoder (used by Orpheus):

The SNAC decoder reconstructs audio from discrete codebook indices. It takes the 3-layer hierarchical codes and produces 24kHz audio through:

codes_0, codes_1, codes_2 → [SNAC decoder] → 24kHz waveform

The decoder is a separate pre-trained model (hubertsiuzdak/snac_24khz) that remains frozen during TTS training. The TTS model only learns to produce the codes.

EnCodec decoder, DAC decoder: Similar principle — a frozen, pre-trained decoder that turns codec tokens into audio.

Traditional Vocoders (Legacy)

Vocoder	Method	Quality	Speed
Griffin-Lim	Iterative phase reconstruction from magnitude	Low	Slow
WORLD	F0 + spectral envelope + aperiodicity	Medium	Fast
WaveNet	Autoregressive sample-level prediction	High	Very slow

Griffin-Lim is still used as a fallback in some pipelines because it has no learned parameters — it works from any magnitude spectrogram.

How Real Models Map to This Pipeline

Stage	Kokoro-82M	Orpheus-3B	Chatterbox-500M	Qwen3-TTS-1.7B
Normalization	Misaki G2P (built-in)	Expects clean input	Minimal internal	Minimal internal
G2P	Misaki (phonemes)	Implicit (LLM learned)	Implicit (S3 tokens)	Implicit (codec tokens)
Prosody	Implicit (AdaIN style)	Implicit (LLM)	Explicit (emotion vector)	Instruction-based
Acoustic gen	StyleTTS 2 (parallel)	Llama AR (token AR)	Llama AR + CFM	Dual-track AR
Acoustic output	Mel-spectrogram	SNAC codes	S3 tokens → Mel	12Hz codec codes
Vocoder	ISTFTNet	SNAC decoder	HiFi-GAN	Causal ConvNet
Streaming	No	Yes, implementation-dependent	Chunked	Yes, implementation-dependent
Explicit G2P?	Yes	No	No	No

A common pattern is that smaller models often use explicit G2P (Kokoro with Misaki), while larger token-based models often learn more of the mapping implicitly (Orpheus, Qwen3-TTS, Chatterbox). Explicit modules make some pronunciation errors easier to inspect and patch; implicit models can handle more context but are harder to debug.

End-to-End Trace: One Sentence Through Two Pipelines

Sentence: “He read about the API in 2024.”

Through Kokoro (explicit stages)

Step 1 — Text Normalization:
  Input:  "He read about the API in 2024."
  Normalization rules apply:
    - "2024" → year → "twenty twenty-four"
    - "API"  → initialism, to be spelled letter by letter
  Output: "He read about the API in twenty twenty-four."

Step 2 — G2P (Misaki):
  - "read" → context check → /ɹɛd/ (past tense; third-person present would be "reads")
  - "API"  → /eɪ pi aɪ/
  Phonemes: /hi ɹɛd əˈbaʊt ðə eɪ pi aɪ ɪn ˈtwɛnti ˈtwɛnti ˈfɔːɹ/
  Stress markers on "əˈbaʊt", both "ˈtwɛnti", and "ˈfɔːɹ"

Step 3 — Acoustic Generation (StyleTTS 2):
  Phonemes + voice embedding → StyleTTS 2 encoder 
  → AdaIN with style vector → mel-spectrogram (80-channel, ~50 Hz)

Step 4 — Vocoding (ISTFTNet):
  Mel-spectrogram → predict STFT magnitude + phase
  → ISTFT → 24kHz PCM waveform

Through Orpheus (implicit, LLM-based)

Step 1 — Prompt Formatting:
  Raw text wrapped in prompt template:
  "tara: He read about the API in 2024."
  (No explicit normalization — the LLM is expected to handle it)

Step 2 — Tokenization:
  Llama tokenizer encodes the text prompt:
  [BOS] tara: He read about the API in 2024. [EOS]

Step 3 — Autoregressive Generation:
  Llama-3.2-3B predicts next-token distribution at each step:
  Step 1:  P(custom_token_28431 | prompt)  ← selects coarse SNAC token
  Step 2:  P(custom_token_15672 | prompt, t28431)
  ...
  Step N:  P(EOS | ...)

  The model must internalize:
  - Past tense reading of "read" → appropriate phoneme sequence
  - "API" as letter names, not the word "appy"
  - "2024" as a year, not a number
  - Appropriate prosody for a declarative statement

Step 4 — SNAC Decoding:
  Token sequence windowed into 28-token chunks (four 7-token frames)
  → Deinterleaved into codes_0, codes_1, codes_2
  → SNAC decoder → 24kHz PCM
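The deinterleaving in Step 4 can be sketched as follows. The layout assumed here is one coarse, two mid-level, and four fine codes per 7-token frame, in the order shown in the comments. That ordering, and the omission of any per-position ID offsets, are assumptions made for this illustration and should be checked against the model's reference decoder.

```python
def deinterleave(frame_tokens):
    """Split a flat stream of 7-token frames into coarse, mid, and fine code lists."""
    if len(frame_tokens) % 7 != 0:
        raise ValueError("expected a whole number of 7-token frames")
    codes_0, codes_1, codes_2 = [], [], []
    for start in range(0, len(frame_tokens), 7):
        f = frame_tokens[start:start + 7]
        codes_0.append(f[0])                       # 1 coarse code per frame
        codes_1.extend([f[1], f[4]])               # 2 mid-level codes per frame
        codes_2.extend([f[2], f[3], f[5], f[6]])   # 4 fine codes per frame
    return codes_0, codes_1, codes_2


# A 28-token window holds four frames.
c0, c1, c2 = deinterleave(list(range(28)))
print(c0)
print(len(c1), len(c2))
```

The 1:2:4 ratio reflects the hierarchy: each level runs at twice the temporal rate of the one above it, so finer levels contribute more codes per frame.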

The Kokoro trace is more inspectable because the G2P output is explicit — many pronunciation errors can be addressed by changing the text or pronunciation handling. The Orpheus trace is end-to-end learned, so pronunciation is less transparent and may require prompting, respelling, fine-tuning, or model changes.

Where Pipeline Decisions Affect Quality

Different stages dominate different quality dimensions:

Dimension	Dominant Stage	Why
Intelligibility	G2P (Stage 2)	Wrong phonemes produce wrong words
Naturalness	Acoustic + Prosody (Stages 3-4)	Robotic prosody sounds artificial even with perfect G2P
Audio Fidelity	Vocoder (Stage 5)	The vocoder determines sample-level quality
Pacing	Prosody + Acoustic (Stages 3-4)	Duration prediction controls speaking rate
Expressiveness	Prosody + Acoustic (Stages 3-4)	Pitch range, stress, and emotion come from here
Consistency	Normalization (Stage 1)	Inconsistent text preprocessing produces inconsistent output
Latency	Acoustic (Stage 4)	Autoregressive vs parallel vs iterative determines speed
Pronunciation	G2P (Stage 2)	Heteronyms, names, loanwords — all G2P decisions

Practical implications

If the word is wrong: The problem is often in G2P (or the implicit equivalent). In explicit G2P systems (Kokoro), you may be able to fix this by adding dictionary entries or phonetically respelling the word. In implicit systems (Orpheus, Qwen3-TTS), fixes usually rely on prompting, respelling, fine-tuning, or model changes.

If the voice sounds robotic: The problem is in the acoustic model or prosody. Parallel models (Kokoro) tend toward flatter prosody. Autoregressive models (Orpheus) produce more natural prosody but can loop or hallucinate. Flow matching (Chatterbox) sits between them.

If the audio has artifacts/static: The problem is in the vocoder. GAN vocoders can produce artifacts on out-of-distribution inputs. Neural codec decoders (SNAC) tend to be more stable, but their output is capped at the codec's sample rate (24kHz for the SNAC model Orpheus uses).

If the output is too slow or too fast: This can come from duration prediction (Kokoro, parallel models), repetition penalty settings (Orpheus), or inference hyperparameters.

Summary

The TTS pipeline is a sequence of five transformations:

Stage	Transformation	Approaches
1. Normalization	Raw text → Clean text	Rules, neural, hybrid
2. G2P	Text → Phonemes	Dictionary, rule, neural, implicit (LLM)
3. Prosody	Phonemes → Duration/pitch/stress	Explicit prediction, implicit (learned)
4. Acoustic Generation	Features → Acoustic representation	AR tokens, flow matching, diffusion, parallel, LLM
5. Vocoding	Acoustic → Waveform	HiFi-GAN, ISTFTNet, codec decoder

No single approach dominates across all dimensions. The tradeoff space is defined by where each model puts the burden:

Kokoro puts the burden on explicit linguistic stages (normalization, G2P) and a fast parallel acoustic model — easy to fix pronunciation, harder to achieve natural prosody
Orpheus puts much of the burden on the LLM — natural prosody can emerge from scale, but pronunciation is less transparent than in explicit G2P systems
Chatterbox splits it between a Llama-based token predictor and a flow-matching mel decoder — expression can be steered through conditioning
Qwen3-TTS uses a dual-track LM feeding a lightweight codec decoder — designed for low-latency streaming at the cost of a more complex training setup

The choice of pipeline architecture determines the failure modes, the fixability, and the deployment characteristics of the system — often more than the parameter count or the training data size.
