Text-to-speech looks like magic from the outside — text goes in, speech comes out. But the transformation from a string of characters to an audible waveform is a sequence of distinct processing stages, each with its own engineering challenges and failure modes.
This article walks through every stage of the modern TTS pipeline, from raw text to final audio. It covers what each stage does, how different architectures implement it, and where the real tradeoffs are.
The Pipeline at a Glance
Every TTS system, regardless of architecture, performs these transformations:
Raw text (Unicode string)
→ Stage 1: Text Normalization
→ Normalized text
→ Stage 2: Grapheme-to-Phoneme (G2P)
→ Phoneme sequence + linguistic features
→ Stage 3: Prosody Prediction
→ Duration, pitch, energy contours
→ Stage 4: Acoustic Generation
→ Acoustic features or audio tokens
→ Stage 5: Waveform Synthesis (Vocoding)
→ PCM audio (WAV / MP3 / raw samples)
Stages 1-3 are linguistic processing: they turn text into structured linguistic representations. Stages 4-5 are acoustic processing: they turn linguistic representations into sound.
Modern end-to-end models collapse some of these stages into neural networks that learn the transformations implicitly, but the underlying information flow is the same. Understanding each stage explains why TTS systems behave the way they do, and where the remaining failure modes live.
Stage 1: Text Normalization
Text normalization is the most underestimated stage in the pipeline. It takes raw input text and converts it into a form that downstream stages can process predictably.
What needs normalizing
Case and whitespace:
" HELLO WORLD " → "hello world"
Unicode normalization: Different Unicode representations of the same character (e.g., composed vs. decomposed accented letters) must be collapsed into a canonical form, typically NFC or NFD.
# "é" can be U+00E9 (single codepoint) or U+0065 U+0301 (e + combining acute)
"e\u0301" → "\u00e9" # NFC normalization composes the pair into one codepoint
Number expansion: Numbers must be expanded into words appropriate to their context:
| Input | Context | Expansion |
|---|---|---|
| 42 | “The answer is 42” | “forty-two” |
| 2026 | “In 2026 we launched” | “twenty twenty-six” |
| 2026 | “The total is 2026” | “two thousand twenty-six” |
| $49.99 | “It costs $49.99” | “forty-nine dollars and ninety-nine cents” |
| 3.14 | “Pi is 3.14” | “three point one four” |
| +1 (555) 123-4567 | “Call +1 555…” | “plus one five five five one two three four five six seven” |
Context-dependent expansion requires semantic understanding. “1.2.1” is read as “one point two point one” when it is a version number, but the same string is not a valid decimal, so a normalizer that applies decimal rules blindly has no sensible reading for it. Many TTS systems use regex cascades or finite-state transducers with language-specific rules; some systems add neural classifiers to disambiguate context.
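The context dependence in the table above can be sketched as a set of small expanders that a context classifier (not shown) would choose between. The function names `cardinal`, `year`, and `decimal` are hypothetical, the year rules cover only common four-digit cases, and a production normalizer handles far more (ordinals, currency, ranges, other languages).

```python
ONES = ["zero", "one", "two", "three", "four", "five", "six", "seven", "eight",
        "nine", "ten", "eleven", "twelve", "thirteen", "fourteen", "fifteen",
        "sixteen", "seventeen", "eighteen", "nineteen"]
TENS = ["", "", "twenty", "thirty", "forty", "fifty", "sixty", "seventy",
        "eighty", "ninety"]


def cardinal(n: int) -> str:
    """Spell out 0 <= n < 10000 as an American English cardinal."""
    if not 0 <= n < 10000:
        raise ValueError("this sketch only covers 0..9999")
    if n < 20:
        return ONES[n]
    if n < 100:
        tens, ones = divmod(n, 10)
        return TENS[tens] + (f"-{ONES[ones]}" if ones else "")
    if n < 1000:
        hundreds, rest = divmod(n, 100)
        return f"{ONES[hundreds]} hundred" + (f" {cardinal(rest)}" if rest else "")
    thousands, rest = divmod(n, 1000)
    return f"{ONES[thousands]} thousand" + (f" {cardinal(rest)}" if rest else "")


def year(n: int) -> str:
    """Read a four-digit year the way it is usually spoken."""
    if not 1000 <= n <= 9999:
        raise ValueError("this sketch only covers four-digit years")
    high, low = divmod(n, 100)
    if n % 1000 < 10:   # 2005 is read "two thousand five"
        return cardinal(n)
    if low == 0:        # 1900 is read "nineteen hundred"
        return f"{cardinal(high)} hundred"
    if low < 10:        # 1905 is read "nineteen oh five"
        return f"{cardinal(high)} oh {ONES[low]}"
    return f"{cardinal(high)} {cardinal(low)}"


def decimal(text: str) -> str:
    """Read a decimal string digit by digit after the point."""
    whole, frac = text.split(".")
    return f"{cardinal(int(whole))} point " + " ".join(ONES[int(d)] for d in frac)


print(year(2026))       # year reading
print(cardinal(2026))   # quantity reading
print(decimal("3.14"))
```

The same digits, 2026, go through two different functions depending on context. Deciding which function to call is the hard part, and that is where rule cascades or classifiers come in.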
Abbreviation and acronym expansion:
| Abbreviation | Spoken form |
|---|---|
| “Dr.” | “doctor” or “drive” |
| “St.” | “saint” or “street” |
| “e.g.” | “for example” |
| “i.e.” | “that is” |
| “vs.” | “versus” |
| “NASA” | “nasa” (as a word) |
| “API” | “A P I” (letter by letter) |
| “UNICEF” | “you-ni-cef” (as a word) |
The abbreviation/acronym disambiguation problem is genuinely hard. “NASA” is acronymic (spoken as a word), “FBI” is initialistic (spoken letter by letter), and “URL” is contested (some say “earl”, most say “U R L”).
Date and time expansion:
"2024-03-15" → "March fifteenth, twenty twenty-four" # US convention
"2024-03-15" → "the fifteenth of March, twenty twenty-four" # UK convention
"3:45 PM" → "three forty-five PM"
"3:45" → "three forty-five" or "quarter to four"
URL and email handling:
"https://github.com/user/repo" → "github dot com slash user slash repo"
"user@example.com" → "user at example dot com"
Special characters and symbols:
"&" → "and"
"%" → "percent"
"±" → "plus or minus"
"→" → "arrow" or "leads to"
"#" → "number" (before a numeral) or "hashtag" (social media)
How TTS systems handle it
- Traditional: Rule-based normalization cascades (eSpeak, Festival, MaryTTS) with hundreds to thousands of regex patterns
- Neural: Seq2seq models trained on (raw, normalized) pairs, an approach explored in Google’s text normalization research and used in some modern systems
- Hybrid: Rules for deterministic cases (dates, numbers) + neural for ambiguous cases
- Kokoro: Delegates to the Misaki G2P engine, which includes normalization
- Orpheus, Qwen3-TTS, Chatterbox: Expect pre-normalized input; do minimal normalization internally
Failure mode: A TTS system that sounds great on clean sentences will break on “The API v2.1 costs $9.99/mo, saving you ~15%!” unless normalization handles abbreviations, version numbers, currency, per-month, and approximation symbols correctly.
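A minimal sketch of a rule cascade for that sentence follows, assuming a hand-written, ordered rule table. The `RULES` list and the `normalize` function are illustrative names rather than any library's API, and the numbers are deliberately left as digits for a later number-expansion pass.

```python
import re
import unicodedata

# Hypothetical rule table: (pattern, replacement) pairs applied in order.
RULES = [
    (re.compile(r"\bv(\d+(?:\.\d+)+)\b"), r"version \1"),             # v2.1
    (re.compile(r"\$(\d+(?:\.\d+)?)/mo\b"), r"\1 dollars per month"),  # $9.99/mo
    (re.compile(r"~(\d)"), r"approximately \1"),                       # ~15
    (re.compile(r"(\d)%"), r"\1 percent"),                             # 15%
    (re.compile(r"\bAPI\b"), "A P I"),                                 # initialism
    (re.compile(r"\s*&\s*"), " and "),                                 # ampersand
]


def normalize(text: str) -> str:
    """Compose Unicode to NFC, collapse whitespace, then apply each rule once."""
    text = unicodedata.normalize("NFC", text)
    text = " ".join(text.split())
    for pattern, replacement in RULES:
        text = pattern.sub(replacement, text)
    return text


print(normalize("The API v2.1 costs $9.99/mo, saving you ~15%!"))
```

Rule order matters in a cascade like this: the currency rule has to see "$9.99/mo" before any generic rule rewrites the slash or the dollar sign. That ordering sensitivity is one reason large rule sets become hard to maintain.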
Stage 2: Grapheme-to-Phoneme (G2P)
Grapheme-to-phoneme conversion maps written characters to the sounds they represent. This is the bridge between orthography and pronunciation.
Phoneme inventories
Different languages use different phoneme sets. English TTS commonly uses:
- ARPABET: 39-40 phonemes, ASCII-only, common in US TTS research, e.g. K AE1 T for “cat”
- IPA: International Phonetic Alphabet, precise but requires Unicode, e.g. /kæt/
- Custom: Each system often builds its own reduced set tuned to the training data
| Word | ARPABET | IPA |
|---|---|---|
| cat | K AE1 T | /kæt/ |
| dog | D AO1 G | /dɔɡ/ |
| thought | TH AO1 T | /θɔːt/ |
| rhythm | R IH1 DH AH0 M | /ˈɹɪðəm/ |
The heteronym problem
Heteronyms are words spelled identically but pronounced differently based on meaning:
| Word | Pronunciation | Context |
|---|---|---|
| read | /ɹɛd/ (past tense) | “I read that book yesterday” |
| read | /ɹid/ (present tense) | “I read books every day” |
| lead | /lɛd/ (the metal) | “pencil lead is toxic” |
| lead | /lid/ (to guide) | “she will lead the team” |
| bass | /bæs/ (fish) | “caught a large bass” |
| bass | /beɪs/ (low tone) | “turn up the bass” |
| wound | /wund/ (injury) | “the wound healed slowly” |
| wound | /waʊnd/ (twisted) | “she wound the clock” |
Traditional G2P systems handle this with part-of-speech tagging and lookup tables. Neural G2P systems learn contextual disambiguation from training data — if they have enough examples of each reading.
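The lookup-table approach can be sketched as a dictionary keyed by word and part-of-speech tag. This is a minimal illustration that assumes an upstream tagger supplies Penn Treebank-style tags; the `HETERONYMS` and `DEFAULTS` tables and the `pronounce` function are hypothetical names, not part of any particular G2P library.

```python
from typing import Optional

# Hypothetical table: (lowercased word, POS tag) -> IPA pronunciation.
HETERONYMS = {
    ("read", "VBD"): "ɹɛd",     # past tense
    ("read", "VBN"): "ɹɛd",     # past participle
    ("read", "VB"): "ɹid",      # base form
    ("read", "VBP"): "ɹid",     # present tense
    ("wound", "NN"): "wund",    # the injury
    ("wound", "VBD"): "waʊnd",  # past tense of "wind"
}

# Fallback pronunciations used when no (word, tag) entry matches.
DEFAULTS = {"read": "ɹid", "wound": "wund", "bass": "beɪs"}


def pronounce(word: str, tag: str) -> Optional[str]:
    """Return the tag-specific pronunciation, else the default, else None."""
    key = word.lower()
    return HETERONYMS.get((key, tag), DEFAULTS.get(key))


print(pronounce("read", "VBD"))
print(pronounce("read", "VBP"))
print(pronounce("bass", "NN"))
```

Part-of-speech tags cannot separate the two readings of “bass”, since both are nouns, so the sketch falls back to a single default there. Cases like that need semantic context, which is where neural disambiguation helps.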
G2P approaches
Rule-based (eSpeak, Festival): Hand-written pronunciation rules with exception dictionaries. Predictable output, zero variance between runs. Brittle for names, loanwords, and neologisms.
# Conceptual eSpeak-style rule
"c" → /s/ before e, i, y
"c" → /k/ before a, o, u, consonants
Dictionary-based (CMUdict, Moby): Large lookup tables mapping full words to phoneme sequences. High accuracy for known words, zero coverage for unknown ones. CMU Pronouncing Dictionary covers ~134,000 English words.
# CMUdict entry for "technology"
T EH0 K N AA1 L AH0 JH IY0
Neural (phonemizer + backend, Misaki, Transformer G2P): Seq2seq transformer trained on (grapheme, phoneme) pairs. Generalizes to unseen words. Can learn context from surrounding words. Requires training data.
# Misaki (used by Kokoro)
from misaki import en
g2p = en.G2P()
phonemes, tokens = g2p("The technology works")
# phonemes is a phoneme string with stress markers; tokens holds per-token detail
Kokoro’s approach
Kokoro uses Misaki G2P with language-specific backends. The lang_code parameter selects the phoneme inventory and rule set:
- "a": American English
- "b": British English
- "j": Japanese
- "k": Korean
- "z": Mandarin Chinese
The G2P output is a phoneme string passed directly to the acoustic model. Misaki handles context-dependent rules, stress assignment, and basic text normalization in a single pass.
Orpheus’s approach
Orpheus skips explicit G2P entirely. The Llama-3.2-3B backbone is trained on interleaved text and SNAC audio tokens. The model learns grapheme-to-phoneme mapping implicitly through the next-token prediction objective. This is the key architectural difference: instead of a separate G2P module, the LLM internalizes the mapping.
This works because the model sees text tokens paired with corresponding audio tokens during training. It can learn that “read” in one context produces one audio token sequence and in another context produces a different one. This is why Orpheus can sometimes handle heteronyms without an explicit disambiguation module — but it also means pronunciation errors are harder to patch with a simple dictionary edit.
Stage 3: Prosody Prediction
Prosody is the rhythm, stress, and intonation of speech — the patterns that make it sound human rather than flat and robotic.
What prosody covers
Duration: How long each phoneme lasts. A sentence-final syllable is typically lengthened. Unstressed vowels are shorter than stressed ones.
Pitch contour: The fundamental frequency (F0) movement. Questions rise in pitch at the end. Statements fall. Excitement raises overall pitch range. Boredom compresses it.
Pauses: Silence between phrases, sentences, and paragraphs. Not all commas produce pauses, and not all pauses come from commas.
Stress: Emphasizing certain words or syllables to convey meaning. Compare (capitals mark the stressed word):
- “I DIDN’T take the money” (denial)
- “I didn’t take YOUR money” (specific denial)
- “I didn’t take your MONEY” (took something else)
Traditional prosody prediction
Older TTS systems (Diphone synthesis, HMM-based synthesis) used dedicated prosody models:
# Conceptual: duration prediction from linguistic features
phone_duration_ms = f(
    phone_identity, # /æ/ vs /i/
    syllable_stress, # primary, secondary, unstressed
    phrase_position, # initial, medial, final
    sentence_type, # statement, question, exclamation
    speaking_rate # slow, normal, fast
)
Pitch contours were predicted using superpositional models:
- Phrase curve: Slow rise then fall over the utterance
- Accent curve: Local pitch movements on stressed syllables
- Micro-prosody: Perturbations caused by specific phonemes (voiceless consonants raise pitch)
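The superposition idea can be sketched by summing a phrase component and an accent component frame by frame. This is a toy under stated assumptions: the phrase curve is simplified to a linear declination rather than a rise-then-fall shape, accents are modeled as Gaussian bumps, micro-prosody is omitted, and every Hz value is illustrative rather than taken from a real system.

```python
import math


def phrase_curve(t: float, duration: float,
                 start_hz: float = 130.0, end_hz: float = 100.0) -> float:
    """Slow utterance-level trend, simplified here to a straight decline."""
    return start_hz + (end_hz - start_hz) * (t / duration)


def accent_curve(t: float, accents, width_s: float = 0.08) -> float:
    """Sum of Gaussian bumps, one per accent given as (center_s, height_hz)."""
    return sum(height * math.exp(-0.5 * ((t - center) / width_s) ** 2)
               for center, height in accents)


def f0_contour(duration: float, accents, frame_s: float = 0.01):
    """F0 in Hz at each frame: phrase component plus accent component."""
    n_frames = int(round(duration / frame_s)) + 1
    return [phrase_curve(i * frame_s, duration) + accent_curve(i * frame_s, accents)
            for i in range(n_frames)]


# One-second utterance with a single 20 Hz accent centered at 0.3 s.
contour = f0_contour(1.0, [(0.3, 20.0)])
print(len(contour), round(contour[30], 1), round(contour[-1], 1))
```

Because the components are additive, each one can be tuned or predicted separately, which is what made this family of models attractive before end-to-end training.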
Modern prosody prediction
End-to-end models internalize prosody prediction into the acoustic generation stage. The model learns from data which prosodic patterns correspond to which text features. This produces more natural prosody than hand-crafted rules, but it also means the model’s prosody is only as good as the training data — and it may not respect explicit user intent.
Some models add explicit prosody control:
- Chatterbox: expression or exaggeration conditioning can steer emotional delivery, depending on implementation
- Orpheus: Emotive tags (<laugh>, <sigh>, <gasp>) inject explicit prosodic events
- Qwen3-TTS: Instruction following for style (“speak excitedly”, “whisper”)
These work by conditioning the model on additional inputs during generation (conditioning vectors, special tokens, or instruction text), steering the implicit prosody prediction toward the desired target.
Stage 4: Acoustic Generation
This is the core of the TTS pipeline — the stage that transforms linguistic features into acoustic representations. The approach used here defines the architecture of the entire system.
Paradigm 1: Autoregressive Token-Based
Predict discrete audio tokens one step at a time, using a language model.
How it works:
Token stream: [T1] [T2] [T3] ... [Tn]
Autoregressive LM predicts each token conditioned on all previous tokens:
P(T2 | T1), P(T3 | T1, T2), ...
Audio codec decodes token sequence → waveform
Models: AudioLM, VALL-E, CosyVoice, Orpheus
Tokenizers: Audio codecs that discretize continuous audio into tokens:
| Codec | Bitrate | Tokens/frame | Frame rate | Sample rate |
|---|---|---|---|---|
| SNAC | ~0.98 kbps | 7 (1+2+4 across 3 levels) | ~12 Hz (L0) | 24 kHz |
| EnCodec | 1.5-24 kbps | 2-32 | 75 Hz | 24 kHz |
| DAC | 8-32 kbps | 4-12 | 50 Hz | 44.1 kHz |
| S3 (Chatterbox) | ~6.4 kbps | 8 | 25 Hz | 16 kHz |
| Qwen-TTS-12Hz | ~12 kbps | 16 | 12.5 Hz | 24 kHz |
Strengths:
- Natural prosody — the language model captures long-range acoustic patterns
- Streaming friendly — tokens produced one at a time
- Simple training — standard next-token prediction loss
Weaknesses:
- Autoregressive latency — must generate one token at a time
- Exposure bias — training with teacher forcing, inference with sampled tokens
- Repetition loops — the model can get stuck on repeating tokens
- Error accumulation — a bad early token corrupts everything after it
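The sampling step that autoregressive generation depends on can be sketched in plain Python. This is a minimal temperature plus nucleus (top-p) sampler; the function name `sample_top_p` is hypothetical, and real inference engines implement the same idea on batched tensors.

```python
import math
import random


def sample_top_p(logits, temperature=0.6, top_p=0.9, rng=random):
    """Sample a token index from the smallest set of tokens whose mass reaches top_p."""
    # Temperature scaling followed by a numerically stable softmax.
    scaled = [value / temperature for value in logits]
    peak = max(scaled)
    exps = [math.exp(value - peak) for value in scaled]
    total = sum(exps)
    probs = [value / total for value in exps]

    # Keep the most likely tokens until their cumulative probability reaches top_p.
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = [], 0.0
    for index in order:
        kept.append(index)
        cumulative += probs[index]
        if cumulative >= top_p:
            break

    # Sample within the kept set; random.choices renormalizes the weights.
    return rng.choices(kept, weights=[probs[i] for i in kept], k=1)[0]


print(sample_top_p([10.0, 0.0, 0.0]))  # one dominant logit leaves a single candidate
```

Lower temperature and smaller top_p make output more stable but flatter; higher values add variety and raise the risk of the loops and error accumulation listed above.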
Code example (Orpheus conceptual):
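# Autoregressive token generation, Orpheus-style (conceptual sketch; serving
# engines such as vLLM run an equivalent loop internally)
tokens = list(prompt_token_ids)  # tokenized text prompt
for step in range(max_tokens):
    logits = llm.forward(tokens)  # next-token logits for every position
    next_token = sample(logits[-1], temperature=0.6, top_p=0.9)
    tokens.append(next_token)
    if next_token == stop_token_id:
        break
# generated audio tokens → SNAC decoder → 24kHz audio
Paradigm 2: Flow Matching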
Learn a continuous vector field that transforms noise into the target acoustic distribution.
How it works:
Start: Gaussian noise z ~ N(0, I)
Learn: Vector field v(t, x) that pushes z toward the data distribution
Inference: Solve ODE from t=0 to t=1
x_0 = z
x_{t+dt} = x_t + v(t, x_t) * dt
x_1 ≈ target mel-spectrogram
Models: Chatterbox (S3Token2Mel), VoiceBox, NaturalSpeech 3, CosyVoice 2
Architecture details (Chatterbox):
# Chatterbox flow matching: 10-step Euler solver (simplified)
def solve_euler(self, x, t_span, mu, mask, spks, cond):
    n_timesteps = len(t_span) - 1  # 10 steps when t_span has 11 points
    for i in range(n_timesteps):
        dt = t_span[i + 1] - t_span[i]
        dxdt = self.estimator(x, t_span[i], mu, mask, spks, cond)
        x = x + dt * dxdt
    return x  # final mel-spectrogram
Strengths:
- High quality — flow matching produces excellent fidelity
- Controllable — conditioning vectors directly steer generation
- Stable training — no adversarial learning, no mode collapse
- Fixed step count: always N solver steps, regardless of input length (the cost of each step still grows with length)
Weaknesses:
- Slow inference — 10-50 ODE steps needed for quality
- Not streaming friendly — full spectrogram is generated before output
- Computationally heavier — multiple forward passes per utterance
Distillation (Turbo trick): Chatterbox Turbo uses distillation-style optimization to reduce the cost of multi-step generation, trading some quality for faster inference.
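To make the ODE view concrete, here is a toy Euler integration in plain Python. It assumes the vector field is known in closed form: the field of a straight-line path that ends at a given target. In a real system a trained network estimates that field, and `euler_flow` is a hypothetical name used only for this sketch.

```python
import random


def euler_flow(noise, target, n_steps=10):
    """Integrate dx/dt = v(t, x) from t=0 to t=1 with fixed-size Euler steps."""
    x = list(noise)
    for i in range(n_steps):
        t = i / n_steps
        dt = (i + 1) / n_steps - t
        # Toy vector field: points from the current state toward the target,
        # scaled so a straight-line path arrives exactly at t = 1.
        velocity = [(goal - value) / (1.0 - t) for value, goal in zip(x, target)]
        x = [value + dt * v for value, v in zip(x, velocity)]
    return x


rng = random.Random(0)
noise = [rng.gauss(0.0, 1.0) for _ in range(4)]   # z ~ N(0, I)
target = [0.5, -1.0, 2.0, 0.0]                    # stand-in for one mel frame
print(euler_flow(noise, target))
```

With this idealized field the path is a straight line, so even a single step lands on the target. A learned field is only approximately straight, which is why practical systems need several steps and why distillation aims to straighten or shortcut the trajectory.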
Paradigm 3: Diffusion
Iteratively denoise Gaussian noise into the target distribution.
How it works:
Training: Add noise to mel-spectrograms, learn to predict noise
Inference: Start from pure noise, iteratively denoise:
x_T ~ N(0, I) # pure noise
x_{T-1} = denoise(x_T)
...
x_0 = target spectrogram
Models: WaveGrad, DiffWave, Grad-TTS
Strengths:
- High quality — diffusion can produce strong generative fidelity
- Good coverage — doesn’t collapse to average outputs
Weaknesses:
- Slow — 50-1000 denoising steps at inference
- Computationally expensive — massive memory and FLOPs
- Largely replaced by flow matching in modern TTS
Paradigm 4: Parallel Acoustic Models (Non-Autoregressive)
Transform text directly to acoustic features in a single forward pass using alignment mechanisms.
How it works:
Text → [Encoder] → Hidden states → [Duration predictor] → Expanded states → [Decoder] → Mel-spectrogram
- Duration predictor estimates how long each input unit lasts
- The expanded states are then decoded to mel-spectrogram in parallel
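The expansion step, often called a length regulator, can be sketched in a few lines. The function name `length_regulate` and the example durations are illustrative; in a real model the hidden states are vectors and the durations come from the trained duration predictor.

```python
def length_regulate(hidden_states, durations):
    """Repeat each input state for as many output frames as its predicted duration."""
    if len(hidden_states) != len(durations):
        raise ValueError("need exactly one duration per hidden state")
    expanded = []
    for state, n_frames in zip(hidden_states, durations):
        expanded.extend([state] * n_frames)
    return expanded


# Three phonemes with predicted durations of 2, 5, and 3 frames.
frames = length_regulate(["k", "æ", "t"], [2, 5, 3])
print(len(frames), frames)
```

Because the output length is simply the sum of the durations, scaling every duration by a constant is an easy way to change speaking rate, and any duration error shows up directly as a timing error.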
Models: FastSpeech 1/2, StyleTTS 2, Glow-TTS
StyleTTS 2 specific approach: StyleTTS 2 uses AdaIN (Adaptive Instance Normalization) to separate content and style. A style encoder extracts prosodic style from a reference utterance, while the content encoder processes the text. The two streams combine through AdaIN layers to produce the output.
# StyleTTS 2 high-level architecture
content = content_encoder(text)
style = style_encoder(reference_audio)
# AdaIN: normalize content statistics to match style statistics
output = adain(content, style) # align mean and variance
Strengths:
- Fast — single forward pass, no iterative sampling
- Controllable style — explicit separation of content and style
- Good for its size — Kokoro-82M uses this paradigm
Weaknesses:
- Alignment is fragile — duration prediction errors cause robotic timing
- Less natural than autoregressive for very expressive speech
- Whole-utterance output: the duration predictor fixes the output length up front, so there is no natural streaming
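The AdaIN operation mentioned earlier in this section can be sketched on a single feature channel. This is the textbook form, written with the standard library only; it assumes the style branch supplies a target mean and standard deviation directly, whereas a real model derives them from the style vector with learned layers.

```python
from statistics import fmean, pstdev


def adain(content, style_mean, style_std, eps=1e-5):
    """Normalize content to zero mean and unit variance, then apply the style statistics."""
    mu = fmean(content)
    sigma = pstdev(content)
    return [style_std * (x - mu) / (sigma + eps) + style_mean for x in content]


# One channel of content features, restyled to mean 10 and standard deviation 2.
styled = adain([1.0, 2.0, 3.0], style_mean=10.0, style_std=2.0)
print(styled)
```

The content's shape over time is preserved, since the values keep their relative ordering and spacing, while its overall level and spread now come from the style input. That is the sense in which content and style are separated.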
Paradigm 5: LLM-Based (Unified Language Model)
Extend a large language model’s vocabulary with audio codec tokens and train on interleaved text+audio sequences.
How it works:
Vocabulary:
Original: [the, cat, sat, ...] + system tokens
Extended: [text_tokens ...] + [custom_token_1, custom_token_2, ..., custom_token_N]
Training example:
[BOS] tara: Hello world [EOS] [AUDIO_TOKEN_284] [AUDIO_TOKEN_156] ... [EOS]
Inference:
LLM generates interleaved text + audio tokens
Post-processing strips text tokens, decodes audio tokens
Models: Orpheus, ChatTTS, CosyVoice 2 (hybrid)
Strengths:
- Rich contextual understanding — leverages the full LLM’s linguistic capability
- Zero-shot generalization — heteronyms, code-switching, few-shot conditioning
- Streaming native — autoregressive by design
- Full Llama ecosystem — vLLM, llama.cpp, LoRA, GGUF
Weaknesses:
- Large models: typically billions of parameters (Orpheus is 3B), so a GPU is usually required
- Limited explicit prosody control: emotive tags are coarse approximations
- Training cost — must train the entire LLM on speech data
- Repetition/stability issues — need careful sampling parameters (repetition_penalty >= 1.1)
Stage 5: Waveform Synthesis (Vocoding)
The final stage converts acoustic representations into raw audio waveforms. The vocoder determines the fine-grained audio quality — sample-level fidelity, noise characteristics, and spectral detail.
GAN-Based Vocoders
HiFi-GAN (used by Chatterbox, CosyVoice 2):
HiFi-GAN is the most widely used vocoder in modern TTS. It consists of:
- Generator: Transposed convolution blocks that upsample mel-spectrogram frames to waveform samples
- Multi-Period Discriminator (MPD): Checks for periodicity at different time scales — catches repeating artifacts
- Multi-Scale Discriminator (MSD): Checks at different audio resolutions — catches broadband noise
# HiFi-GAN generator upsampling path (conceptual; rates from the V1 configuration)
mel → Conv1D → transposed_conv1 (×8) → transposed_conv2 (×8) →
transposed_conv3 (×2) → transposed_conv4 (×2) → Conv1D + tanh → waveform
# Total: 8×8×2×2 = 256× upsampling from mel frame rate to audio sample rate
Training uses a combination of losses:
- Adversarial loss: Generator tries to fool discriminators
- Mel-spectrogram loss: L1 between generated and target mel
- Feature matching loss: L1 between discriminator feature maps
MelGAN: Earlier GAN vocoder, simpler discriminators, lower quality than HiFi-GAN.
ISTFT-Based
ISTFTNet (used by Kokoro):
Inverse Short-Time Fourier Transform networks learn to predict both magnitude and phase from mel-spectrograms, then apply the ISTFT analytically:
# ISTFTNet approach (conceptual)
mel → [Conv stack] → magnitude_spectrum, phase_spectrum
complex_spectrum = magnitude_spectrum * exp(i * phase_spectrum)
waveform = istft(complex_spectrum) # Analytical, no learned parameters
Why it matters: ISTFTNet replaces the final upsampling layers of a HiFi-GAN-style generator with an analytic ISTFT. The ISTFT layer is deterministic and non-trainable, so the network only has to predict a low-resolution magnitude and phase spectrum rather than every waveform sample. The generator is still typically trained with adversarial losses like HiFi-GAN's, but it is smaller and inference is very fast.
Neural Codec Decoders
SNAC decoder (used by Orpheus):
The SNAC decoder reconstructs audio from discrete codebook indices. It takes the 3-layer hierarchical codes and produces 24kHz audio through:
codes_0, codes_1, codes_2 → [SNAC decoder] → 24kHz waveform
The decoder is a separate pre-trained model (hubertsiuzdak/snac_24khz) that remains frozen during TTS training. The TTS model only learns to produce the codes.
EnCodec decoder, DAC decoder: Similar principle — a frozen, pre-trained decoder that turns codec tokens into audio.
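For hierarchical codecs, the flat token stream has to be split back into per-level code sequences before decoding. The sketch below assumes 7 tokens per frame with one coarse, two middle, and four fine codes. The slot order inside a frame is an assumption made for illustration, so check the model's own decoding code for the real layout and any per-slot ID offsets.

```python
def deinterleave(frame_tokens):
    """Split a flat stream of 7-token frames into three code levels (1 + 2 + 4 per frame)."""
    if len(frame_tokens) % 7 != 0:
        raise ValueError("expected a whole number of 7-token frames")
    codes_0, codes_1, codes_2 = [], [], []
    for start in range(0, len(frame_tokens), 7):
        frame = frame_tokens[start:start + 7]
        # Assumed slot layout: [c0, c1, c2, c2, c1, c2, c2]
        codes_0.append(frame[0])
        codes_1.extend([frame[1], frame[4]])
        codes_2.extend([frame[2], frame[3], frame[5], frame[6]])
    return codes_0, codes_1, codes_2


codes_0, codes_1, codes_2 = deinterleave(list(range(14)))
print(codes_0, codes_1, codes_2)
```

Each level runs at twice the rate of the one above it, which is why the coarse level contributes one code per frame and the fine level four.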
Traditional Vocoders (Legacy)
| Vocoder | Method | Quality | Speed |
|---|---|---|---|
| Griffin-Lim | Iterative phase reconstruction from magnitude | Low | Slow |
| WORLD | F0 + spectral envelope + aperiodicity | Medium | Fast |
| WaveNet | Autoregressive sample-level prediction | High | Very slow |
Griffin-Lim is still used as a fallback in some pipelines because it has no learned parameters — it works from any magnitude spectrogram.
How Real Models Map to This Pipeline
| Stage | Kokoro-82M | Orpheus-3B | Chatterbox-500M | Qwen3-TTS-1.7B |
|---|---|---|---|---|
| Normalization | Misaki G2P (built-in) | Expects clean input | Minimal internal | Minimal internal |
| G2P | Misaki (phonemes) | Implicit (LLM learned) | Implicit (S3 tokens) | Implicit (codec tokens) |
| Prosody | Implicit (AdaIN style) | Implicit (LLM) | Explicit (emotion vector) | Instruction-based |
| Acoustic gen | StyleTTS 2 (parallel) | Llama AR (token AR) | Llama AR + CFM | Dual-track AR |
| Acoustic output | Mel-spectrogram | SNAC codes | S3 tokens → Mel | 12Hz codec codes |
| Vocoder | ISTFTNet | SNAC decoder | HiFi-GAN | Causal ConvNet |
| Streaming | No | Yes, implementation-dependent | Chunked | Yes, implementation-dependent |
| Explicit G2P? | Yes | No | No | No |
A common pattern is that smaller models often use explicit G2P (Kokoro with Misaki), while larger token-based models often learn more of the mapping implicitly (Orpheus, Qwen3-TTS, Chatterbox). Explicit modules make some pronunciation errors easier to inspect and patch; implicit models can handle more context but are harder to debug.
End-to-End Trace: One Sentence Through Two Pipelines
Sentence: “He read about the API in 2024.”
Through Kokoro (explicit stages)
Step 1 — Text Normalization:
Input: "He read about the API in 2024."
Normalization rules apply:
- "read" → context check → /ɹɛd/ (past tense, "read about")
- "API" → initialism → /eɪ pi aɪ/
- "2024" → year → "twenty twenty-four"
Output: "He read about the API in twenty twenty-four."
Step 2 — G2P (Misaki):
Phonemes: /hi ɹɛd əˈbaʊt ðə eɪ pi aɪ ɪn ˈtwɛnti ˈtwɛnti ˈfɔːɹ/
Stress markers on "twɛnti" and "fɔːɹ"
Step 3 — Acoustic Generation (StyleTTS 2):
Phonemes + voice embedding → StyleTTS 2 encoder
→ AdaIN with style vector → mel-spectrogram (80-channel, ~50 Hz)
Step 4 — Vocoding (ISTFTNet):
Mel-spectrogram → predict STFT magnitude + phase
→ ISTFT → 24kHz PCM waveform
Through Orpheus (implicit, LLM-based)
Step 1 — Prompt Formatting:
Raw text wrapped in prompt template:
"tara: He read about the API in 2024."
(No explicit normalization — the LLM is expected to handle it)
Step 2 — Tokenization:
Llama tokenizer encodes the text prompt:
[BOS] tara: He read about the API in 2024. [EOS]
Step 3 — Autoregressive Generation:
Llama-3.2-3B predicts next-token distribution at each step:
Step 1: P(custom_token_28431 | prompt) ← selects coarse SNAC token
Step 2: P(custom_token_15672 | prompt, t28431)
...
Step N: P(EOS | ...)
The model must internalize:
- Past tense reading of "read" → appropriate phoneme sequence
- "API" as letter names, not the word "appy"
- "2024" as a year, not a number
- Appropriate prosody for a declarative statement
Step 4 — SNAC Decoding:
Token sequence windowed into 28-token chunks (four 7-token frames)
→ Deinterleaved into codes_0, codes_1, codes_2
→ SNAC decoder → 24kHz PCM
The Kokoro trace is more inspectable because the G2P output is explicit: many pronunciation errors can be addressed by changing the text or pronunciation handling. The Orpheus trace is end-to-end learned, so pronunciation is less transparent and may require prompting, respelling, fine-tuning, or model changes.
Where Pipeline Decisions Affect Quality
Different stages dominate different quality dimensions:
| Dimension | Dominant Stage | Why |
|---|---|---|
| Intelligibility | G2P (Stage 2) | Wrong phonemes produce wrong words |
| Naturalness | Acoustic + Prosody (Stages 3-4) | Robotic prosody sounds artificial even with perfect G2P |
| Audio Fidelity | Vocoder (Stage 5) | The vocoder determines sample-level quality |
| Pacing | Prosody + Acoustic (Stages 3-4) | Duration prediction controls speaking rate |
| Expressiveness | Prosody + Acoustic (Stages 3-4) | Pitch range, stress, and emotion come from here |
| Consistency | Normalization (Stage 1) | Inconsistent text preprocessing produces inconsistent output |
| Latency | Acoustic (Stage 4) | Autoregressive vs parallel vs iterative determines speed |
| Pronunciation | G2P (Stage 2) | Heteronyms, names, loanwords — all G2P decisions |
Practical implications
If the word is wrong: The problem is often in G2P (or the implicit equivalent). In explicit G2P systems (Kokoro), you may be able to fix this by adding dictionary entries or phonetically respelling the word. In implicit systems (Orpheus, Qwen3-TTS), fixes usually rely on prompting, respelling, fine-tuning, or model changes.
If the voice sounds robotic: The problem is in the acoustic model or prosody. Parallel models (Kokoro) tend toward flatter prosody. Autoregressive models (Orpheus) produce more natural prosody but can loop or hallucinate. Flow matching (Chatterbox) sits between them.
If the audio has artifacts/static: The problem is in the vocoder. GAN vocoders can produce artifacts on out-of-distribution inputs. Neural codec decoders (SNAC) tend to be more stable, but output is capped at the codec's sample rate (24kHz for the SNAC model Orpheus uses, versus 44.1kHz for some alternatives).
If the output is too slow or too fast: This can come from duration prediction (Kokoro, parallel models), repetition penalty settings (Orpheus), or inference hyperparameters.
Summary
The TTS pipeline is a sequence of five transformations:
| Stage | Transformation | Approaches |
|---|---|---|
| 1. Normalization | Raw text → Clean text | Rules, neural, hybrid |
| 2. G2P | Text → Phonemes | Dictionary, rule, neural, implicit (LLM) |
| 3. Prosody | Phonemes → Duration/pitch/stress | Explicit prediction, implicit (learned) |
| 4. Acoustic Generation | Features → Acoustic representation | AR tokens, flow matching, diffusion, parallel, LLM |
| 5. Vocoding | Acoustic → Waveform | HiFi-GAN, ISTFTNet, codec decoder |
No single approach dominates across all dimensions. The tradeoff space is defined by where each model puts the burden:
- Kokoro puts the burden on explicit linguistic stages (normalization, G2P) and a fast parallel acoustic model — easy to fix pronunciation, harder to achieve natural prosody
- Orpheus puts much of the burden on the LLM — natural prosody can emerge from scale, but pronunciation is less transparent than in explicit G2P systems
- Chatterbox splits it between a Llama-based token predictor and a flow-matching mel decoder — expression can be steered through conditioning
- Qwen3-TTS uses a dual-track LM feeding a lightweight codec decoder — designed for low-latency streaming at the cost of a more complex training setup
The choice of pipeline architecture determines the failure modes, the fixability, and the deployment characteristics of the system — often more than the parameter count or the training data size.
