
Text-to-Speech Technology Explained: From Waveforms to Neural Networks

A comprehensive technical explainer of modern text-to-speech: concatenative synthesis, parametric models, neural end-to-end architectures, LLM-based TTS, flow matching, voice cloning, vocoders, evaluation benchmarks, and local vs cloud tradeoffs.

Published on Apr 16, 2026 · 15 min read

Text-to-speech is one of those technologies that feels like magic until you understand the pipeline. Then it feels like elegant engineering.

The goal is simple: take written text and produce audible speech that sounds natural. The path from text to waveform has gone through three broad eras, each building on the insights of the previous one. Today, some local models can produce speech that sounds natural enough for practical voiceover and drafting workflows, using architectures that barely existed five years ago.

This article explains how TTS works from the ground up — the signal processing foundations, the neural architectures that replaced them, the modern hybrid systems, and the practical tradeoffs that determine which approach to use.

What TTS Actually Does

At the most abstract level, TTS is a sequence-to-sequence problem: given a sequence of characters or phonemes, produce a sequence of audio samples that a human listener perceives as speech.

The difficulty is that this mapping is massively one-to-many. The same sentence can be spoken in infinite ways — different pitches, rhythms, emphases, emotions, accents. A TTS system must choose one plausible rendering and make it sound like it was produced by a human vocal tract.

The Three Eras of TTS

Era 1: Concatenative Synthesis (1970s–2000s)

Concatenative TTS worked by stitching together pre-recorded segments of a human voice. A studio recording was segmented into phones (the smallest units of speech), diphones (transitions between phones), or sometimes entire syllables. At runtime, the system selected the appropriate units from the database and concatenated them with signal processing to smooth the joins.

The advantage was naturalness — the raw audio was literally a human voice. The disadvantages were severe: each voice required hours of studio recording, the voice could not produce new emotions or speaking styles, and the joining artifacts (choppiness at unit boundaries) were often audible.

Among the best-known concatenative systems was Festival, an open-source framework that served as a standard research platform for over a decade. Commercial applications used proprietary systems like Nuance Vocalizer and AT&T Natural Voices.

Era 2: Parametric Synthesis (1990s–2010s)

Parametric synthesis replaced audio snippets with a mathematical model of the vocal tract. Instead of storing recordings, the system stored statistical parameters that described how the vocal tract moved — fundamental frequency (F0), formant positions, voicing state, and spectral envelope.

At runtime, a vocoder (short for “voice coder”) converted these parameters into audio. The most popular parametric approach was HMM-based synthesis, typified by HTS (HMM-based Speech Synthesis System). Text was converted to a context-dependent phoneme sequence, decision trees selected the HMM state parameters for each context, and the vocoder generated audio from those parameters.

The advantage was flexibility — you could change speaking rate, pitch, and emphasis by adjusting parameters. The disadvantage was quality: parametric speech had the telltale “buzzy” or “muffled” quality that distinguished it from human speech.

Era 3: Neural TTS (2016–present)

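Neural TTS began in 2016 with the publication of WaveNet by DeepMind. WaveNet showed that a deep autoregressive neural network could generate raw waveforms sample by sample and, in listening tests, outperform the best concatenative and parametric systems of the time. The original model was conditioned on linguistic features from an existing TTS front end rather than trained end-to-end on raw text.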

This was the inflection point. Within three years, the research community had moved almost entirely to neural approaches. The key insight was that neural networks could learn the complex, non-linear mapping from linguistic features to acoustic features — a task that hand-crafted parametric models could only approximate.

The Modern TTS Pipeline

Most neural TTS systems in production today share a common high-level structure, even if the specific implementations vary widely.

Text → Text Normalization → G2P → Acoustic Model → Vocoder → Waveform

1. Text Normalization

Raw text is messy. Before synthesis, the system must normalize the input:

  • Expand abbreviations (“Dr.” → “Doctor”, “St.” → “Street” or “Saint”)
  • Convert numbers (“$42.50” → “forty-two dollars and fifty cents”)
  • Expand symbols (“&” → “and”, “%” → “percent”)
  • Handle dates, times, URLs, email addresses
  • Strip formatting like Markdown or HTML tags

Text normalization is harder than it looks. Consider “I live on St. Paul St.” — the first “St.” is “Saint”, the second is “Street”. Disambiguating these requires either hand-crafted rules or, increasingly, a small language model trained for the task.
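A rule-based normalizer for the “St.” example can be sketched in a few lines. This is a minimal illustration, not a production front end: the rule tables are hypothetical and tiny, and the “St.” heuristic (read it as “Saint” when a capitalized word follows, otherwise “Street”) fails on inputs such as a sentence that ends in “St.” and is followed by another sentence.

```python
import re

# Hypothetical, deliberately tiny rule tables; real systems use far larger ones.
ABBREVIATIONS = {"Dr.": "Doctor", "Mr.": "Mister"}
SYMBOLS = {"&": "and", "%": "percent"}


def expand_st(text: str) -> str:
    """Heuristic: 'St.' before a capitalized word is 'Saint', otherwise 'Street'."""
    text = re.sub(r"\bSt\.(?=\s+[A-Z])", "Saint", text)
    return re.sub(r"\bSt\.", "Street", text)


def normalize(text: str) -> str:
    """Apply the 'St.' heuristic, then abbreviation and symbol expansion."""
    text = expand_st(text)
    for abbreviation, expansion in ABBREVIATIONS.items():
        text = text.replace(abbreviation, expansion)
    for symbol, word in SYMBOLS.items():
        text = text.replace(symbol, f" {word} ")
    # Collapse the extra spaces introduced by symbol expansion.
    return re.sub(r"\s+", " ", text).strip()


print(normalize("I live on St. Paul St."))
print(normalize("Dr. Lee & 5%"))
```

Note that the expansion swallows the sentence-final period in the first example, which is one reason real normalizers track token boundaries rather than editing the raw string.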

2. Grapheme-to-Phoneme Conversion

Most neural TTS systems do not read from raw characters. They first convert text to phonemes — the minimal units of sound that distinguish words in a language.

For English, “cat” becomes the phoneme sequence /k æ t/. The mapping from letters to phonemes is non-trivial: “rough” → /r ʌ f/, “though” → /ð oʊ/, “through” → /θ r uː/. Phonemization engines like espeak-ng or Misaki handle this with pronunciation dictionaries and rule-based fallbacks.
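The dictionary-plus-fallback idea can be sketched as follows. This is a toy illustration and not how espeak-ng or Misaki are implemented: the lexicon holds only the four words from the paragraph above, and the letter rules are a hypothetical stand-in for a real rule set.

```python
from typing import Dict, List

# Toy lexicon with the pronunciations quoted in the text.
LEXICON: Dict[str, List[str]] = {
    "cat": ["k", "æ", "t"],
    "rough": ["r", "ʌ", "f"],
    "though": ["ð", "oʊ"],
    "through": ["θ", "r", "uː"],
}

# Hypothetical one-letter fallback rules; unknown letters pass through unchanged.
LETTER_RULES: Dict[str, str] = {
    "a": "æ", "c": "k", "e": "ɛ", "i": "ɪ", "o": "ɒ", "u": "ʌ",
}


def g2p(word: str) -> List[str]:
    """Look the word up in the lexicon, else fall back to letter-by-letter rules."""
    key = word.lower()
    if key in LEXICON:
        return list(LEXICON[key])
    return [LETTER_RULES.get(letter, letter) for letter in key]


print(g2p("Though"))  # lexicon hit
print(g2p("bat"))     # fallback happens to work
print(g2p("tough"))   # fallback produces a wrong pronunciation
```

The last call shows why the lexicon matters: naive letter rules cannot recover the irregular “ough” spellings.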

Some end-to-end models skip explicit phonemization and learn grapheme-to-speech mappings directly. This simplifies the pipeline but can introduce pronunciation errors that would be caught by a phoneme-based approach.

3. Acoustic Model

The acoustic model is the core of the system. It takes the phoneme sequence and produces an acoustic representation — typically a mel-spectrogram or a discrete speech token sequence.

The mel-spectrogram is a time-frequency representation that mimics human auditory perception. The time axis operates at approximately 50-100 frames per second, and the frequency axis uses 80-128 mel bands. The bands are spaced on the mel scale, which is roughly linear below about 1 kHz and logarithmic above it, to approximate the human ear’s non-linear frequency response.
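To make the band spacing concrete, here is a small sketch that places mel band centers using the common HTK-style formula mel = 2595 · log10(1 + f / 700). Other mel variants exist, and the defaults (80 bands, 0 to 12,000 Hz, the Nyquist frequency for 24 kHz audio) are illustrative assumptions rather than values from any particular model.

```python
import math
from typing import List


def hz_to_mel(hz: float) -> float:
    """HTK-style mel scale."""
    return 2595.0 * math.log10(1.0 + hz / 700.0)


def mel_to_hz(mel: float) -> float:
    """Inverse of hz_to_mel."""
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)


def mel_band_centers(n_mels: int = 80, f_min: float = 0.0,
                     f_max: float = 12000.0) -> List[float]:
    """Center frequencies (Hz) of bands spaced evenly on the mel scale."""
    low, high = hz_to_mel(f_min), hz_to_mel(f_max)
    step = (high - low) / (n_mels + 1)
    return [mel_to_hz(low + step * (i + 1)) for i in range(n_mels)]


centers = mel_band_centers()
print(f"lowest band spacing:  {centers[1] - centers[0]:.1f} Hz")
print(f"highest band spacing: {centers[-1] - centers[-2]:.1f} Hz")
```

The printed spacings show that the bands are packed tightly at low frequencies and spread out at high frequencies.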

Modern acoustic model architectures include:

  • Tacotron 2: An encoder-attention-decoder architecture with a location-sensitive attention mechanism that aligns phonemes to mel frames.

  • FastSpeech: A non-autoregressive model that predicts duration explicitly and generates mel frames in parallel using feed-forward Transformer blocks.

  • VITS: An end-to-end variational inference architecture that combines the acoustic model and vocoder into a single trained system.

  • LLM-based: Models like Orpheus and Qwen3-TTS replace the dedicated acoustic model with a causal language model that predicts speech token sequences.

4. Vocoder

The vocoder converts the acoustic representation into a raw waveform. This is a dense-to-dense generation problem: the mel-spectrogram has 80-128 values per frame at roughly 50-100 frames per second, while the target waveform at a 24 kHz output rate has 24,000 samples per second. Along the time axis, the vocoder therefore upsamples by the hop size, roughly 240-480x at these frame rates.
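The time-axis upsampling factor is simply the number of audio samples per mel frame (the hop size). A quick arithmetic sketch, with the frame rates chosen as illustrative assumptions:

```python
def vocoder_upsample_factor(sample_rate_hz: int, frames_per_second: float) -> float:
    """Audio samples the vocoder must produce for each input mel frame."""
    if frames_per_second <= 0:
        raise ValueError("frames_per_second must be positive")
    return sample_rate_hz / frames_per_second


for fps in (50.0, 93.75, 100.0):
    factor = vocoder_upsample_factor(24_000, fps)
    print(f"{fps:6.2f} frames/s -> {factor:.0f} samples per frame")
```

A frame rate of 93.75 frames per second corresponds to a hop size of 256 samples at 24 kHz, a common configuration.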

Modern neural vocoders include:

  • WaveNet: The original autoregressive vocoder that generates one sample at a time, conditioned on the mel-spectrogram. High quality but slow.

  • HiFi-GAN: A GAN-based vocoder with multi-scale and multi-period discriminators. Fast enough for real-time use on GPU.

  • BigVGAN: An improved GAN vocoder with anti-aliased multi-periodicity modeling and snake activations. It is one of several strong modern vocoder families used in research and production systems.

  • HiFT-GAN: Used in CosyVoice and Chatterbox, optimized for the 24kHz output rate with transposed convolution blocks and F0 conditioning.

End-to-End Architectures

The modular pipeline has a practical advantage: each component can be trained, evaluated, and swapped independently. End-to-end models have nonetheless gained traction because they reduce the information loss that hand-off representations such as mel-spectrograms introduce between stages.

VITS

VITS (Variational Inference with adversarial learning for end-to-end Text-to-Speech) is a single model that takes text as input and produces waveform as output, end-to-end. It uses a conditional variational autoencoder whose latent variables carry the acoustic information. A normalizing flow makes the text-conditioned prior over those latents more expressive, and a HiFi-GAN-style decoder converts the latents directly to a waveform, with no intermediate mel-spectrogram at inference time.

The key innovation is joint training. Because the entire system is trained on text-audio pairs, it avoids the mismatch found in pipelined systems, where the vocoder is trained on ground-truth spectrograms but receives predicted ones at inference and errors compound across stages.

FastSpeech 2

FastSpeech 2 is a non-autoregressive model that predicts mel-spectrograms in parallel. It uses a duration predictor to determine how many mel frames each phoneme should occupy, and a variance adaptor that modulates pitch and energy.

The advantage of non-autoregressive generation is speed — FastSpeech 2 can generate an entire utterance in a single forward pass and is often much faster than autoregressive models like Tacotron 2.
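The step that makes parallel generation possible is the length regulator: once a duration (in frames) is predicted for each phoneme, each phoneme's hidden vector is repeated that many times, which fixes the output length up front. A minimal sketch, using plain lists in place of tensors and made-up durations:

```python
from typing import List, Sequence


def length_regulate(phoneme_vectors: Sequence[Sequence[float]],
                    durations: Sequence[int]) -> List[List[float]]:
    """Repeat each phoneme vector durations[i] times to get one vector per frame."""
    if len(phoneme_vectors) != len(durations):
        raise ValueError("need exactly one duration per phoneme")
    frames: List[List[float]] = []
    for vector, duration in zip(phoneme_vectors, durations):
        if duration < 0:
            raise ValueError("durations must be non-negative")
        frames.extend(list(vector) for _ in range(duration))
    return frames


# Three phonemes with hypothetical 1-dimensional hidden states and durations.
hidden = [[1.0], [2.0], [3.0]]
frames = length_regulate(hidden, [2, 0, 3])
print(frames)
```

Because the frame count is known before decoding, all mel frames can be produced in one forward pass instead of one step at a time.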

VALL-E

Microsoft’s VALL-E was one of the first widely noted TTS systems built on a language modeling paradigm. It treats speech as a language: text is the prompt, and the model generates discrete audio codec tokens from the EnCodec neural codec, predicting the first codebook autoregressively and the remaining codebooks with a non-autoregressive model.

VALL-E showed that voice cloning from a 3-second reference was possible without fine-tuning, simply by conditioning the language model on the reference tokens. This insight spawned an entire family of LLM-based TTS models.

LLM-Based TTS: Speech as a Language

The biggest architectural shift in TTS over the past two years has been the adoption of language model architectures for speech generation. Instead of a dedicated acoustic model with attention mechanisms designed for alignment, these systems use a causal Transformer that predicts speech tokens the same way a text LLM predicts text tokens.

How It Works

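  1. Audio tokenization: A neural codec (like EnCodec or Qwen3-TTS’s codec) converts continuous audio into discrete tokens at a reduced frame rate, from roughly 12 to 75 frames per second depending on the codec. Each token indexes a codebook whose size ranges from about a thousand entries (EnCodec uses 1,024 per codebook) to 8,192 or 16,384 in some newer codecs.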

  2. Interleaved training: The model is trained on interleaved text and audio token sequences, learning the joint distribution of written and spoken language.

  3. Autoregressive generation: At inference, the text is fed as a prompt, and the model generates audio tokens one step at a time, with each step conditioned on both the text and previously generated audio tokens.

  4. Audio decoding: The generated token sequence is passed back through the codec decoder to produce a waveform.
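The generation step (step 3) can be sketched as a plain decoding loop. Everything here is a stand-in: `toy_scores` is a hypothetical replacement for a trained Transformer, the end-of-speech id is an assumption, and greedy decoding is used for simplicity where real systems usually sample.

```python
from typing import Callable, List, Sequence

EOS = 0  # hypothetical end-of-speech token id

ScoreFn = Callable[[Sequence[int], Sequence[int]], List[float]]


def generate_audio_tokens(text_ids: Sequence[int], next_token_scores: ScoreFn,
                          max_steps: int = 200) -> List[int]:
    """Greedy autoregressive loop: each step sees the text and all prior audio tokens."""
    audio: List[int] = []
    for _ in range(max_steps):
        scores = next_token_scores(text_ids, audio)
        token = max(range(len(scores)), key=scores.__getitem__)
        if token == EOS:
            break
        audio.append(token)
    return audio


def toy_scores(text_ids: Sequence[int], audio_ids: Sequence[int]) -> List[float]:
    """Stand-in model: emits three audio tokens per text token, then EOS."""
    vocab_size = 8
    scores = [0.0] * vocab_size
    if len(audio_ids) >= 3 * len(text_ids):
        scores[EOS] = 1.0
    else:
        scores[1 + len(audio_ids) % (vocab_size - 1)] = 1.0
    return scores


print(generate_audio_tokens([5, 6], toy_scores))
print(generate_audio_tokens([5, 6], toy_scores, max_steps=4))  # cut off early
```

The second call hits `max_steps` before the model emits the end token, which is the mechanism behind truncated output when a token budget is too small for the text.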

Examples

  • Orpheus TTS (3B parameters): Built on a Llama backbone, generates speech tokens autoregressively with emotion tags embedded in the text.
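
| Feature       | Orpheus                      | Qwen3-TTS                | VALL-E 2               |
|---------------|------------------------------|--------------------------|------------------------|
| Parameters    | 3B                           | 0.6B / 1.7B              | ~1B                    |
| Codec type    | Single-codebook HuBERT-based | 12-codebook              | Multi-codebook EnCodec |
| Frame rate    | ~50 Hz                       | 12 Hz                    | 75 Hz                  |
| Voice cloning | Zero-shot                    | Zero-shot (3s)           | Zero-shot              |
| Languages     | 1 (English)                  | 10+                      | 1 (English)            |
| Streaming     | Implementation-dependent     | Implementation-dependent | Yes                    |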

Pros and Cons

LLM-based TTS can produce highly natural speech, especially for expressive or emotional content. The autoregressive nature can capture prosodic variation that non-autoregressive models may smooth out. The tradeoffs are higher computational cost, larger model sizes, and similar generation risks to text LLMs — the model can occasionally produce unexpected sounds or truncated output.

Flow Matching and Diffusion

Another major architectural family uses flow matching or diffusion to generate mel-spectrograms by iteratively denoising a random starting point.

CosyVoice and Chatterbox both use conditional flow matching in their S3Token2Mel decoders. The idea is elegant: start from Gaussian noise, then follow a learned vector field that transports the noise distribution to the target mel-spectrogram distribution.

Flow matching decoders often require multiple steps for high-quality output. Rectified flow and distillation techniques can reduce the number of steps and make real-time inference more practical on consumer hardware.

One advantage over autoregressive generation is stability — flow models are less exposed to token-by-token repetition loops. The tradeoff is that they can require more total computation through multiple forward passes, though this is increasingly mitigated by distillation.
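The sampling procedure amounts to integrating an ordinary differential equation from t = 0 (noise) to t = 1 (data). The sketch below uses fixed-step Euler integration. In place of a trained network it uses a closed-form toy field, the conditional field for straight-line paths to a single known target, which is an assumption made purely so the example runs without a model.

```python
import random
from typing import Callable, List, Sequence

VectorField = Callable[[Sequence[float], float], List[float]]


def euler_sample(vector_field: VectorField, x0: Sequence[float],
                 num_steps: int) -> List[float]:
    """Integrate dx/dt = v(x, t) from t=0 to t=1 with fixed-step Euler."""
    x = list(x0)
    dt = 1.0 / num_steps
    for step in range(num_steps):
        t = step * dt  # t stays below 1, so the toy field never divides by zero
        velocity = vector_field(x, t)
        x = [xi + dt * vi for xi, vi in zip(x, velocity)]
    return x


def make_toy_field(target: Sequence[float]) -> VectorField:
    """Straight-line transport toward one fixed target; a trained network would replace this."""
    def field(x: Sequence[float], t: float) -> List[float]:
        return [(yi - xi) / (1.0 - t) for xi, yi in zip(x, target)]
    return field


rng = random.Random(0)
target = [0.5, -1.0, 2.0]  # stands in for a few mel-spectrogram values
noise = [rng.gauss(0.0, 1.0) for _ in target]
print(euler_sample(make_toy_field(target), noise, num_steps=4))
```

With this perfectly straight toy field, even a handful of steps lands on the target. Learned fields are curved, which is why real decoders need more steps and why rectified flow, which straightens the paths, reduces the step count.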

Voice Cloning

Voice cloning is the ability to synthesize speech in a voice that was not seen during training, using only a short reference audio sample.

Speaker Adaptation vs. Zero-Shot

  • Speaker adaptation: Fine-tune the model on a small set of reference samples (30 seconds to 5 minutes). Produces high-fidelity clones but requires compute for fine-tuning.

  • Zero-shot cloning: Condition the model on a reference audio sample at inference time without any training. Lower clone fidelity but requires no per-voice compute.

How Zero-Shot Cloning Works

Modern zero-shot cloning uses three conditioning signals from the reference audio:

  1. Speech token conditioning: A neural codec tokenizes the first 3-6 seconds of reference audio. These tokens are prepended to the text tokens as a “prompt” that tells the model the acoustic characteristics of the target voice.

  2. Speaker embedding: A speaker encoder (like WeSpeaker CAMPPlus) extracts a fixed-dimensional vector representing the speaker’s identity. This is injected into the model through cross-attention or adaptive layer norm.

  3. Mel conditioning: The reference audio is converted to mel-spectrograms and used to condition the vocoder or flow decoder, providing fine-grained acoustic context.

The quality of zero-shot cloning depends heavily on the reference audio quality. Clean, 3-10 second recordings in the target language produce good results. Noisy, short, or cross-lingual references degrade clone quality.

Evaluation: How TTS Quality Is Measured

MOS (Mean Opinion Score)

The traditional metric for TTS quality is the Mean Opinion Score (MOS), where listeners rate speech samples on a 1-5 scale. Scores above 4.0 are often treated as strong, but results vary by test design, language, prompt set, and listener panel.

MOS has well-known limitations: it measures naturalness in isolation, not suitability for specific applications. A high-MOS model can still fail in production if it produces artifacts under specific conditions.
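MOS results are usually reported with a confidence interval, since a small listener panel can move the mean noticeably. A minimal sketch using a normal approximation (z = 1.96 for 95%); the ratings are made-up illustration data, not results for any real system.

```python
import math
import statistics
from typing import Sequence, Tuple


def mos_with_ci(ratings: Sequence[float], z: float = 1.96) -> Tuple[float, float]:
    """Return (mean opinion score, half-width of the confidence interval)."""
    if len(ratings) < 2:
        raise ValueError("need at least two ratings")
    mean = statistics.fmean(ratings)
    standard_error = statistics.stdev(ratings) / math.sqrt(len(ratings))
    return mean, z * standard_error


hypothetical_ratings = [4, 5, 4, 3, 4]
mean, half_width = mos_with_ci(hypothetical_ratings)
print(f"MOS = {mean:.2f} ± {half_width:.2f}")
```

With only five ratings the interval is wide, which is why two systems whose MOS values differ by a tenth of a point are often statistically indistinguishable.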

TTS Arena

The TTS Arena is one useful public benchmark for comparing TTS systems. Like the Chatbot Arena for LLMs, it runs blind side-by-side comparisons where human listeners choose which sample sounds better. The results are compiled into an Elo rating.

Public leaderboards suggest that the gap between open-weight and commercial TTS systems has narrowed, but exact rankings and Elo gaps are time-sensitive and should be checked against the current leaderboard before making production decisions.
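For readers unfamiliar with Elo, the textbook update rule after one pairwise vote looks like this. The K-factor of 32 and the starting ratings are conventional illustrative values, and this sketch does not claim to reproduce the exact rating procedure any particular leaderboard uses.

```python
from typing import Tuple


def expected_score(rating_a: float, rating_b: float) -> float:
    """Probability that A is preferred over B under the Elo model."""
    return 1.0 / (1.0 + 10.0 ** ((rating_b - rating_a) / 400.0))


def update_elo(rating_a: float, rating_b: float, a_won: bool,
               k: float = 32.0) -> Tuple[float, float]:
    """Return both ratings after one blind comparison."""
    actual = 1.0 if a_won else 0.0
    delta = k * (actual - expected_score(rating_a, rating_b))
    return rating_a + delta, rating_b - delta


print(update_elo(1500.0, 1500.0, a_won=True))
print(update_elo(1700.0, 1500.0, a_won=False))  # an upset moves ratings more
```

An upset (the lower-rated system winning) shifts ratings further than an expected result, which is how the ranking converges as votes accumulate.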

Common Failure Modes

| Issue                | Cause                                                 | Detection                 |
|----------------------|-------------------------------------------------------|---------------------------|
| Muffled speech       | Vocoder artifacts, low bitrate                        | Listening test            |
| Pronunciation errors | G2P failures, out-of-vocabulary words                 | Test with names, acronyms |
| Robotic prosody      | Non-autoregressive smoothing, low emotion conditioning | Side-by-side comparison   |
| Click artifacts      | Model training instability                            | Spectral analysis         |
| Truncation           | Token limit exceeded, chunking errors                 | Test with long text       |
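A common mitigation for truncation is to split long input at sentence boundaries and synthesize each chunk separately. The sketch below is one simple way to do that; the 200-character default is an arbitrary assumption, the regex will wrongly split after abbreviations such as “Dr.”, and a single sentence longer than the limit is kept whole rather than cut mid-sentence.

```python
import re
from typing import List


def chunk_text(text: str, max_chars: int = 200) -> List[str]:
    """Greedily pack whole sentences into chunks of at most max_chars characters."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks: List[str] = []
    current = ""
    for sentence in sentences:
        if not sentence:
            continue
        candidate = f"{current} {sentence}".strip()
        if current and len(candidate) > max_chars:
            chunks.append(current)
            current = sentence
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks


print(chunk_text("One. Two. Three.", max_chars=9))
```

Testing with text several times longer than the model's limit, and checking that the joined chunks reproduce the input, catches most chunking errors before they reach listeners.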

Local vs. Cloud TTS

The choice between local and cloud TTS involves tradeoffs across several dimensions:

| Factor             | Local                                    | Cloud                                   |
|--------------------|------------------------------------------|-----------------------------------------|
| Latency            | No network round trip; model speed varies | Network plus service latency            |
| Privacy            | Can keep data on device                  | Input/audio may be processed by a server |
| Cost               | Fixed (hardware)                         | Per-character / per-second              |
| Quality            | Depends on model and hardware            | Depends on provider and voice           |
| Voice variety      | Limited by installed models and voices   | Broad provider catalogs                 |
| Offline capability | Yes, if models are installed             | Usually no                              |
| Updates            | Manual model updates                     | Automatic                               |
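Two of these rows are easy to quantify for a specific setup. The real-time factor (synthesis time divided by audio duration) captures local model speed, and a break-even volume compares a fixed hardware cost with per-character pricing. The numbers in the example calls are hypothetical placeholders, not quotes from any provider or benchmark.

```python
def real_time_factor(synthesis_seconds: float, audio_seconds: float) -> float:
    """Below 1.0 means audio is generated faster than it plays back."""
    if audio_seconds <= 0:
        raise ValueError("audio_seconds must be positive")
    return synthesis_seconds / audio_seconds


def break_even_characters(hardware_cost: float,
                          cloud_price_per_million_chars: float) -> float:
    """Characters synthesized before fixed hardware cost equals cloud spend."""
    if cloud_price_per_million_chars <= 0:
        raise ValueError("cloud price must be positive")
    return hardware_cost / cloud_price_per_million_chars * 1_000_000


# Hypothetical inputs: 2 s to synthesize 10 s of audio; 600 units of hardware
# cost against a cloud price of 15 units per million characters.
print(real_time_factor(2.0, 10.0))
print(break_even_characters(600.0, 15.0))
```

This ignores electricity, maintenance, and engineering time, so treat the break-even figure as a rough lower bound on the volume at which local inference pays off.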

Local TTS has become more practical in recent years, driven by three trends:

  1. Smaller models: Compact models such as Kokoro-82M can run locally with useful quality for many workflows.

  2. On-device hardware: Apple Silicon’s unified memory, GPUs, and dedicated accelerators in modern devices provide more compute for local inference.

  3. Open-weight licenses: Permissive model licenses can make commercial deployment simpler, though each model’s license still needs review.

The Future

Several trends are shaping the next generation of TTS:

  • Full-duplex speech: Models like PersonaPlex 7B handle both listening and speaking in a single stream, enabling natural conversational turn-taking.

  • Emotion and style control: Fine-grained control over emotional delivery, speaking rate, and emphasis is moving from research into production systems.

  • Cross-lingual voice cloning: The ability to clone a voice in one language and synthesize speech in another, preserving voice identity across languages.

  • Streaming architecture: Lower first-chunk latency makes real-time conversational use cases more practical.

  • On-device convergence: As models shrink and hardware improves, local TTS should cover more use cases that previously required cloud services.

Where TTS Is Today

The technology has reached a point where the question is no longer “does TTS work?” but “which tradeoffs fit your use case?”

If you need very expressive speech and have a GPU, LLM-based models like Orpheus or Qwen3-TTS are worth evaluating. If you need fast CPU inference with compact models, Kokoro-82M is a useful candidate. If you need voice cloning, XTTS-v2 and Chatterbox-style systems can provide zero-shot capabilities when reference audio is clean and consent is clear.

For Mac users who want local TTS without managing Python environments, model downloads, or audio pipelines, Spokio provides a native offline workflow powered by Chatterbox Turbo. It supports local voice cloning, batch export, background processing, and MP3, WAV, AIFF, and M4A export without uploading text, audio, or voice samples to cloud services.
