Published Jun 02, 2026

TTS Concepts

Pipeline and Architecture

TTS Pipeline (End-to-End) — How text flows through G2P, acoustic model, and vocoder.
Formant vs Parametric vs Neural TTS — The three generations of speech synthesis.
Autoregressive vs Non-Autoregressive TTS — Two neural architectures with different speed and quality tradeoffs.
Diffusion Models for TTS — A newer architecture that generates speech by iteratively denoising random noise.
Attention and Alignment — How TTS models learn to align text to audio.

Text and Pronunciation

G2P (Grapheme-to-Phoneme) — Converting written text into phonetic representations.
Text Tokenization — How raw text is split into tokens for TTS models.
SSML (Speech Synthesis Markup Language) — XML markup for controlling TTS output.

Voice and Quality

Prosody — The rhythm, stress, and intonation of synthetic speech.
Voice Cloning and Speaker Embedding — Replicating a voice from audio samples.
Zero-Shot vs Few-Shot vs Fine-Tuned TTS — Three approaches to voice adaptation.
Emotion and Expressiveness — How TTS handles emotional delivery.

Signal Processing

Mel-Spectrograms — The time-frequency representation used by neural TTS models.
Vocoder — Converting acoustic features into audio waveforms.

Generation and Performance

Duration Prediction — How TTS determines phoneme timing and pacing.
Chunking — Splitting long text into segments for TTS generation.
Streaming TTS — Generating speech incrementally for real-time applications.
Real-Time Factor (RTF) — Generation time divided by audio duration, the standard metric for TTS generation speed.
MOS (Mean Opinion Score) — How TTS voice quality is measured.

© 2026 Spokio