If you have been on X recently, you have seen the same pattern: developers excited about running text-to-speech on their own machines. The models are good enough now. The tooling is catching up.
But when you actually sit down to make it work, the landscape is fragmented. There is Kokoro (82M, lightweight), Orpheus (3B, expressive, heavier), Piper (ONNX, built for speed), and XTTS-v2 (voice cloning workflows). Each has different dependencies, different APIs, and different sharp edges.
This guide walks through each one with install commands, Python snippets, and practical tradeoffs.
Prerequisites
Most of the examples below share a common base:
pip install torch torchaudio soundfile
Three of the four also need phonemization support:
# macOS
brew install espeak-ng
# Linux
sudo apt-get install espeak-ng
# Windows
# Download from https://github.com/espeak-ng/espeak-ng/releases
Set up a virtual environment before anything else:
python -m venv tts-env
source tts-env/bin/activate
1. Kokoro: Lightweight Local TTS
Kokoro helped change the conversation around local TTS. At 82 million parameters, it is small enough for practical local experiments and has ranked competitively in community TTS evaluations. Review the current model card and license before using it in a product.
The install is minimal:
pip install kokoro soundfile
And the inference code is only a few lines:
from kokoro import KPipeline
import soundfile as sf
pipeline = KPipeline(lang_code="a") # 'a' = American English
generator = pipeline("Hello world, this is Kokoro speaking.", voice="af_heart")
# Each item is (graphemes, phonemes, audio); write one WAV per generated chunk
for i, (_, _, audio) in enumerate(generator):
    sf.write(f"kokoro_output_{i}.wav", audio, 24000)
That is it. The KPipeline handles text normalization, grapheme-to-phoneme conversion via the Misaki G2P engine, chunking, and streaming generation.
Available Voices
Kokoro ships with voice presets as single .pt files. The voice string follows the format {language}_{voice_name}:
voices = ["af_heart", "af_bella", "af_nicole", "am_adam", "am_michael"]
for voice in voices:
    generator = pipeline("Testing voice quality.", voice=voice)
    for i, (_, _, audio) in enumerate(generator):
        sf.write(f"kokoro_{voice}_{i}.wav", audio, 24000)
Voice availability changes by release and wrapper. Check the current model card for supported voices and languages before building a product around a specific catalog.
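Because the voice catalog can change, it can help to treat voice IDs as structured data rather than opaque strings. The sketch below splits an ID such as af_heart into its parts. It assumes the first prefix letter matches the lang_code passed to KPipeline (as with "a" above) and that the remaining prefix letter marks a voice variant such as gender; that second assumption is inferred from the preset names and should be checked against the model card. The names parse_voice_id and voices_for_language are hypothetical helpers, not part of the kokoro package.

```python
from typing import NamedTuple


class VoiceId(NamedTuple):
    lang_code: str  # e.g. "a", the value passed to KPipeline(lang_code=...)
    variant: str    # remaining prefix letter(s); assumed to mark gender
    name: str       # e.g. "heart"


def parse_voice_id(voice: str) -> VoiceId:
    """Split a Kokoro-style voice ID such as 'af_heart' into its parts."""
    prefix, sep, name = voice.partition("_")
    if not sep or not prefix or not name:
        raise ValueError(f"Unexpected voice ID format: {voice!r}")
    return VoiceId(lang_code=prefix[0], variant=prefix[1:], name=name)


def voices_for_language(voices: list[str], lang_code: str) -> list[str]:
    """Keep only the voices whose prefix matches the given language code."""
    return [v for v in voices if parse_voice_id(v).lang_code == lang_code]
```

A filter like this makes it harder to pass a voice to a pipeline that was created for a different language code.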
Long Text and Chunking
Kokoro’s pipeline can handle chunking, but you should know the default behavior. The model has a context limit, and longer text should be split at natural boundaries:
long_text = "..." * 1000 # A long article
generator = pipeline(long_text, voice="af_heart")
for i, (graphemes, phonemes, audio) in enumerate(generator):
    print(f"Chunk {i}: {len(graphemes)} chars → {len(phonemes)} phonemes")
    sf.write(f"kokoro_chunk_{i}.wav", audio, 24000)
Each chunk gets its own WAV file. In a real app, you would concatenate them with soundfile or pydub.
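If you would rather not add a dependency just to join the chunk files, the standard library's wave module is one option. This is a minimal sketch, assuming every chunk was written as plain PCM WAV with the same channel count, sample width, and sample rate; concat_wavs is a hypothetical helper name.

```python
import wave


def _wav_format(path: str) -> tuple[int, int, int]:
    """Return (channels, sample width in bytes, sample rate) for a WAV file."""
    with wave.open(path, "rb") as src:
        return (src.getnchannels(), src.getsampwidth(), src.getframerate())


def concat_wavs(paths: list[str], output: str) -> int:
    """Append PCM WAV files in order and return the total frame count."""
    if not paths:
        raise ValueError("No input files given")
    # Check all inputs before creating the output file
    formats = [_wav_format(p) for p in paths]
    if len(set(formats)) != 1:
        raise ValueError(f"Input files differ in format: {sorted(set(formats))}")
    channels, sample_width, frame_rate = formats[0]
    total_frames = 0
    with wave.open(output, "wb") as out:
        out.setnchannels(channels)
        out.setsampwidth(sample_width)
        out.setframerate(frame_rate)
        for path in paths:
            with wave.open(path, "rb") as src:
                out.writeframes(src.readframes(src.getnframes()))
                total_frames += src.getnframes()
    return total_frames
```

Refusing mismatched inputs is deliberate: joining files with different sample rates produces audio that plays at the wrong speed for part of the output.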
ONNX for Even Smaller Footprint
If you need to minimize dependencies or run in a browser-like environment, ONNX variants can reduce deployment friction:
pip install kokoro-onnx

from kokoro_onnx import Kokoro
import soundfile as sf
# The model and voices files are downloaded separately; file names vary by release,
# and the voice you request must exist in the voices file you load.
kokoro = Kokoro("kokoro-v1.0.onnx", "voices-v1.0.bin")
audio, sr = kokoro.create("Hello from ONNX.", voice="af_heart", speed=1.0)
sf.write("kokoro_onnx_output.wav", audio, sr)
The tradeoff: you may need to manage model files, voice files, quantization choices, and runtime compatibility directly instead of relying on pipeline defaults.
When to Use Kokoro
Kokoro is a good starting point when you want lightweight local TTS with a simple Python path. It can be fast enough for batch processing on CPU and small enough to consider for desktop packaging. The tradeoff is that the common Kokoro workflows focus on preset voices rather than voice cloning, and the compact model limits prosodic variety compared to larger architectures.
2. Orpheus: Expressive, Heavier TTS
Orpheus TTS by Canopy AI is a 3-billion-parameter model built on a Llama backbone. It is designed for expressive speech, including laughter, sighs, and emotional inflections in supported workflows.
The key feature that sets Orpheus apart is emotion-style tags: supported workflows can embed <giggle>, <laugh>, <sigh>, and other markers directly in text.
Installation via Pip
pip install orpheus-speech
Basic Inference
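from orpheus_tts import OrpheusModel
import wave
# Argument names follow the orpheus-speech README at the time of writing; verify against the current release.
model = OrpheusModel(model_name="canopylabs/orpheus-3b-0.1-ft")
audio_chunks = model.generate_speech(
    prompt="I <giggle> can't believe this actually works </giggle> on my laptop.",
    voice="tara",
)
# generate_speech streams 16-bit mono PCM chunks rather than returning one array
with wave.open("orpheus_output.wav", "wb") as wf:
    wf.setnchannels(1)
    wf.setsampwidth(2)
    wf.setframerate(24000)
    for chunk in audio_chunks:
        wf.writeframes(chunk)
The model downloads weights from Hugging Face on first run, and larger checkpoints can take several gigabytes.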
Quantized for Lower VRAM
For lower-memory systems, evaluate a quantized GGUF version via LM Studio or a community local runner:
git clone https://github.com/isaiahbjork/orpheus-tts-local
cd orpheus-tts-local
pip install -r requirements.txt
python gguf_orpheus.py --text "Hello from a quantized model." --voice tara
Quantized GGUF variants can reduce memory requirements and may run on CPU if needed, but performance depends heavily on hardware and runner.
Emotion Tag Reference
| Tag | Effect |
|---|---|
| <giggle> | Light laughter in speech |
| <laugh> | Full laughter |
| <sigh> | Audible sigh |
| <cough> | Cough interjection |
| <sniffle> | Sniffle |
| <groan> | Groan |
| <yawning> | Yawn |
| </giggle> etc. | End emotion segment |
Tag behavior depends on the model, runner, and prompt style. Test the exact tags you plan to expose before building UI around them.
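One way to test tags before exposing them is to check user text against an allow-list and flag anything else. The sketch below is a rough starting point: KNOWN_TAGS is copied from the reference table above and is an assumption about what your runner supports, and find_unknown_tags and strip_tags are hypothetical helper names.

```python
import re

# Tag names taken from the reference table; adjust to what your runner accepts.
KNOWN_TAGS = {"giggle", "laugh", "sigh", "cough", "sniffle", "groan", "yawning"}

# Matches opening and closing markers such as <sigh> or </giggle>
TAG_PATTERN = re.compile(r"</?([A-Za-z_]+)>")


def find_unknown_tags(text: str, known: set[str] = KNOWN_TAGS) -> list[str]:
    """Return tag names that are not in the known set, in order of appearance."""
    return [
        m.group(1)
        for m in TAG_PATTERN.finditer(text)
        if m.group(1).lower() not in known
    ]


def strip_tags(text: str) -> str:
    """Remove all tag markers, e.g. before sending text to a model without tag support."""
    without_tags = TAG_PATTERN.sub("", text)
    return re.sub(r"\s+", " ", without_tags).strip()
```

Running find_unknown_tags on input before synthesis lets a UI reject or warn about markers the model would otherwise read aloud or ignore.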
Available Voices
Orpheus examples commonly reference voices such as tara, jess, leo, dan, mia, zac, zoe, and sam. Some workflows support reference-audio voice cloning, though quality depends heavily on the reference matching the target speaking style.
System Requirements
- GPU minimum: depends on quantization and runner
- GPU recommended: test FP16 workflows on your target hardware
- CPU inference: possible in some GGUF workflows, usually much slower
- RAM: depends on quantization, model size, and context length
When to Use Orpheus
Orpheus is for projects where expressive delivery is a priority and you have enough hardware headroom. The emotion-style tags can give you more delivery control than many smaller models. The tradeoff: it is heavier, English-focused in common examples, and the 3B parameter model has a slower startup time than smaller alternatives.
3. Piper: Fast CPU Inference
Piper is built for fast local TTS. Built on ONNX Runtime, it is designed for resource-constrained hardware such as the Raspberry Pi and is used in Home Assistant voice setups. Voice quality depends on the chosen voice and quality tier, with speed as the main advantage.
Installation
pip install piper-tts
Downloading a Voice Model
Piper distributes voices as paired .onnx and .onnx.json files:
wget https://huggingface.co/rhasspy/piper-voices/resolve/v1.0.0/en/en_US/amy/medium/en_US-amy-medium.onnx
wget https://huggingface.co/rhasspy/piper-voices/resolve/v1.0.0/en/en_US/amy/medium/en_US-amy-medium.onnx.json
Inference
import wave
from piper import PiperVoice
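# The matching .onnx.json config is expected next to the model file
voice = PiperVoice.load("en_US-amy-medium.onnx")
with wave.open("piper_output.wav", "w") as wav:
    # Method names differ between piper-tts releases (newer ones expose synthesize_wav); check your installed version
    voice.synthesize("Piper is fast because it runs on ONNX.", wav)
Streaming Audio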
Piper supports chunked generation for real-time applications:
audio_stream = voice.synthesize_stream_raw("Long text that streams in real time.")
for chunk in audio_stream:
    # Each chunk is raw PCM audio data
    process_audio_chunk(chunk)  # placeholder for your own playback or network code
Voice Catalog
Piper offers voices in multiple quality tiers across many languages:
- Low (~10MB): Fastest, lowest quality
- Medium (~50MB): Balanced speed and quality
- High (~200MB): Highest tier in the catalog, still CPU-oriented
Browse the full catalog at huggingface.co/rhasspy/piper-voices.
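The two download URLs shown earlier follow a regular path layout, so a small helper can build both links for any voice. This sketch infers the pattern from those two example URLs only; the layout, the v1.0.0 revision, and the helper name piper_voice_urls are assumptions, so confirm against the catalog before relying on it.

```python
BASE_URL = "https://huggingface.co/rhasspy/piper-voices/resolve/v1.0.0"


def piper_voice_urls(locale: str, name: str, quality: str) -> tuple[str, str]:
    """Build the model and config URLs for a voice such as en_US / amy / medium."""
    language = locale.split("_")[0]      # "en_US" -> "en"
    stem = f"{locale}-{name}-{quality}"  # "en_US-amy-medium"
    model_url = f"{BASE_URL}/{language}/{locale}/{name}/{quality}/{stem}.onnx"
    # The config file sits next to the model with ".json" appended
    return model_url, model_url + ".json"
```

For example, piper_voice_urls("en_US", "amy", "medium") reproduces the two wget targets used above.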
When to Use Piper
Piper is useful for embedded systems, real-time streaming where latency matters more than absolute quality, and applications needing broad language coverage without large model downloads. If you need highly natural voices for user-facing content, compare Piper directly against Kokoro, Orpheus, and commercial options with your own text.
4. XTTS-v2: Voice Cloning from 3 Seconds
Coqui, the company behind the original TTS library, shut down in 2024, but the Idiap Research Institute maintains a community fork, published as the coqui-tts package. XTTS-v2 remains a common option for local voice cloning workflows; it can use short reference samples and supports multiple languages.
Installation
pip install coqui-tts
(Optional: add torch and torchaudio if they are not already installed.)
Basic Inference
from TTS.api import TTS
tts = TTS("tts_models/multilingual/multi-dataset/xtts_v2").to("cuda")
tts.tts_to_file(
    text="This voice was cloned from a three second sample.",
    speaker_wav="reference.wav",  # 3-10 second reference audio
    language="en",
    file_path="xtts_output.wav",
)
Voice Cloning from a Microphone
To clone from a fresh recording, capture a short reference clip from the microphone:
import sounddevice as sd
import soundfile as sf
from TTS.api import TTS
# Record 5 seconds from mic
audio = sd.rec(int(5 * 24000), samplerate=24000, channels=1)
sd.wait()
sf.write("mic_ref.wav", audio, 24000)
# Clone voice
tts = TTS("tts_models/multilingual/multi-dataset/xtts_v2").to("cuda")
tts.tts_to_file(
    text="I just cloned my own voice from a microphone recording.",
    speaker_wav="mic_ref.wav",
    language="en",
    file_path="xtts_mic_clone.wav",
)
Streaming Inference
XTTS-v2 supports streaming workflows in some configurations:
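import os
from huggingface_hub import snapshot_download
from TTS.tts.configs.xtts_config import XttsConfig
from TTS.tts.models.xtts import Xtts
# Follows the streaming example in the Coqui TTS docs; verify against your installed version.
model_path = snapshot_download("coqui/XTTS-v2")
config = XttsConfig()
config.load_json(os.path.join(model_path, "config.json"))
model = Xtts.init_from_config(config)
model.load_checkpoint(config, checkpoint_dir=model_path)
# Speaker conditioning is computed once from the reference clip, then reused
gpt_cond_latent, speaker_embedding = model.get_conditioning_latents(audio_path=["reference.wav"])
for chunk in model.inference_stream("Long text...", "en", gpt_cond_latent, speaker_embedding):
    process_audio_chunk(chunk)  # placeholder for your own playback or network code
Cross-Lingual Cloning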
The notable feature: use a reference voice and synthesize speech in another supported language:
tts.tts_to_file(
    text="Bonjour, je parle français avec une voix anglaise clonée.",
    speaker_wav="english_reference.wav",
    language="fr",
    file_path="xtts_crosslingual.wav",
)
The voice character can carry over, but accent and pronunciation quality vary with the reference audio, language, and model settings.
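Since results vary with the reference audio, a quick sanity check on the clip before cloning can save a wasted run. The sketch below only checks duration, using the 3 to 10 second range from the comment in the basic example as a default; treat that range as a rule of thumb rather than a hard limit. It assumes a plain PCM WAV file, and check_reference_clip is a hypothetical helper name.

```python
import wave


def check_reference_clip(path: str, min_s: float = 3.0, max_s: float = 10.0) -> float:
    """Return the clip duration in seconds, or raise if it is outside the range."""
    with wave.open(path, "rb") as wav:
        duration = wav.getnframes() / wav.getframerate()
    if not min_s <= duration <= max_s:
        raise ValueError(
            f"{path} is {duration:.1f}s long; expected {min_s:.0f}-{max_s:.0f}s"
        )
    return duration
```

Duration is only the cheapest thing to verify; background noise and clipping matter too, but need real audio analysis.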
Supported Languages
Arabic, Chinese, Czech, Dutch, English, French, German, Hindi, Hungarian, Italian, Japanese, Korean, Polish, Portuguese, Russian, Spanish, Turkish.
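The language argument in the examples takes short codes such as "en" and "fr" rather than language names. A lookup table like the one below can catch typos before a long synthesis run. Only "en" and "fr" appear in the examples above; the remaining codes are assumptions based on common ISO-style codes and should be checked against the model card. language_code is a hypothetical helper name.

```python
# Language names from the list above; codes other than "en" and "fr" are assumed.
XTTS_LANGUAGE_CODES = {
    "Arabic": "ar", "Chinese": "zh-cn", "Czech": "cs", "Dutch": "nl",
    "English": "en", "French": "fr", "German": "de", "Hindi": "hi",
    "Hungarian": "hu", "Italian": "it", "Japanese": "ja", "Korean": "ko",
    "Polish": "pl", "Portuguese": "pt", "Russian": "ru", "Spanish": "es",
    "Turkish": "tr",
}


def language_code(name: str) -> str:
    """Map a language name (case-insensitive) to the code passed as language=..."""
    try:
        return XTTS_LANGUAGE_CODES[name.strip().title()]
    except KeyError:
        supported = ", ".join(sorted(XTTS_LANGUAGE_CODES))
        raise ValueError(f"Unsupported language {name!r}; expected one of: {supported}") from None
```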
System Requirements
- GPU minimum: depends on model version and runtime
- GPU recommended: test on your target hardware
- CPU: possible in some setups, usually very slow
- RAM: depends on model version, batch size, and reference audio
When to Use XTTS-v2
XTTS-v2 is worth evaluating when you need local voice cloning without fine-tuning. The tradeoffs are significant: it is heavier than small TTS models, GPU support matters, and the Coqui ecosystem has shifted to community maintenance.
Quick Reference Comparison
| Feature | Kokoro 82M | Orpheus 3B | Piper | XTTS-v2 |
|---|---|---|---|---|
| Min GPU VRAM | CPU-friendly | Depends on quantization | CPU-friendly | GPU recommended |
| CPU Inference | Yes | Possible in some runners | Yes | Possible but slow |
| Voice Cloning | Common workflows: no | Workflow-dependent | No | Yes |
| Languages | Check current model card | English-focused examples | Broad catalog | Multiple languages |
| License | Review current license | Review current license | Review current license | Review current license |
| Model Size | Small | Large | Small to medium | Medium to large |
| Quality | Strong for size | Expressive | Voice-dependent | Strong cloning workflows |
| Good Fit | CPU / edge | Expressive dialogue | Speed / RPi | Voice cloning |
Performance Considerations
Performance depends on hardware, quantization, runtime, text length, and whether the model is already warm. Use the table below as a testing plan rather than a benchmark claim:
| Model | What to Measure | Expected Pattern |
|---|---|---|
| Kokoro (CPU) | First load, chunk speed, memory | Good lightweight baseline |
| Kokoro (ONNX CPU) | Runtime setup, quantization quality | Smaller deployment surface |
| Orpheus (GPU) | Startup time, memory, tag behavior | Stronger expression, heavier runtime |
| Orpheus (Q4 CPU) | CPU feasibility | Slower, useful for experiments |
| Piper (CPU) | Latency, streaming, voice quality | Fast CPU path |
| XTTS-v2 (GPU) | Reference quality, first chunk, total time | Useful cloning, heavier runtime |
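Real-Time Factor, as used here, is audio duration / wall time, so a value above 1.0 means faster than real time. Some benchmarks report the inverse ratio (wall time / audio duration), where lower is better, so check which convention a published number uses.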
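To put numbers into the testing plan, a small timing wrapper is enough. This is a minimal sketch using the audio duration / wall time ratio described above. The real_time_factor helper is hypothetical, and it assumes you pass it a zero-argument callable that runs one synthesis and returns the number of audio samples produced.

```python
import time
from typing import Callable


def real_time_factor(synthesize: Callable[[], int], sample_rate: int) -> float:
    """Run synthesize() once and return audio duration divided by wall time.

    synthesize must return the number of audio samples it produced.
    """
    start = time.perf_counter()
    n_samples = synthesize()
    wall_seconds = time.perf_counter() - start
    if wall_seconds <= 0:
        raise RuntimeError("Elapsed time too small to measure")
    audio_seconds = n_samples / sample_rate
    return audio_seconds / wall_seconds
```

With Kokoro, for instance, the callable could iterate the generator and return the summed length of the audio chunks, with sample_rate=24000. Measure a warm run separately from the first call, since model loading would otherwise dominate the result.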
Building a Real Application
A production local TTS app needs more than model inference. Here is what the surrounding system should handle:
Text Preprocessing
Raw text contains punctuation, numbers, URLs, code identifiers, and Markdown syntax. Strip or normalize before sending to the model:
import re
def preprocess_text(text: str) -> str:
    """Normalize text for TTS input."""
    # Strip Markdown links, keeping only the link text
    text = re.sub(r"\[([^\]]+)\]\([^)]+\)", r"\1", text)
    # Remove code blocks
    text = re.sub(r"```.*?```", "", text, flags=re.DOTALL)
    # Normalize whitespace
    text = re.sub(r"\s+", " ", text).strip()
    return text
Chunking Large Documents
Models have token limits. Split text at natural boundaries:
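import re
def chunk_text(text: str, max_chars: int = 2000) -> list[str]:
    """Split text into TTS-friendly chunks at sentence boundaries."""
    sentences = re.split(r"(?<=[.!?])\s+", text)
    chunks, current = [], []
    length = 0
    for sentence in sentences:
        # Start a new chunk when adding this sentence would exceed the limit
        if length + len(sentence) > max_chars and current:
            chunks.append(" ".join(current))
            current, length = [], 0
        current.append(sentence)
        length += len(sentence) + 1  # +1 accounts for the joining space
    # Note: a single sentence longer than max_chars still becomes its own oversized chunk
    if current:
        chunks.append(" ".join(current))
    return chunks
Audio Concatenation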
Join chunked outputs into a single file:
import numpy as np
import soundfile as sf
def concatenate_audio(audio_files: list[str], output: str, sr: int = 24000):
    """Merge multiple WAV files into one (all inputs must share the sample rate sr)."""
    combined = np.concatenate([sf.read(f)[0] for f in audio_files])
    sf.write(output, combined, sr)
Queue Management
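For batch processing, a simple worker queue is a reasonable starting point, and the worker loop is a natural place to report progress:
from dataclasses import dataclass
from queue import Queue
from threading import Thread
import soundfile as sf

@dataclass
class TTSJob:
    text: str
    voice: str
    output_path: str

class TTSQueue:
    def __init__(self, pipeline):
        self.queue = Queue()
        self.pipeline = pipeline
        # Daemon thread so the worker does not block interpreter exit
        Thread(target=self.worker, daemon=True).start()
    def worker(self):
        while True:
            job = self.queue.get()  # blocks until a job is available
            try:
                generator = self.pipeline(job.text, voice=job.voice)
                for i, (_, _, audio) in enumerate(generator):
                    sf.write(f"{job.output_path}_chunk_{i}.wav", audio, 24000)
                print(f"Done: {job.output_path} ({self.queue.qsize()} jobs waiting)")
            finally:
                # Always mark the job done so queue.join() can return
                self.queue.task_done()
    def submit(self, job: TTSJob):
        self.queue.put(job)
Which Model Should You Start With?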
The answer depends on your hardware and use case:
- You have a typical laptop and want to try TTS quickly: Start with Kokoro. The install is simple, the code is short, and it is a practical lightweight baseline.
- You have a modern GPU and want expressive output: Try Orpheus. The emotion-style tags are worth testing if delivery matters.
- You need broad language coverage or fast CPU inference: Try Piper. It may not be the most natural-sounding model for every use case, but it is a strong speed-oriented baseline.
- You need voice cloning for a specific voice: Evaluate XTTS-v2 and other current cloning workflows. Plan for consent, licensing, hardware requirements, and larger model downloads.
If you want local TTS on Mac without managing Python dependencies, Spokio is a native Mac app powered by Chatterbox Turbo. It supports local voice cloning, batch export, MP3/WAV/AIFF/M4A output, Apple Silicon and Intel Macs, and no cloud uploads for text, audio, or voice samples. But if you want to understand how models work under the hood, running them yourself in Python is a good way to learn.
