
Local TTS with Python: A Practical Guide to Open-Source Speech Models

A practical developer guide to running local TTS models with Python — Kokoro, Orpheus, Piper, and XTTS-v2 compared with code examples, performance considerations, and deployment tradeoffs.

Updated on May 22, 2026 (14 min read)

If you have been on X recently, you have seen the same pattern: developers excited about running text-to-speech on their own machines. The models are good enough now. The tooling is catching up.

But when you actually sit down to make it work, the landscape is fragmented. There is Kokoro (82M, lightweight), Orpheus (3B, expressive, heavier), Piper (ONNX, built for speed), and XTTS-v2 (voice cloning workflows). Each has different dependencies, different APIs, and different sharp edges.

This guide walks through each one with install commands, Python snippets, and practical tradeoffs.

Prerequisites

The PyTorch-based examples below share a common base (Piper runs on ONNX Runtime and does not need torch):

pip install torch torchaudio soundfile

Three of the four also need phonemization support:

# macOS
brew install espeak-ng

# Linux
sudo apt-get install espeak-ng

# Windows
# Download from https://github.com/espeak-ng/espeak-ng/releases
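
Before moving on, it can help to confirm that the espeak-ng binary is actually visible on your PATH, since a missing phonemizer tends to surface later as a confusing import or runtime error. A minimal sketch using only the standard library; the helper name has_espeak is hypothetical, and the binary name espeak-ng is assumed to match what your platform installs:

```python
import shutil


def has_espeak(binary: str = "espeak-ng") -> bool:
    """Return True if the given executable can be found on PATH."""
    return shutil.which(binary) is not None


if __name__ == "__main__":
    if has_espeak():
        print("espeak-ng found at", shutil.which("espeak-ng"))
    else:
        print("espeak-ng not found; install it before running the examples")
```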

Set up a virtual environment before anything else:

python -m venv tts-env
source tts-env/bin/activate  # Windows: tts-env\Scripts\activate

1. Kokoro — Lightweight Local TTS

Kokoro helped change the conversation around local TTS. At 82 million parameters, it is small enough for practical local experiments and has ranked competitively in community TTS evaluations. Review the current model card and license before using it in a product.

The install is minimal:

pip install kokoro soundfile

And the inference code is only a few lines:

from kokoro import KPipeline
import soundfile as sf

pipeline = KPipeline(lang_code="a")  # 'a' = American English
generator = pipeline("Hello world, this is Kokoro speaking.", voice="af_heart")

for i, (_, _, audio) in enumerate(generator):
    sf.write(f"kokoro_output_{i}.wav", audio, 24000)

That is it. The KPipeline handles text normalization, grapheme-to-phoneme conversion via the Misaki G2P engine, chunking, and streaming generation.

Available Voices

Kokoro ships voice presets as individual .pt files. The voice ID follows the pattern {language}{gender}_{name}: in af_heart, a stands for American English and f for female.

voices = ["af_heart", "af_bella", "af_nicole", "am_adam", "am_michael"]
for voice in voices:
    generator = pipeline("Testing voice quality.", voice=voice)
    for i, (_, _, audio) in enumerate(generator):
        sf.write(f"kokoro_{voice}_{i}.wav", audio, 24000)

Voice availability changes by release and wrapper. Check the current model card for supported voices and languages before building a product around a specific catalog.
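
If you expose voices in a UI, it can be handy to turn a voice ID into readable metadata. The sketch below assumes the two-letter prefix encodes language and gender, as in af_heart and am_adam. The function name parse_voice_id and the lookup tables are hypothetical, and the language table is deliberately partial:

```python
from typing import NamedTuple

# Partial lookup tables, assumed from the voice naming convention; extend as needed.
LANGUAGES = {"a": "American English", "b": "British English"}
GENDERS = {"f": "female", "m": "male"}


class VoiceInfo(NamedTuple):
    language: str
    gender: str
    name: str


def parse_voice_id(voice_id: str) -> VoiceInfo:
    """Split an ID like 'af_heart' into language, gender, and name."""
    prefix, sep, name = voice_id.partition("_")
    if not sep or len(prefix) != 2 or not name:
        raise ValueError(f"Unexpected voice ID format: {voice_id!r}")
    lang_code, gender_code = prefix
    return VoiceInfo(
        language=LANGUAGES.get(lang_code, f"unknown ({lang_code})"),
        gender=GENDERS.get(gender_code, f"unknown ({gender_code})"),
        name=name,
    )


print(parse_voice_id("af_heart"))
```

Unknown prefixes are reported rather than rejected, so the helper keeps working when a new release adds languages.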

Long Text and Chunking

Kokoro’s pipeline chunks long input for you: it splits the text and yields one (graphemes, phonemes, audio) tuple per chunk. The model has a context limit, so longer text should be split at natural boundaries:

long_text = "..." * 1000  # A long article

generator = pipeline(long_text, voice="af_heart")
for i, (graphemes, phonemes, audio) in enumerate(generator):
    print(f"Chunk {i}: {len(graphemes)} chars → {len(phonemes)} phonemes")
    sf.write(f"kokoro_chunk_{i}.wav", audio, 24000)

Each chunk gets its own WAV file. In a real app, you would concatenate the audio arrays with NumPy and write a single file with soundfile, or join the files with pydub.

ONNX for Even Smaller Footprint

If you need to minimize dependencies or run in a browser-like environment, ONNX variants can reduce deployment friction:

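pip install kokoro-onnx

from kokoro_onnx import Kokoro
import soundfile as sf

# Model and voices file names vary by release; use the pair you downloaded.
kokoro = Kokoro("kokoro-v0_19.onnx", "voices.json")
# The voice must exist in that voices file (af_heart is not in every release).
audio, sr = kokoro.create("Hello from ONNX.", voice="af_bella", speed=1.0)
sf.write("kokoro_onnx_output.wav", audio, sr)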

The tradeoff: you may need to manage model files, voice files, quantization choices, and runtime compatibility directly instead of relying on pipeline defaults.

When to Use Kokoro

Kokoro is a good starting point when you want lightweight local TTS with a simple Python path. It can be fast enough for batch processing on CPU and small enough to consider for desktop packaging. The tradeoff is that the common Kokoro workflows focus on preset voices rather than voice cloning, and the compact model limits prosodic variety compared to larger architectures.

2. Orpheus — Expressive, Heavier TTS

Orpheus TTS by Canopy Labs is a 3-billion-parameter model built on a Llama backbone. It is designed for expressive speech, including laughter, sighs, and emotional inflections in supported workflows.

The key feature that sets Orpheus apart is emotion-style tags: supported workflows can embed markers such as <laugh>, <chuckle>, and <sigh> directly in the text.

Installation via Pip

pip install orpheus-speech

Basic Inference

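import wave
from orpheus_tts import OrpheusModel

model = OrpheusModel(model_name="canopylabs/orpheus-3b-0.1-ft")

# generate_speech takes keyword arguments and yields raw 16-bit PCM chunks
chunks = model.generate_speech(
    prompt="I can't believe this actually works <laugh> on my laptop.",
    voice="tara",
)

# Write the streamed chunks as 24 kHz mono WAV
with wave.open("orpheus_output.wav", "wb") as wf:
    wf.setnchannels(1)
    wf.setsampwidth(2)
    wf.setframerate(24000)
    for chunk in chunks:
        wf.writeframes(chunk)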

The model downloads weights from Hugging Face on first run, and larger checkpoints can take several gigabytes.
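
Because the first run can pull several gigabytes, a quick free-space check before loading the model can save a failed download. A minimal sketch with the standard library; the helper name has_free_space and the 10 GB threshold are assumptions for illustration, and you should point path at wherever your Hugging Face cache actually lives:

```python
import shutil


def has_free_space(path: str = ".", required_gb: float = 10.0) -> bool:
    """Return True if the filesystem holding `path` has at least required_gb free."""
    free_gb = shutil.disk_usage(path).free / 1024**3
    return free_gb >= required_gb


if not has_free_space(".", required_gb=10.0):
    print("Warning: less than 10 GB free; the model download may fail")
```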

Quantized for Lower VRAM

For lower-memory systems, evaluate a quantized GGUF version via LM Studio or a community local runner:

git clone https://github.com/isaiahbjork/orpheus-tts-local
cd orpheus-tts-local
pip install -r requirements.txt
python gguf_orpheus.py --text "Hello from a quantized model." --voice tara

Quantized GGUF variants can reduce memory requirements and may run on CPU if needed, but performance depends heavily on hardware and runner.

Emotion Tag Reference

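Tag        Effect
<laugh>    Full laughter
<chuckle>  Light laughter in speech
<sigh>     Audible sigh
<cough>    Cough interjection
<sniffle>  Sniffle
<groan>    Groan
<yawn>     Yawn
<gasp>     Gasp

These tags are standalone markers placed at the point where the sound should occur; the published examples do not use closing tags.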

Tag behavior depends on the model, runner, and prompt style. Test the exact tags you plan to expose before building UI around them.
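
One way to act on that advice is to reject input containing tags you have not verified. The sketch below scans text for angle-bracket tags and reports any that are not on an allowlist. The helper name and the allowlist contents are assumptions for illustration; fill the set from your own testing with your model and runner:

```python
import re

# Illustrative subset only; replace with the tags you have verified yourself.
VERIFIED_TAGS = {"laugh", "sigh", "cough", "sniffle", "groan"}

# Matches <name> and </name>; tags are compared by name only.
TAG_PATTERN = re.compile(r"</?([a-zA-Z_]+)>")


def find_unverified_tags(text: str, verified: set[str] = VERIFIED_TAGS) -> list[str]:
    """Return the tags found in text whose names are not in the verified set."""
    unverified = []
    for match in TAG_PATTERN.finditer(text):
        if match.group(1).lower() not in verified:
            unverified.append(match.group(0))
    return unverified


print(find_unverified_tags("Well <sigh> that was <wink> unexpected."))
```

Returning the offending tags, rather than silently stripping them, makes it easier to show a useful error in the UI.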

Available Voices

Orpheus examples commonly reference voices such as tara, leah, jess, leo, dan, mia, zac, and zoe. Some workflows support reference-audio voice cloning, though quality depends heavily on the reference matching the target speaking style.

System Requirements

  • GPU minimum: depends on quantization and runner
  • GPU recommended: test FP16 workflows on your target hardware
  • CPU inference: possible in some GGUF workflows, usually much slower
  • RAM: depends on quantization, model size, and context length

When to Use Orpheus

Orpheus is for projects where expressive delivery is a priority and you have enough hardware headroom. The emotion-style tags can give you more delivery control than many smaller models. The tradeoff: it is heavier, English-focused in common examples, and the 3B parameter model has a slower startup time than smaller alternatives.

3. Piper — Fast CPU Inference

Piper is built for fast local TTS. It runs on ONNX Runtime and targets resource-constrained hardware such as the Raspberry Pi, and it is widely used in Home Assistant setups. Voice quality depends on the chosen voice and quality tier, with speed as the main advantage.

Installation

pip install piper-tts

Downloading a Voice Model

Piper distributes voices as paired .onnx and .onnx.json files:

wget https://huggingface.co/rhasspy/piper-voices/resolve/v1.0.0/en/en_US/amy/medium/en_US-amy-medium.onnx
wget https://huggingface.co/rhasspy/piper-voices/resolve/v1.0.0/en/en_US/amy/medium/en_US-amy-medium.onnx.json
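
If you download more than one voice, building the two URLs from a voice ID avoids copy-paste mistakes. The sketch below assumes every voice follows the same path layout as the amy example above (language, locale, name, quality); the helper name piper_voice_urls is hypothetical, and the layout should be checked against the catalog for the voice you want:

```python
BASE_URL = "https://huggingface.co/rhasspy/piper-voices/resolve/v1.0.0"


def piper_voice_urls(voice_id: str, base_url: str = BASE_URL) -> tuple[str, str]:
    """Return (model_url, config_url) for an ID like 'en_US-amy-medium'."""
    parts = voice_id.split("-")
    if len(parts) != 3:
        raise ValueError(f"Expected locale-name-quality, got {voice_id!r}")
    locale, name, quality = parts
    language = locale.split("_")[0]  # 'en_US' -> 'en'
    stem = f"{base_url}/{language}/{locale}/{name}/{quality}/{voice_id}"
    return f"{stem}.onnx", f"{stem}.onnx.json"


model_url, config_url = piper_voice_urls("en_US-amy-medium")
print(model_url)
print(config_url)
```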

Inference

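import wave
from piper import PiperVoice

voice = PiperVoice.load("en_US-amy-medium.onnx")

# Method names differ between piper-tts releases; newer ones provide
# synthesize_wav(text, wav_file) for this. Check your installed version.
with wave.open("piper_output.wav", "wb") as wav:
    voice.synthesize("Piper is fast because it runs on ONNX.", wav)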

Streaming Audio

Piper supports chunked generation for real-time applications:

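def process_audio_chunk(chunk: bytes) -> None:
    """Placeholder consumer: send the chunk to your audio device or buffer."""
    print(f"received {len(chunk)} bytes")

# synthesize_stream_raw is the name in older piper-tts releases; check your version.
audio_stream = voice.synthesize_stream_raw("Long text that streams in real time.")

for chunk in audio_stream:
    # Each chunk is raw 16-bit PCM audio data
    process_audio_chunk(chunk)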
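
One simple consumer for a stream like this is a function that appends each chunk to a WAV file as it arrives. The sketch below assumes the chunks are 16-bit mono PCM bytes and that you pass the voice's sample rate yourself (it is listed in the voice's .onnx.json config; 22050 Hz is used here only as an example). The helper name write_pcm_stream is hypothetical:

```python
import wave
from typing import Iterable


def write_pcm_stream(chunks: Iterable[bytes], path: str, sample_rate: int) -> int:
    """Write 16-bit mono PCM chunks to a WAV file; return total bytes written."""
    total = 0
    with wave.open(path, "wb") as wav:
        wav.setnchannels(1)      # mono
        wav.setsampwidth(2)      # 16-bit samples
        wav.setframerate(sample_rate)
        for chunk in chunks:
            wav.writeframes(chunk)
            total += len(chunk)
    return total


# Stand-in for a real audio stream: two short chunks of silence.
fake_stream = [b"\x00\x00" * 1024, b"\x00\x00" * 512]
print(write_pcm_stream(fake_stream, "stream_demo.wav", sample_rate=22050), "bytes written")
```

Because the function accepts any iterable of bytes, the same code works with a live generator and with a list of test chunks.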

Voice Catalog

Piper offers voices in multiple quality tiers across many languages:

  • Low (~10MB): Fastest, lowest quality
  • Medium (~50MB): Balanced speed and quality
  • High (~200MB): Highest tier in the catalog, still CPU-oriented

Browse the full catalog at huggingface.co/rhasspy/piper-voices.

When to Use Piper

Piper is useful for embedded systems, real-time streaming where latency matters more than absolute quality, and applications needing broad language coverage without large model downloads. If you need highly natural voices for user-facing content, compare Piper directly against Kokoro, Orpheus, and commercial options with your own text.

4. XTTS-v2 — Voice Cloning from 3 Seconds

Coqui, the company behind the original TTS library, shut down in early 2024, but the Idiap Research Institute maintains a community fork, published on PyPI as coqui-tts. XTTS-v2 remains a common option for local voice cloning workflows; it can use short reference samples and supports multiple languages.

Installation

pip install coqui-tts

(Optional: add torch and torchaudio if they are not already installed.)

Basic Inference

import torch
from TTS.api import TTS

# Fall back to CPU when no CUDA GPU is available (much slower)
device = "cuda" if torch.cuda.is_available() else "cpu"
tts = TTS("tts_models/multilingual/multi-dataset/xtts_v2").to(device)

tts.tts_to_file(
    text="This voice was cloned from a three second sample.",
    speaker_wav="reference.wav",  # 3-10 second reference audio
    language="en",
    file_path="xtts_output.wav",
)

Voice Cloning from a Microphone

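To clone from a fresh recording, capture a short reference clip from the microphone first (this needs pip install sounddevice):

import sounddevice as sd
import soundfile as sf
import torch
from TTS.api import TTS

# Record 5 seconds from mic
audio = sd.rec(int(5 * 24000), samplerate=24000, channels=1)
sd.wait()
sf.write("mic_ref.wav", audio, 24000)

# Clone voice (falls back to CPU when no CUDA GPU is available)
device = "cuda" if torch.cuda.is_available() else "cpu"
tts = TTS("tts_models/multilingual/multi-dataset/xtts_v2").to(device)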
tts.tts_to_file(
    text="I just cloned my own voice from a microphone recording.",
    speaker_wav="mic_ref.wav",
    language="en",
    file_path="xtts_mic_clone.wav",
)
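
Since the basic example suggests a 3-10 second reference clip, it is worth checking the clip length before spending time on synthesis. A minimal sketch using the standard wave module, which only reads uncompressed PCM WAV files; the helper names and the default bounds are assumptions taken from that comment, not limits enforced by the library:

```python
import wave


def wav_duration_seconds(path: str) -> float:
    """Return the duration of a PCM WAV file in seconds."""
    with wave.open(path, "rb") as wav:
        return wav.getnframes() / wav.getframerate()


def is_usable_reference(path: str, min_s: float = 3.0, max_s: float = 10.0) -> bool:
    """Return True if the clip length falls inside the suggested range."""
    return min_s <= wav_duration_seconds(path) <= max_s
```

For example, is_usable_reference("mic_ref.wav") can gate the cloning call and prompt the user to re-record when the clip is too short.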

Streaming Inference

XTTS-v2 supports streaming workflows in some configurations:

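import os
from huggingface_hub import snapshot_download
from TTS.tts.configs.xtts_config import XttsConfig
from TTS.tts.models.xtts import Xtts

# Download the checkpoint and load it with its own config
model_path = snapshot_download("coqui/XTTS-v2")
config = XttsConfig()
config.load_json(os.path.join(model_path, "config.json"))
model = Xtts.init_from_config(config)
model.load_checkpoint(config, checkpoint_dir=model_path)
model.cuda()  # streaming on CPU is possible but slow

# Compute the speaker conditioning once from the reference clip
gpt_cond_latent, speaker_embedding = model.get_conditioning_latents(audio_path=["reference.wav"])

for chunk in model.inference_stream("Long text...", "en", gpt_cond_latent, speaker_embedding):
    process_audio_chunk(chunk)  # your own playback or buffering function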

Cross-Lingual Cloning

The notable feature: use a reference voice and synthesize speech in another supported language:

tts.tts_to_file(
    text="Bonjour, je parle français avec une voix anglaise clonée.",
    speaker_wav="english_reference.wav",
    language="fr",
    file_path="xtts_crosslingual.wav",
)

The voice character can carry over, but accent and pronunciation quality vary with the reference audio, language, and model settings.

Supported Languages

Arabic, Chinese, Czech, Dutch, English, French, German, Hindi, Hungarian, Italian, Japanese, Korean, Polish, Portuguese, Russian, Spanish, Turkish.
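
The language argument in the examples takes a short code rather than a language name. The lookup below is a sketch mapping the languages listed above to the codes I understand XTTS-v2 to use; treat the table as an assumption and verify it against the model card, and note that the helper name xtts_language_code is hypothetical:

```python
# Assumed name-to-code table for the languages listed above; verify before relying on it.
XTTS_LANGUAGE_CODES = {
    "Arabic": "ar", "Chinese": "zh-cn", "Czech": "cs", "Dutch": "nl",
    "English": "en", "French": "fr", "German": "de", "Hindi": "hi",
    "Hungarian": "hu", "Italian": "it", "Japanese": "ja", "Korean": "ko",
    "Polish": "pl", "Portuguese": "pt", "Russian": "ru", "Spanish": "es",
    "Turkish": "tr",
}


def xtts_language_code(name: str) -> str:
    """Return the code for a language name, ignoring case and outer whitespace."""
    key = name.strip().title()
    if key not in XTTS_LANGUAGE_CODES:
        raise ValueError(f"Language not in the supported list: {name!r}")
    return XTTS_LANGUAGE_CODES[key]


print(xtts_language_code("french"))
```

Failing early on an unknown language is cheaper than discovering the problem after the model has loaded.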

System Requirements

  • GPU minimum: depends on model version and runtime
  • GPU recommended: test on your target hardware
  • CPU: possible in some setups, usually very slow
  • RAM: depends on model version, batch size, and reference audio

When to Use XTTS-v2

XTTS-v2 is worth evaluating when you need local voice cloning without fine-tuning. The tradeoffs are significant: it is heavier than small TTS models, GPU support matters, and the Coqui ecosystem has shifted to community maintenance.

Quick Reference Comparison

Feature        Kokoro 82M                Orpheus 3B                Piper                   XTTS-v2
Min GPU VRAM   CPU-friendly              Depends on quantization   CPU-friendly            GPU recommended
CPU Inference  Yes                       Possible in some runners  Yes                     Possible but slow
Voice Cloning  Common workflows: no      Workflow-dependent        No                      Yes
Languages      Check current model card  English-focused examples  Broad catalog           Multiple languages
License        Review current license    Review current license    Review current license  Review current license
Model Size     Small                     Large                     Small to medium         Medium to large
Quality        Strong for size           Expressive                Voice-dependent         Strong cloning workflows
Good Fit       CPU / edge                Expressive dialogue       Speed / RPi             Voice cloning

Performance Considerations

Performance depends on hardware, quantization, runtime, text length, and whether the model is already warm. Use the table below as a testing plan rather than a benchmark claim:

Model              What to Measure                             Expected Pattern
Kokoro (CPU)       First load, chunk speed, memory             Good lightweight baseline
Kokoro (ONNX CPU)  Runtime setup, quantization quality         Smaller deployment surface
Orpheus (GPU)      Startup time, memory, tag behavior          Stronger expression, heavier runtime
Orpheus (Q4 CPU)   CPU feasibility                             Slower, useful for experiments
Piper (CPU)        Latency, streaming, voice quality           Fast CPU path
XTTS-v2 (GPU)      Reference quality, first chunk, total time  Useful cloning, heavier runtime

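Real-Time Factor = audio duration / wall time, so values above 1.0 mean faster than real time. Some papers and tools define RTF as the inverse (wall time / audio duration), where values below 1.0 are faster than real time, so check which convention a benchmark uses before comparing numbers.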
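
To collect this number for your own hardware, you can time a synthesis call and divide the audio length by the elapsed time. The sketch below follows this guide's convention (audio duration / wall time). The function names are hypothetical, and the synthesize callable is a stand-in you would replace with a wrapper around your model that returns the number of samples it produced:

```python
import time
from typing import Callable, Tuple


def real_time_factor(audio_seconds: float, wall_seconds: float) -> float:
    """Audio duration divided by wall time; above 1.0 is faster than real time."""
    if wall_seconds <= 0:
        raise ValueError("wall_seconds must be positive")
    return audio_seconds / wall_seconds


def measure(synthesize: Callable[[], int], sample_rate: int) -> Tuple[float, float]:
    """Run synthesize() once and return (wall_seconds, real_time_factor)."""
    start = time.perf_counter()
    num_samples = synthesize()
    wall = time.perf_counter() - start
    return wall, real_time_factor(num_samples / sample_rate, wall)


def fake_synthesize() -> int:
    """Stand-in for a model call: pretends to produce 2 seconds of 24 kHz audio."""
    time.sleep(0.05)
    return 48000


wall, rtf = measure(fake_synthesize, sample_rate=24000)
print(f"wall time {wall:.3f}s, real-time factor {rtf:.1f}")
```

Measure a warm model separately from the first call, since load time and caching can dominate the first run.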

Building a Real Application

A production local TTS app needs more than model inference. Here is what the surrounding system should handle:

Text Preprocessing

Raw text contains punctuation, numbers, URLs, code identifiers, and Markdown syntax. Strip or normalize before sending to the model:

import re

def preprocess_text(text: str) -> str:
    """Normalize text for TTS input."""
    # Strip Markdown links
    text = re.sub(r"\[([^\]]+)\]\([^)]+\)", r"\1", text)
    # Remove code blocks
    text = re.sub(r"```.*?```", "", text, flags=re.DOTALL)
    # Normalize whitespace
    text = re.sub(r"\s+", " ", text).strip()
    return text
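
The function above leaves bare URLs and inline code spans in place, and both tend to be read aloud character by character. A possible extension is sketched below; the helper name and the choice to replace URLs with the word "link" are assumptions, so adjust them to what sounds right for your content:

```python
import re


def strip_urls_and_inline_code(text: str) -> str:
    """Replace bare URLs with a spoken placeholder and unwrap inline code spans."""
    # Replace bare URLs with a short word the model can pronounce
    text = re.sub(r"https?://\S+", "link", text)
    # Unwrap inline code spans, keeping their contents
    text = re.sub(r"`([^`]*)`", r"\1", text)
    # Normalize whitespace left behind by the substitutions
    return re.sub(r"\s+", " ", text).strip()


print(strip_urls_and_inline_code("See https://example.com/docs for `pip install x`."))
```

Run it after preprocess_text so that Markdown links have already been reduced to their link text.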

Chunking Large Documents

Models have token limits. Split text at natural boundaries:

import re

def chunk_text(text: str, max_chars: int = 2000) -> list[str]:
    """Split text into TTS-friendly chunks."""
    sentences = re.split(r"(?<=[.!?])\s+", text)
    chunks, current = [], []
    length = 0

    for sentence in sentences:
        if length + len(sentence) > max_chars and current:
            chunks.append(" ".join(current))
            current, length = [], 0
        current.append(sentence)
        length += len(sentence)

    if current:
        chunks.append(" ".join(current))
    return chunks
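
Note that chunk_text keeps a single sentence longer than max_chars as one oversized chunk. A fallback that splits such a sentence on word boundaries might look like the sketch below; the helper name split_long_sentence is hypothetical, and a single word longer than max_chars is still kept whole:

```python
def split_long_sentence(sentence: str, max_chars: int = 2000) -> list[str]:
    """Split one over-long sentence into pieces of at most max_chars, on spaces."""
    pieces, current = [], ""
    for word in sentence.split():
        candidate = f"{current} {word}" if current else word
        if len(candidate) > max_chars and current:
            # Adding this word would overflow, so close the current piece
            pieces.append(current)
            current = word
        else:
            current = candidate
    if current:
        pieces.append(current)
    return pieces


print(split_long_sentence("one two three four five six", max_chars=10))
```

Word-boundary splits can produce slightly unnatural pauses, so prefer sentence boundaries where they exist and use this only for outliers.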

Audio Concatenation

Join chunked outputs into a single file:

import numpy as np
import soundfile as sf

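def concatenate_audio(audio_files: list[str], output: str, sr: int = 24000):
    """Merge multiple WAV files into one. All inputs must use sample rate sr."""
    parts = []
    for path in audio_files:
        data, file_sr = sf.read(path)
        # Mixing sample rates would change pitch and speed, so fail early
        if file_sr != sr:
            raise ValueError(f"{path} is {file_sr} Hz, expected {sr} Hz")
        parts.append(data)
    sf.write(output, np.concatenate(parts), sr)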

Queue Management

For batch processing, implement a simple queue with progress tracking:

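import threading
from dataclasses import dataclass
from queue import Queue
import soundfile as sf

@dataclass
class TTSJob:
    text: str
    voice: str
    output_path: str

class TTSQueue:
    def __init__(self, pipeline):
        self.queue = Queue()
        self.pipeline = pipeline

    def start(self):
        """Run the worker in a background daemon thread."""
        threading.Thread(target=self.worker, daemon=True).start()

    def worker(self):
        while True:
            job = self.queue.get()
            try:
                generator = self.pipeline(job.text, voice=job.voice)
                for i, (_, _, audio) in enumerate(generator):
                    sf.write(f"{job.output_path}_chunk_{i}.wav", audio, 24000)
                # Simple progress report: finished job and remaining queue size
                print(f"Done: {job.output_path} ({self.queue.qsize()} jobs left)")
            except Exception as exc:
                print(f"Failed: {job.output_path}: {exc}")
            finally:
                # Always mark the job finished so queue.join() cannot hang
                self.queue.task_done()

    def submit(self, job: TTSJob):
        self.queue.put(job)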

Which Model Should You Start With?

The answer depends on your hardware and use case:

  • You have a typical laptop and want to try TTS quickly: Start with Kokoro. The install is simple, the code is short, and it is a practical lightweight baseline.

  • You have a modern GPU and want expressive output: Try Orpheus. The emotion-style tags are worth testing if delivery matters.

  • You need broad language coverage or fast CPU inference: Try Piper. It may not be the most natural-sounding model for every use case, but it is a strong speed-oriented baseline.

  • You need voice cloning for a specific voice: Evaluate XTTS-v2 and other current cloning workflows. Plan for consent, licensing, hardware requirements, and larger model downloads.

If you want local TTS on Mac without managing Python dependencies, Spokio is a native Mac app powered by Chatterbox Turbo. It supports local voice cloning, batch export, and MP3/WAV/AIFF/M4A output, runs on Apple Silicon and Intel Macs, and does not upload text, audio, or voice samples to the cloud. But if you want to understand how models work under the hood, running them yourself in Python is a good way to learn.
