Local TTS with Python: A Practical Guide to Open-Source Speech Models

If you have been on X recently, you have seen the same pattern: developers excited about running text-to-speech on their own machines. The models are good enough now. The tooling is catching up.

But when you actually sit down to make it work, the landscape is fragmented. There is Kokoro (82M, lightweight), Orpheus (3B, expressive, heavier), Piper (ONNX, built for speed), and XTTS-v2 (voice cloning workflows). Each has different dependencies, different APIs, and different sharp edges.

This guide walks through each one with install commands, Python snippets, and practical tradeoffs.

Prerequisites

Kokoro, Orpheus, and XTTS-v2 run on PyTorch and share a common base (Piper uses ONNX Runtime instead and does not need it):

pip install torch torchaudio soundfile

Three of the four also need phonemization support:

# macOS
brew install espeak-ng

# Linux
sudo apt-get install espeak-ng

# Windows
# Download from https://github.com/espeak-ng/espeak-ng/releases

Set up a virtual environment before anything else:

python -m venv tts-env
source tts-env/bin/activate
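
Before installing any model package, it can help to confirm that the prerequisites are visible from inside the virtual environment. The sketch below uses only the standard library. The function name check_prerequisites is hypothetical, and it only checks that espeak-ng is on PATH and that the base packages can be located, not that they work.

```python
import importlib.util
import shutil
import sys


def check_prerequisites(modules=("torch", "torchaudio", "soundfile")):
    """Report which prerequisites are visible from this interpreter."""
    report = {
        # True when running inside a venv (prefix differs from the base install)
        "in_venv": sys.prefix != getattr(sys, "base_prefix", sys.prefix),
        # Path to the espeak-ng binary, or None if it is not on PATH
        "espeak-ng": shutil.which("espeak-ng"),
    }
    for name in modules:
        # find_spec locates a top-level package without importing it
        report[name] = importlib.util.find_spec(name) is not None
    return report


if __name__ == "__main__":
    for item, status in check_prerequisites().items():
        print(f"{item}: {status}")
```

A None next to espeak-ng or a False next to a package name points at the install step to repeat.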

1. Kokoro — Lightweight Local TTS

Kokoro helped change the conversation around local TTS. At 82 million parameters, it is small enough for practical local experiments and has ranked competitively in community TTS evaluations. Review the current model card and license before using it in a product.

The install is minimal:

pip install kokoro soundfile

The inference code is only a few lines:

from kokoro import KPipeline
import soundfile as sf

pipeline = KPipeline(lang_code="a")  # 'a' = American English
generator = pipeline("Hello world, this is Kokoro speaking.", voice="af_heart")

for i, (_, _, audio) in enumerate(generator):
    sf.write(f"kokoro_output_{i}.wav", audio, 24000)

That is it. The KPipeline handles text normalization, grapheme-to-phoneme conversion via the Misaki G2P engine, chunking, and streaming generation.

Available Voices

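Kokoro ships with voice presets as single .pt files. The prefix of the voice string encodes language and gender: in af_heart, a is American English (matching lang_code) and f marks a female voice, followed by the voice name: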

voices = ["af_heart", "af_bella", "af_nicole", "am_adam", "am_michael"]
for voice in voices:
    generator = pipeline("Testing voice quality.", voice=voice)
    for i, (_, _, audio) in enumerate(generator):
        sf.write(f"kokoro_{voice}_{i}.wav", audio, 24000)

Voice availability changes by release and wrapper. Check the current model card for supported voices and languages before building a product around a specific catalog.
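
Kokoro voice IDs appear to pack two one-letter codes into the prefix: the first letter matches the pipeline's lang_code and the second marks the voice's gender (af_heart is American English, female). The helper below is a sketch under that assumption. parse_voice_id and matches_pipeline are hypothetical names, and the main use is catching a voice that does not match the pipeline language before synthesis starts.

```python
from typing import NamedTuple


class VoiceId(NamedTuple):
    lang_code: str  # first prefix letter, e.g. "a"
    gender: str     # second prefix letter, e.g. "f" or "m"
    name: str       # everything after the underscore


def parse_voice_id(voice: str) -> VoiceId:
    """Split an ID such as 'af_heart' into its assumed parts."""
    prefix, sep, name = voice.partition("_")
    if len(prefix) != 2 or not sep or not name:
        raise ValueError(f"unexpected voice ID format: {voice!r}")
    return VoiceId(lang_code=prefix[0], gender=prefix[1], name=name)


def matches_pipeline(voice: str, lang_code: str) -> bool:
    """True when the voice prefix agrees with the pipeline's lang_code."""
    return parse_voice_id(voice).lang_code == lang_code
```

For example, matches_pipeline("am_adam", "a") is True for the American English pipeline created earlier, while a voice with a different first letter would be flagged.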

Long Text and Chunking

KPipeline chunks long input on its own, and it helps to know what it yields. The model has a context limit, so the pipeline splits longer text at natural boundaries and returns one (graphemes, phonemes, audio) tuple per chunk:

long_text = "This sentence stands in for a long article. " * 200  # Placeholder text

generator = pipeline(long_text, voice="af_heart")
for i, (graphemes, phonemes, audio) in enumerate(generator):
    print(f"Chunk {i}: {len(graphemes)} chars → {len(phonemes)} phonemes")
    sf.write(f"kokoro_chunk_{i}.wav", audio, 24000)

Each chunk gets its own WAV file. In a real app, you would concatenate the audio arrays with NumPy before writing a single file, or join the files afterwards with a tool such as pydub.

ONNX for Even Smaller Footprint

If you need to minimize dependencies or run in a browser-like environment, ONNX variants can reduce deployment friction:

pip install kokoro-onnx

from kokoro_onnx import Kokoro
import soundfile as sf

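# Model and voices file names vary by kokoro-onnx release, and so do the
# available voices; use the files and voice names from the release you downloaded
kokoro = Kokoro("kokoro-v0_19.onnx", "voices.json")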
audio, sr = kokoro.create("Hello from ONNX.", voice="af_heart", speed=1.0)
sf.write("kokoro_onnx_output.wav", audio, sr)

The tradeoff: you may need to manage model files, voice files, quantization choices, and runtime compatibility directly instead of relying on pipeline defaults.

When to Use Kokoro

Kokoro is a good starting point when you want lightweight local TTS with a simple Python path. It can be fast enough for batch processing on CPU and small enough to consider for desktop packaging. The tradeoff is that the common Kokoro workflows focus on preset voices rather than voice cloning, and the compact model limits prosodic variety compared to larger architectures.

2. Orpheus — Expressive, Heavier TTS

Orpheus TTS by Canopy Labs is a 3-billion-parameter model built on a Llama backbone. It is designed for expressive speech, including laughter, sighs, and emotional inflections in supported workflows.

The key feature that sets Orpheus apart is emotion-style tags: supported workflows can embed <giggle>, <laugh>, <sigh>, and other markers directly in text.

Installation via Pip

pip install orpheus-speech

Basic Inference

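import wave
from orpheus_tts import OrpheusModel

# Argument names follow the upstream README; verify against your installed version
model = OrpheusModel(model_name="canopylabs/orpheus-3b-0.1-ft")
# generate_speech streams raw 16-bit PCM chunks rather than returning one array
chunks = model.generate_speech(
    prompt="I <giggle> can't believe this actually works on my laptop.",
    voice="tara",
)

with wave.open("orpheus_output.wav", "wb") as wav:
    wav.setnchannels(1)
    wav.setsampwidth(2)
    wav.setframerate(24000)
    for chunk in chunks:
        wav.writeframes(chunk)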

The model downloads weights from Hugging Face on first run, and larger checkpoints can take several gigabytes.

Quantized for Lower VRAM

For lower-memory systems, evaluate a quantized GGUF version via LM Studio or a community local runner:

git clone https://github.com/isaiahbjork/orpheus-tts-local
cd orpheus-tts-local
pip install -r requirements.txt
python gguf_orpheus.py --text "Hello from a quantized model." --voice tara

Quantized GGUF variants can reduce memory requirements and may run on CPU if needed, but performance depends heavily on hardware and runner.

Emotion Tag Reference

Tag	Effect
`<giggle>`	Light laughter in speech
`<laugh>`	Full laughter
`<sigh>`	Audible sigh
`<cough>`	Cough interjection
`<sniffle>`	Sniffle
`<groan>`	Groan
`<yawn>`	Yawn

Upstream examples insert these tags inline as standalone markers, without closing tags.

Tag behavior depends on the model, runner, and prompt style. Test the exact tags you plan to expose before building UI around them.
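
One way to act on that advice is to check user text against an allowlist of tags you have actually tested before it reaches the model. The sketch below is a minimal version of that idea. find_unknown_tags and strip_tags are hypothetical helpers, and the allowlist passed in is whatever set you have verified with your own runner.

```python
import re

# Matches tags such as <laugh> or </laugh>; group 1 is the bare tag name
TAG_PATTERN = re.compile(r"</?([a-zA-Z_]+)>")


def find_unknown_tags(text: str, allowed: set[str]) -> list[str]:
    """Return tag names that appear in text but are not in the allowlist."""
    unknown = []
    for match in TAG_PATTERN.finditer(text):
        name = match.group(1).lower()
        if name not in allowed and name not in unknown:
            unknown.append(name)
    return unknown


def strip_tags(text: str) -> str:
    """Remove all tags, for runners or fallback models that would read them aloud."""
    return re.sub(r"\s+", " ", TAG_PATTERN.sub("", text)).strip()
```

A UI could reject input when find_unknown_tags returns anything, or call strip_tags before sending the same text to a model that does not understand the markers.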

Available Voices

Orpheus examples commonly reference voices such as tara, jess, leo, dan, mia, zac, zoe, and sam. Some workflows support reference-audio voice cloning, though quality depends heavily on the reference matching the target speaking style.

System Requirements

GPU minimum: depends on quantization and runner
GPU recommended: test FP16 workflows on your target hardware
CPU inference: possible in some GGUF workflows, usually much slower
RAM: depends on quantization, model size, and context length

When to Use Orpheus

Orpheus is for projects where expressive delivery is a priority and you have enough hardware headroom. The emotion-style tags can give you more delivery control than many smaller models. The tradeoff: it is heavier, English-focused in common examples, and the 3B parameter model has a slower startup time than smaller alternatives.

3. Piper — Fast CPU Inference

Piper is built for fast local TTS. It runs on ONNX Runtime and targets resource-constrained hardware such as the Raspberry Pi, and it is also used in Home Assistant voice setups. Voice quality depends on the chosen voice and quality tier, with speed as the main advantage.

Installation

pip install piper-tts

Downloading a Voice Model

Piper distributes voices as paired .onnx and .onnx.json files:

wget https://huggingface.co/rhasspy/piper-voices/resolve/v1.0.0/en/en_US/amy/medium/en_US-amy-medium.onnx
wget https://huggingface.co/rhasspy/piper-voices/resolve/v1.0.0/en/en_US/amy/medium/en_US-amy-medium.onnx.json

Inference

import wave
from piper import PiperVoice

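voice = PiperVoice.load("en_US-amy-medium.onnx")

# Method names differ across piper-tts releases (newer ones expose
# synthesize_wav for this step); check the version you installed
with wave.open("piper_output.wav", "wb") as wav:
    voice.synthesize("Piper is fast because it runs on ONNX.", wav)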

Streaming Audio

Piper supports chunked generation for real-time applications. The method name below comes from the piper-tts 1.2.x API, so check the release you installed:

audio_stream = voice.synthesize_stream_raw("Long text that streams in real time.")

for chunk in audio_stream:
    # Each chunk is raw PCM audio bytes; process_audio_chunk is a placeholder
    # for your own playback or network code
    process_audio_chunk(chunk)
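
If the goal is simply to capture a raw PCM stream to disk, the chunks can be written into a WAV container as they arrive. The sketch below assumes 16-bit mono PCM chunks. write_pcm_stream is a hypothetical helper, and the 22050 Hz default is an assumption that should be replaced with the sample rate listed in the voice's .onnx.json file.

```python
import wave
from typing import Iterable


def write_pcm_stream(chunks: Iterable[bytes], path: str,
                     sample_rate: int = 22050) -> float:
    """Write raw 16-bit mono PCM chunks to a WAV file; return seconds written."""
    total_bytes = 0
    with wave.open(path, "wb") as wav:
        wav.setnchannels(1)       # mono
        wav.setsampwidth(2)       # 2 bytes per sample = 16-bit
        wav.setframerate(sample_rate)
        for chunk in chunks:
            wav.writeframes(chunk)
            total_bytes += len(chunk)
    # Two bytes per sample, one channel
    return total_bytes / (2 * sample_rate)
```

Because the function accepts any iterable of bytes, the same code works for a live generator or a list of buffered chunks.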

Voice Catalog

Piper offers voices in multiple quality tiers across many languages:

Low (~10MB): Fastest, lowest quality
Medium (~50MB): Balanced speed and quality
High (~200MB): Highest tier in the catalog, still CPU-oriented

Browse the full catalog at huggingface.co/rhasspy/piper-voices.

When to Use Piper

Piper is useful for embedded systems, real-time streaming where latency matters more than absolute quality, and applications needing broad language coverage without large model downloads. If you need highly natural voices for user-facing content, compare Piper directly against Kokoro, Orpheus, and commercial options with your own text.

4. XTTS-v2 — Voice Cloning from 3 Seconds

Coqui TTS shut down in 2024, but the Idiap Research Institute maintains a community fork. XTTS-v2 remains a common option for local voice cloning workflows; it can use short reference samples and supports multiple languages.

Installation

pip install coqui-tts

(Optional: add torch and torchaudio if they are not already installed.)

Basic Inference

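import torch
from TTS.api import TTS

# Fall back to CPU when no CUDA GPU is available (expect it to be much slower)
device = "cuda" if torch.cuda.is_available() else "cpu"
tts = TTS("tts_models/multilingual/multi-dataset/xtts_v2").to(device)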

tts.tts_to_file(
    text="This voice was cloned from a three second sample.",
    speaker_wav="reference.wav",  # 3-10 second reference audio
    language="en",
    file_path="xtts_output.wav",
)

Voice Cloning from a Microphone

To clone from a fresh recording, capture a short sample from the microphone first (this needs pip install sounddevice):

import sounddevice as sd
import soundfile as sf
from TTS.api import TTS

# Record 5 seconds from mic
audio = sd.rec(int(5 * 24000), samplerate=24000, channels=1)
sd.wait()
sf.write("mic_ref.wav", audio, 24000)

# Clone voice
tts = TTS("tts_models/multilingual/multi-dataset/xtts_v2").to("cuda")
tts.tts_to_file(
    text="I just cloned my own voice from a microphone recording.",
    speaker_wav="mic_ref.wav",
    language="en",
    file_path="xtts_mic_clone.wav",
)
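
Reference clips that are too short or too long are a common source of poor clones, so it is worth checking the length before running the model. The sketch below reads the duration with the standard library and compares it with the 3 to 10 second range suggested above. reference_duration and check_reference are hypothetical helpers, and the code assumes a PCM WAV file, which the wave module can read.

```python
import wave


def reference_duration(path: str) -> float:
    """Return the length in seconds of a PCM WAV file."""
    with wave.open(path, "rb") as wav:
        return wav.getnframes() / wav.getframerate()


def check_reference(path: str, min_s: float = 3.0, max_s: float = 10.0) -> bool:
    """True when the clip length falls inside the suggested range."""
    return min_s <= reference_duration(path) <= max_s
```

Calling check_reference("mic_ref.wav") before tts_to_file gives an early warning when a recording was cut short.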

Streaming Inference

XTTS-v2 supports streaming workflows in some configurations:

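import os
from huggingface_hub import snapshot_download
from TTS.tts.configs.xtts_config import XttsConfig
from TTS.tts.models.xtts import Xtts

model_path = snapshot_download("coqui/XTTS-v2")
config = XttsConfig()
config.load_json(os.path.join(model_path, "config.json"))
model = Xtts.init_from_config(config)
model.load_checkpoint(config, checkpoint_dir=model_path)
model.cuda()

# inference_stream takes conditioning latents computed from the reference clip
gpt_cond_latent, speaker_embedding = model.get_conditioning_latents(audio_path=["reference.wav"])
for chunk in model.inference_stream("Long text...", "en", gpt_cond_latent, speaker_embedding):
    process_audio_chunk(chunk)  # placeholder for your playback code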

Cross-Lingual Cloning

A notable feature is cross-lingual cloning: take a reference voice recorded in one language and synthesize speech in another supported language:

tts.tts_to_file(
    text="Bonjour, je parle français avec une voix anglaise clonée.",
    speaker_wav="english_reference.wav",
    language="fr",
    file_path="xtts_crosslingual.wav",
)

The voice character can carry over, but accent and pronunciation quality vary with the reference audio, language, and model settings.

Supported Languages

Arabic, Chinese, Czech, Dutch, English, French, German, Hindi, Hungarian, Italian, Japanese, Korean, Polish, Portuguese, Russian, Spanish, Turkish.
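
The language argument in the calls above takes a short code rather than a language name. The mapping below is a sketch of how the listed languages might be translated to codes. The codes are believed to match the XTTS-v2 model card (note zh-cn for Chinese), but verify them against the version you install. language_code is a hypothetical helper.

```python
XTTS_LANGUAGE_CODES = {
    "Arabic": "ar", "Chinese": "zh-cn", "Czech": "cs", "Dutch": "nl",
    "English": "en", "French": "fr", "German": "de", "Hindi": "hi",
    "Hungarian": "hu", "Italian": "it", "Japanese": "ja", "Korean": "ko",
    "Polish": "pl", "Portuguese": "pt", "Russian": "ru", "Spanish": "es",
    "Turkish": "tr",
}


def language_code(name: str) -> str:
    """Look up the code for a language name, case-insensitively."""
    wanted = name.strip().lower()
    for language, code in XTTS_LANGUAGE_CODES.items():
        if language.lower() == wanted:
            return code
    raise ValueError(f"{name!r} is not in the supported-language list")
```

Validating the name up front turns an unsupported language into a clear error before the model is loaded.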

System Requirements

GPU minimum: depends on model version and runtime
GPU recommended: test on your target hardware
CPU: possible in some setups, usually very slow
RAM: depends on model version, batch size, and reference audio

When to Use XTTS-v2

XTTS-v2 is worth evaluating when you need local voice cloning without fine-tuning. The tradeoffs are significant: it is heavier than small TTS models, GPU support matters, and the Coqui ecosystem has shifted to community maintenance.

Quick Reference Comparison

Feature	Kokoro 82M	Orpheus 3B	Piper	XTTS-v2
Min GPU VRAM	CPU-friendly	Depends on quantization	CPU-friendly	GPU recommended
CPU Inference	Yes	Possible in some runners	Yes	Possible but slow
Voice Cloning	Common workflows: no	Workflow-dependent	No	Yes
Languages	Check current model card	English-focused examples	Broad catalog	Multiple languages
License	Review current license	Review current license	Review current license	Review current license
Model Size	Small	Large	Small to medium	Medium to large
Quality	Strong for size	Expressive	Voice-dependent	Strong cloning workflows
Good Fit	CPU / edge	Expressive dialogue	Speed / RPi	Voice cloning

Performance Considerations

Performance depends on hardware, quantization, runtime, text length, and whether the model is already warm. Use the table below as a testing plan rather than a benchmark claim:

Model	What to Measure	Expected Pattern
Kokoro (CPU)	First load, chunk speed, memory	Good lightweight baseline
Kokoro (ONNX CPU)	Runtime setup, quantization quality	Smaller deployment surface
Orpheus (GPU)	Startup time, memory, tag behavior	Stronger expression, heavier runtime
Orpheus (Q4 CPU)	CPU feasibility	Slower, useful for experiments
Piper (CPU)	Latency, streaming, voice quality	Fast CPU path
XTTS-v2 (GPU)	Reference quality, first chunk, total time	Useful cloning, heavier runtime

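Real-Time Factor here is audio duration / wall time, so a value above 1.0 means faster than real time. Some papers and benchmarks define it the other way around (wall time / audio duration, where lower is better), so check the definition before comparing numbers.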
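
To make that ratio concrete, the sketch below times a single synthesis call and divides the audio duration by the wall time, following the definition used in this guide. real_time_factor is a hypothetical helper, and synthesize stands for any callable you write that runs a model and returns the number of audio samples produced.

```python
import time
from typing import Callable


def real_time_factor(synthesize: Callable[[str], int], text: str,
                     sample_rate: int) -> float:
    """Time one call and return audio duration divided by wall time.

    synthesize is a caller-supplied function that returns the number of
    audio samples it produced for text.
    """
    start = time.perf_counter()
    num_samples = synthesize(text)
    wall_time = time.perf_counter() - start
    audio_duration = num_samples / sample_rate
    return audio_duration / wall_time
```

With Kokoro, synthesize could be a small wrapper that sums len(audio) over the chunks the pipeline yields. Run it once to warm the model up and measure the second call, since the first call often includes model loading.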

Building a Real Application

A production local TTS app needs more than model inference. Here is what the surrounding system should handle:

Text Preprocessing

Raw text contains punctuation, numbers, URLs, code identifiers, and Markdown syntax. Strip or normalize before sending to the model:

import re

def preprocess_text(text: str) -> str:
    """Normalize text for TTS input."""
    # Strip Markdown links
    text = re.sub(r"\[([^\]]+)\]\([^)]+\)", r"\1", text)
    # Remove code blocks
    text = re.sub(r"```.*?```", "", text, flags=re.DOTALL)
    # Normalize whitespace
    text = re.sub(r"\s+", " ", text).strip()
    return text
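
The function above covers Markdown links and fenced code, but bare URLs, inline code spans, heading markers, and emphasis markers would still reach the model. The sketch below shows one way to handle those with a few more regular expressions. strip_extra_markup is a hypothetical helper, and the patterns are deliberately simple, so test them on your own content.

```python
import re


def strip_extra_markup(text: str) -> str:
    """Remove bare URLs and light Markdown syntax that TTS would read aloud."""
    # Drop bare URLs (http or https) up to the next whitespace
    text = re.sub(r"https?://\S+", "", text)
    # Keep the contents of inline code spans, drop the backticks
    text = re.sub(r"`([^`]*)`", r"\1", text)
    # Remove heading markers at the start of a line
    text = re.sub(r"^#{1,6}\s*", "", text, flags=re.MULTILINE)
    # Unwrap **bold** and *italic* when the markers hug the text,
    # which leaves arithmetic such as "2 * 3 * 4" alone
    text = re.sub(r"(\*\*|\*)(?=\S)(.+?)(?<=\S)\1", r"\2", text)
    # Collapse the whitespace left behind
    return re.sub(r"\s+", " ", text).strip()
```

Numbers, abbreviations, and code identifiers are left untouched here. Whether to spell them out depends on how the chosen model's own text normalization handles them.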

Chunking Large Documents

Models have token limits. Split text at natural boundaries:

import re

def chunk_text(text: str, max_chars: int = 2000) -> list[str]:
    """Split text into TTS-friendly chunks."""
    sentences = re.split(r"(?<=[.!?])\s+", text)
    chunks, current = [], []
    length = 0

    for sentence in sentences:
        if length + len(sentence) > max_chars and current:
            chunks.append(" ".join(current))
            current, length = [], 0
        current.append(sentence)
        length += len(sentence)

    if current:
        chunks.append(" ".join(current))
    return chunks
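
One gap in the function above is that a single sentence longer than max_chars is kept as one oversized chunk. A possible fallback is to break such a sentence at word boundaries, as in the sketch below. split_long_sentence is a hypothetical helper that could be applied to each sentence before chunking.

```python
def split_long_sentence(sentence: str, max_chars: int = 2000) -> list[str]:
    """Break one overlong sentence into pieces of at most max_chars."""
    pieces, current = [], ""
    for word in sentence.split():
        # A single word longer than the limit is hard-cut into slices
        while len(word) > max_chars:
            if current:
                pieces.append(current)
                current = ""
            pieces.append(word[:max_chars])
            word = word[max_chars:]
        candidate = f"{current} {word}".strip()
        if len(candidate) > max_chars and current:
            # Adding this word would overflow, so start a new piece
            pieces.append(current)
            current = word
        else:
            current = candidate
    if current:
        pieces.append(current)
    return pieces
```

Splitting mid-sentence can produce an audible break in prosody, so this is best kept as a last resort for text without usable punctuation.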

Audio Concatenation

Join chunked outputs into a single file:

import numpy as np
import soundfile as sf

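def concatenate_audio(audio_files: list[str], output: str, sr: int = 24000):
    """Merge multiple WAV files that share sample rate sr into one."""
    arrays = []
    for path in audio_files:
        data, file_sr = sf.read(path)
        # Mixing sample rates would change pitch and speed in the output
        if file_sr != sr:
            raise ValueError(f"{path} is {file_sr} Hz, expected {sr} Hz")
        arrays.append(data)
    sf.write(output, np.concatenate(arrays), sr)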
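
When the chunk files are plain 16-bit PCM WAVs with identical parameters, the join can also be done with the standard library alone, which makes it easy to add a short pause between chunks. The sketch below takes that approach. join_wavs and its gap_ms parameter are hypothetical, and the zero-byte silence is only correct for 16-bit audio.

```python
import wave


def join_wavs(paths: list[str], output: str, gap_ms: int = 0) -> None:
    """Concatenate PCM WAV files, optionally with silence between them."""
    if not paths:
        raise ValueError("no input files")
    with wave.open(paths[0], "rb") as first:
        params = first.getparams()
    expected = (params.nchannels, params.sampwidth, params.framerate)
    # Silence is a run of zero-valued frames (valid for 16-bit PCM)
    gap_frames = params.framerate * gap_ms // 1000
    gap = b"\x00" * (gap_frames * params.sampwidth * params.nchannels)
    with wave.open(output, "wb") as out:
        out.setnchannels(params.nchannels)
        out.setsampwidth(params.sampwidth)
        out.setframerate(params.framerate)
        for index, path in enumerate(paths):
            with wave.open(path, "rb") as wav:
                p = wav.getparams()
                if (p.nchannels, p.sampwidth, p.framerate) != expected:
                    raise ValueError(f"{path} has different audio parameters")
                if index > 0:
                    out.writeframes(gap)
                out.writeframes(wav.readframes(p.nframes))
```

A gap of 100 to 300 ms between chunks is a reasonable starting point to try, since chunk boundaries usually fall at sentence ends.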

Queue Management

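For batch processing, a simple job queue with a background worker is enough to start:

import threading
from dataclasses import dataclass
from queue import Queue
import soundfile as sf

@dataclass
class TTSJob:
    text: str
    voice: str
    output_path: str

class TTSQueue:
    def __init__(self, pipeline):
        self.queue = Queue()
        self.pipeline = pipeline
        # Run the worker in a daemon thread so submit() returns immediately
        threading.Thread(target=self.worker, daemon=True).start()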

    def worker(self):
        while True:
            job = self.queue.get()
            generator = self.pipeline(job.text, voice=job.voice)
            for i, (_, _, audio) in enumerate(generator):
                sf.write(
                    f"{job.output_path}_chunk_{i}.wav",
                    audio, 24000
                )
            self.queue.task_done()

    def submit(self, job: TTSJob):
        self.queue.put(job)
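
A queue like this usually also needs a clean shutdown, some record of progress, and protection against one bad job stopping the worker. The sketch below shows those pieces in isolation, with the model call replaced by a generic handler. JobRunner, its counters, and the sentinel object are hypothetical names used for illustration.

```python
import threading
from queue import Queue
from typing import Callable

_STOP = object()  # sentinel that tells the worker to exit


class JobRunner:
    """Run jobs one at a time on a background thread and count outcomes."""

    def __init__(self, handler: Callable[[object], None]):
        self.handler = handler
        self.jobs: Queue = Queue()
        self.done = 0
        self.failed = 0
        self._thread = threading.Thread(target=self._worker, daemon=True)
        self._thread.start()

    def _worker(self) -> None:
        while True:
            job = self.jobs.get()
            try:
                if job is _STOP:
                    return
                self.handler(job)
                self.done += 1
            except Exception:
                # Keep the worker alive if one job fails
                self.failed += 1
            finally:
                self.jobs.task_done()

    def submit(self, job: object) -> None:
        self.jobs.put(job)

    def close(self) -> None:
        """Finish queued jobs, then stop the worker thread."""
        self.jobs.put(_STOP)
        self._thread.join()
```

In a TTS app the handler would wrap the synthesis and file-writing loop, and the done and failed counters could drive a progress display. A single worker keeps one model instance busy without concurrent calls into it.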

Which Model Should You Start With?

The answer depends on your hardware and use case:

You have a typical laptop and want to try TTS quickly: Start with Kokoro. The install is simple, the code is short, and it is a practical lightweight baseline.
You have a modern GPU and want expressive output: Try Orpheus. The emotion-style tags are worth testing if delivery matters.
You need broad language coverage or fast CPU inference: Try Piper. It may not be the most natural-sounding model for every use case, but it is a strong speed-oriented baseline.
You need voice cloning for a specific voice: Evaluate XTTS-v2 and other current cloning workflows. Plan for consent, licensing, hardware requirements, and larger model downloads.

If you want local TTS on Mac without managing Python dependencies, Spokio is a native Mac app powered by Chatterbox Turbo. It supports local voice cloning, batch export, MP3/WAV/AIFF/M4A output, Apple Silicon and Intel Macs, and no cloud uploads for text, audio, or voice samples. But if you want to understand how models work under the hood, running them yourself in Python is a good way to learn.
