
Best Open Source TTS Models 2026: Ranking the Top 5

The five open-source TTS models worth your time in 2026 — Fish Audio S2 Pro, Chatterbox, Kokoro, Qwen3-TTS, and Orpheus — ranked by quality, speed, voice cloning, and real-world deployability.

Updated on May 21, 2026 · 12 min read

Two years ago, recommending open-source TTS meant managing expectations. “It sounds okay for a free model” was the ceiling.

In 2026, that conversation is changing fast. Open-source models now compete seriously with commercial APIs in public benchmarks and vendor-published blind preference tests. Kokoro reports a 4.5 MOS on benchmark evaluations. Chatterbox reports a 63.75% preference rate against ElevenLabs in Resemble-published comparisons. Fish Audio S2 Pro ranks highly on the TTS Arena leaderboard.

On TTS Arena, the gap between the strongest open-weight systems and leading proprietary systems has narrowed significantly. The exact ranking changes as new models and votes arrive, but the direction is clear: open TTS is no longer just a fallback.

Here are the five models that matter right now.

Quick Comparison

| Model | Params | VRAM | Voice Cloning | Languages | License | TTS Arena Elo |
|---|---|---|---|---|---|---|
| Fish Audio S2 Pro | ~4.4B | 8-16GB | Yes (3s clip) | 80+ | Research | 1129 |
| Chatterbox Turbo | 350M | 4-8GB | Yes (5-10s) | 23 (Multilingual variant) | MIT | 1006 |
| Kokoro-82M | 82M | 2-3GB | No (54 presets) | 11 | Apache 2.0 | 1056 |
| Qwen3-TTS | 600M | ~4GB | Yes (3s clip) | 10 | Apache 2.0 | |
| Orpheus | 3B | 8-12GB | No | EN | Llama 3.2 | |

1. Fish Audio S2 Pro — The Benchmark King

Fish Audio S2 Pro is the highest-ranked open-weight model on the independent TTS Arena leaderboard at Elo 1129 — ahead of several commercial APIs including OpenAI TTS-1 (Elo 1102) and Google’s Studio model (Elo 1062). It wins 61% of its matchups.
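
To read those Elo gaps as head-to-head odds, the standard Elo expected-score formula is a useful rough guide. The sketch below assumes TTS Arena uses the conventional 400-point logistic scale, which the leaderboard figures quoted here do not confirm, and the function name is purely illustrative.

```python
def elo_expected_score(rating_a: float, rating_b: float, scale: float = 400.0) -> float:
    """Expected score of A against B under the standard Elo model.

    Assumption: the conventional 400-point scale. A leaderboard may use a
    different scale or a different rating model altogether.
    """
    return 1.0 / (1.0 + 10.0 ** ((rating_b - rating_a) / scale))


# Ratings quoted in this article
S2_PRO = 1129
OPPONENTS = {"OpenAI TTS-1": 1102, "Google Studio": 1062}

for name, rating in OPPONENTS.items():
    p = elo_expected_score(S2_PRO, rating)
    print(f"S2 Pro vs {name}: {p:.1%} expected preference")
```

Under that assumption, a 27-point lead works out to roughly a 54% expected preference and a 67-point lead to roughly 60%. These are per-opponent estimates, so they are not directly comparable to the 61% overall win rate, which averages over every opponent the model has faced.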

What makes it stand out:

  • 80+ language support — the widest coverage of any open-source TTS model
  • Voice cloning from approximately 3 seconds of reference audio
  • Inline emotion tags for expressive control
  • Sits at rank 11 overall, bested only by closed-source systems

The catch: The license is complex. The code uses a Fish Audio Research License, and weights vary by model version. Commercial use requires careful reading of the terms. At 4.4B parameters, it also requires significant GPU memory for self-hosting.
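
A quick way to sanity-check the memory claim is to estimate the size of the weights alone. The sketch below is back-of-the-envelope arithmetic, assuming 2 bytes per parameter for fp16/bf16; it ignores activations, caches, and any vocoder or codec components, so treat the result as a lower bound rather than a VRAM requirement.

```python
# Bytes used to store one parameter at common precisions
BYTES_PER_PARAM = {"fp32": 4, "fp16": 2, "int8": 1}


def weight_memory_gb(num_params: float, dtype: str = "fp16") -> float:
    """Memory needed for the weights only, in decimal gigabytes."""
    return num_params * BYTES_PER_PARAM[dtype] / 1e9


# Parameter counts as listed in the comparison table
PARAMS = {
    "Fish Audio S2 Pro": 4.4e9,
    "Orpheus": 3e9,
    "Qwen3-TTS": 0.6e9,
    "Chatterbox Turbo": 350e6,
    "Kokoro-82M": 82e6,
}

for name, n in PARAMS.items():
    print(f"{name}: ~{weight_memory_gb(n):.2f} GB of weights in fp16")
```

At 4.4B parameters the fp16 weights alone come to about 8.8 GB, which is consistent with the 8-16GB range in the table once runtime overhead is added.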

Best for: Applications where audio quality is the single priority and you can navigate the licensing.

# Minimal usage pattern (illustrative: confirm class and method names against the Fish Speech repository)
from fish_speech import FishSpeech

model = FishSpeech.from_pretrained("fishaudio/s2-pro")
wav = model.generate("Your text here", reference_audio="voice_sample.wav")

2. Chatterbox — Best Voice Cloning (MIT)

Chatterbox by Resemble AI is the model that made people stop saying “ElevenLabs is unbeatable.” Resemble-published blind evaluations report a 63.75% preference rate for Chatterbox over ElevenLabs. It is notable for controllable emotion exaggeration — a single parameter (0.25-2.0) that dials expressiveness from monotone to dramatic.

Three variants:

| Variant | Params | Languages | Key Feature |
|---|---|---|---|
| Original | 500M | EN | Emotion exaggeration + CFG tuning |
| Multilingual | 500M | 23 | Zero-shot cloning across languages |
| Turbo | 350M | EN | Paralinguistic tags, sub-200ms latency |

Architecture highlights:

  • Modified Llama 3 backbone (500M params) for autoregressive speech token generation
  • Alignment-informed inference that detects and suppresses hallucinations in real time
  • Conditional flow matching decoder for token-to-mel conversion (distilled to 1 step in Turbo)
  • PerTh perceptual watermarking on every output
  • MIT license — no restrictions on commercial use

Best for: Voice cloning projects, production deployments needing emotion control, and anyone who wants ElevenLabs-quality output without the API bill.

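import torchaudio as ta
from chatterbox.tts import ChatterboxTTS

model = ChatterboxTTS.from_pretrained(device="cuda")
# Clone the voice in reference.wav; exaggeration scales expressiveness
wav = model.generate(
    "This is an open-source voice clone.",
    audio_prompt_path="reference.wav",
    exaggeration=0.7,
)
# Save the result at the model's native sample rate
ta.save("output.wav", wav, model.sr)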

3. Kokoro-82M — The Efficiency Champion

Kokoro is the model that redefined what “good enough” means. At 82 million parameters (about a quarter the size of Chatterbox Turbo and under 2% the size of Fish Audio S2 Pro), it achieves a 4.5 Mean Opinion Score and a 17% Character Error Rate (lower is better) on the Tricky TTS benchmark, beating every proprietary model tested in that evaluation.
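
For readers unfamiliar with the metric, Character Error Rate is the character-level edit distance between a reference text and a transcript, divided by the reference length. The sketch below shows that general definition; how the Tricky TTS evaluation transcribes audio and normalizes text is not described here, so this is not a reproduction of its pipeline.

```python
def levenshtein(ref: str, hyp: str) -> int:
    """Minimum number of character insertions, deletions, and substitutions."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # delete a character from ref
                            curr[j - 1] + 1,      # insert a character from hyp
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def cer(reference: str, hypothesis: str) -> float:
    """Character Error Rate: edit distance divided by reference length."""
    if not reference:
        raise ValueError("reference must be non-empty")
    return levenshtein(reference, hypothesis) / len(reference)


# Hypothetical example: the transcript of the synthesized audio drops one letter
print(f"{cer('read the book', 'red the book'):.1%}")  # 1 edit over 13 characters
```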

The numbers that matter:

  • 210x real-time on an RTX 4090
  • 36x real-time on a free T4 Colab GPU
  • 5x real-time on CPU
  • 54 built-in voices (no cloning needed)
  • Apache 2.0 license
  • Supports English, Japanese, Chinese, Korean, French, German, Italian, Portuguese, Spanish, Hindi, Russian
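
To put the real-time multipliers in practical terms, divide the audio duration by the factor. The sketch below applies the three figures above to a hypothetical 10-hour audiobook; it ignores model load time and text preprocessing, so real runs will take somewhat longer.

```python
def synthesis_seconds(audio_seconds: float, realtime_factor: float) -> float:
    """Wall-clock time to synthesize audio at a given real-time multiplier."""
    return audio_seconds / realtime_factor


AUDIO_SECONDS = 10 * 3600  # hypothetical 10-hour audiobook

# Real-time factors as quoted in the list above
FACTORS = {"RTX 4090": 210, "T4 (Colab)": 36, "CPU": 5}

for hardware, factor in FACTORS.items():
    minutes = synthesis_seconds(AUDIO_SECONDS, factor) / 60
    print(f"{hardware}: about {minutes:.0f} minutes")
```

That works out to roughly 3 minutes on the RTX 4090, 17 minutes on the T4, and 2 hours on CPU.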

Architecture: Built on StyleTTS 2 and ISTFTNet — no diffusion, no autoregressive loop, just a single forward pass. This is why it is fast enough to run on a laptop without a GPU.

The tradeoff: No voice cloning. You get 54 preset voices. For many use cases that is enough, but if you need custom voices, Chatterbox or Qwen3-TTS are better choices.

Best for: Desktop apps, CPU deployments, real-time applications, and anyone who needs production-quality speech with minimal hardware.

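import soundfile as sf
from kokoro import KPipeline

pipeline = KPipeline(lang_code="a")  # "a" selects American English
generator = pipeline("Hello, this is Kokoro speaking.", voice="af_heart")
# The pipeline yields one audio segment per text chunk; write each to its own file
for i, (_, _, audio) in enumerate(generator):
    sf.write(f"output_{i}.wav", audio, 24000)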

4. Qwen3-TTS — The Multilingual Newcomer

Qwen3-TTS by Alibaba was released in early 2026 and immediately became a serious contender. At 600M parameters with Apache 2.0 licensing, it reports low-latency streaming, support for 10 languages, and zero-shot voice cloning from a short reference clip.

Key specs:

  • 0.6B and 1.7B model sizes
  • 10 languages: Chinese, English, Japanese, Korean, German, French, Russian, Portuguese, Spanish, Italian
  • Reported low time-to-first-audio for streaming
  • Instruction-based emotion control
  • Apache 2.0 on both code and weights
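
Time-to-first-audio is worth measuring on your own hardware rather than taking from a model card. The sketch below times any iterator of audio chunks; `fake_stream` is a hypothetical stand-in for a streaming TTS call and does not reflect the Qwen3-TTS API.

```python
import time
from typing import Iterable, Iterator, Optional, Tuple


def measure_stream(chunks: Iterable[bytes]) -> Tuple[Optional[float], float, int]:
    """Return (seconds to first chunk, total seconds, chunk count)."""
    start = time.perf_counter()
    first: Optional[float] = None
    count = 0
    for _ in chunks:
        if first is None:
            first = time.perf_counter() - start
        count += 1
    return first, time.perf_counter() - start, count


def fake_stream(n_chunks: int = 3, delay: float = 0.05) -> Iterator[bytes]:
    """Hypothetical stand-in for a streaming TTS call."""
    for _ in range(n_chunks):
        time.sleep(delay)      # simulated generation time per chunk
        yield b"\x00" * 4800   # placeholder audio bytes


first, total, count = measure_stream(fake_stream())
print(f"first chunk after {first * 1000:.0f} ms, {count} chunks in {total * 1000:.0f} ms")
```

For a voice agent, the first number matters more than the total, because playback can begin as soon as the first chunk arrives.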

Where it shines: Multilingual voice cloning with a permissive open-release license. Qwen3-TTS is appealing for commercial multilingual deployments, but teams should still verify the current repository, model card, and license terms before shipping.

Best for: Multilingual applications, streaming voice agents, and teams that need a permissive license covering both code and weights.

from qwen_tts import Qwen3TTS

model = Qwen3TTS.from_pretrained("Qwen/Qwen3-TTS-0.6B")
wav = model.generate(
    "Bonjour, ceci est un test multilingue.",
    reference_audio="sample.wav",
    language="fr",
)

5. Orpheus — Most Expressive (Llama 3.2)

Orpheus by Canopy Labs is built on Meta’s Llama 3.2 architecture and uses SNAC audio tokens for high-fidelity output. It scores 4.2 MOS and 21% CER on the Tricky TTS benchmark, with particularly strong prosody and emotional range.

What makes it different:

  • Native emotion tags for controlling delivery
  • Llama 3.2 backbone — benefits from the broader Llama ecosystem
  • Smooth, realistic audio with good handling of paralinguistic content
  • 3B parameter model

The tradeoff: It requires a GPU (8-12GB VRAM) and “Built with Llama” attribution, and it has no built-in voice cloning. Fine-tuning is possible but requires expertise.

Best for: Narrative content, audiobooks, and applications where emotional delivery matters more than voice flexibility.

from orpheus import OrpheusTTS

model = OrpheusTTS.from_pretrained("orpheus-3b")
wav = model.generate(
    "She hesitated, then whispered: <whisper>I shouldn't be here.</whisper>",
    emotion="anxious",
)

Honorable Mentions

These models did not make the top five but are worth knowing about:

  • CosyVoice 3.0 (Alibaba) — Apache 2.0, 0.5B, excellent cross-lingual voice cloning, 9 languages. Solid choice if Qwen3-TTS does not fit your language requirements.
  • Voxtral TTS (Mistral) — 4B params, 90ms latency, 9 languages, 3-second cloning. High quality but uses a Creative Commons license that may not suit commercial products.
  • Dia2 (Nari Labs) — Apache 2.0, built for multi-speaker dialogue with speaker tags. 1B-2B variants. Best for conversational audio.
  • Coqui XTTS-v2 — The most battle-tested open-weight voice cloning model. 17 languages, short-reference cloning, Coqui Public Model License. Slower than newer models but reliable and well-documented.
  • Piper — MIT license, runs on a Raspberry Pi, 30+ languages. Low quality (3.3 MOS) but unmatched for edge deployment.

How to Choose

| If you need… | Pick this |
|---|---|
| Highest raw audio quality | Fish Audio S2 Pro |
| Best voice cloning + open license | Chatterbox (MIT) |
| Runs on a laptop, no GPU needed | Kokoro |
| Clean multilingual + streaming | Qwen3-TTS (Apache 2.0) |
| Emotional narration | Orpheus |
| Edge / embedded / Raspberry Pi | Piper |
| Multi-speaker dialogue | Dia2 |
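
If you prefer to filter by hard constraints first, the same decision can be sketched as a small lookup. The values below are copied from the comparison table in this article, using the low end of each VRAM range and treating Chatterbox Turbo and Orpheus as English-only; the `shortlist` helper and its parameters are hypothetical, and the license grouping is a simplification rather than legal guidance.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Model:
    name: str
    min_vram_gb: float   # low end of the VRAM range in the comparison table
    voice_cloning: bool
    languages: int
    license: str


MODELS = [
    Model("Fish Audio S2 Pro", 8, True, 80, "Research"),
    Model("Chatterbox Turbo", 4, True, 1, "MIT"),
    Model("Kokoro-82M", 2, False, 11, "Apache 2.0"),
    Model("Qwen3-TTS", 4, True, 10, "Apache 2.0"),
    Model("Orpheus", 8, False, 1, "Llama 3.2"),
]

# Simplification: treat only these two licenses as permissive
PERMISSIVE = {"MIT", "Apache 2.0"}


def shortlist(vram_gb: float, need_cloning: bool = False,
              need_permissive: bool = False, min_languages: int = 1) -> List[str]:
    """Names of models that satisfy every stated constraint."""
    return [
        m.name for m in MODELS
        if m.min_vram_gb <= vram_gb
        and (m.voice_cloning or not need_cloning)
        and (m.license in PERMISSIVE or not need_permissive)
        and m.languages >= min_languages
    ]


print(shortlist(vram_gb=6, need_cloning=True, need_permissive=True))
print(shortlist(vram_gb=16, need_permissive=True, min_languages=10))
```

Constraints such as latency targets or audio quality are not modeled here, so use the result as a starting shortlist and then listen to samples.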

The Bottom Line

Open-source TTS in 2026 has reached production maturity. The choice is no longer “open source vs commercial quality” but “which open source model fits my specific constraint.”

If you can afford the GPU and navigate the license, Fish Audio S2 Pro delivers the best quality. If you want voice cloning with zero commercial restrictions, Chatterbox is the winner. If you need something that runs on a customer’s laptop today, Kokoro is the only honest answer.

And if you are a Mac creator who cares about privacy, offline access, and quick revision workflows, that is the product layer Spokio focuses on. Spokio uses Chatterbox Turbo for local voice generation and wraps it in a Mac-native workflow so users do not have to manage model setup themselves.


Elo scores sourced from the TTS Arena leaderboard (May 2026). Quality benchmarks from Trelis Research’s Tricky TTS evaluation and independent model comparisons.
