Two years ago, recommending open-source TTS meant managing expectations. “It sounds okay for a free model” was the ceiling.
In 2026, that conversation is changing fast. Open-source models now compete seriously with commercial APIs in public benchmarks and vendor-published blind preference tests. Kokoro reports a Mean Opinion Score (MOS) of 4.5 on benchmark evaluations. Chatterbox reports a 63.75% preference rate against ElevenLabs in Resemble-published comparisons. Fish Audio S2 Pro holds an Elo of 1129 on the TTS Arena leaderboard.
On TTS Arena, the gap between the strongest open-weight systems and leading proprietary systems has narrowed significantly. The exact ranking changes as new models and votes arrive, but the direction is clear: open TTS is no longer just a fallback.
Here are the five models that matter right now.
Quick Comparison
| Model | Params | VRAM | Voice Cloning | Languages | License | TTS Arena Elo |
|---|---|---|---|---|---|---|
| Fish Audio S2 Pro | ~4.4B | 8-16GB | Yes (3s clip) | 80+ | Research | 1129 |
| Chatterbox Turbo | 350M | 4-8GB | Yes (5-10s) | EN (23 in the Multilingual variant) | MIT | 1006 |
| Kokoro-82M | 82M | 2-3GB | No (54 presets) | 11 | Apache 2.0 | 1056 |
| Qwen3-TTS | 600M | ~4GB | Yes (3s clip) | 10 | Apache 2.0 | — |
| Orpheus | 3B | 8-12GB | No | EN | Llama 3.2 | — |
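As a rough cross-check on the VRAM column, weight memory can be estimated as parameter count times bytes per parameter. The sketch below assumes 16-bit weights (2 bytes each), which is an assumption and not something taken from the model cards. It ignores activations, caches, vocoders, and framework overhead, so the figures in the table are expected to be higher than these estimates. The helper name `weight_gb` is hypothetical.

```python
# Rough weight-memory estimate: parameters x bytes per parameter.
# Assumption: fp16/bf16 weights (2 bytes each). Activations, caches and
# framework overhead are NOT included, so real VRAM use will be higher.

PARAMS = {
    "Fish Audio S2 Pro": 4.4e9,
    "Chatterbox Turbo": 350e6,
    "Kokoro-82M": 82e6,
    "Qwen3-TTS": 600e6,
    "Orpheus": 3e9,
}


def weight_gb(n_params: float, bytes_per_param: int = 2) -> float:
    """Return the approximate weight size in gigabytes (1 GB = 1e9 bytes)."""
    return n_params * bytes_per_param / 1e9


for name, n_params in PARAMS.items():
    print(f"{name}: ~{weight_gb(n_params):.2f} GB of weights")
```

Under this assumption the Fish Audio S2 Pro weights alone come to about 8.8 GB, which lines up with the lower end of its 8-16GB range, while Kokoro's weights come to under 0.2 GB.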
1. Fish Audio S2 Pro — The Benchmark King
Fish Audio S2 Pro is the highest-ranked open-weight model on the independent TTS Arena leaderboard at Elo 1129 — ahead of several commercial APIs including OpenAI TTS-1 (Elo 1102) and Google’s Studio model (Elo 1062). It wins 61% of its matchups.
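To put the Elo gaps in perspective, the sketch below converts a rating difference into an expected head-to-head preference rate. It assumes the standard Elo logistic formula with a 400-point scale; the leaderboard's exact rating method may differ, and `elo_expected` is a hypothetical helper name.

```python
def elo_expected(rating_a: float, rating_b: float) -> float:
    """Expected score of A against B under the standard Elo logistic model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


# Ratings quoted in the paragraph above.
s2_vs_openai = elo_expected(1129, 1102)
s2_vs_google = elo_expected(1129, 1062)

print(f"S2 Pro vs OpenAI TTS-1:   {s2_vs_openai:.3f}")
print(f"S2 Pro vs Google Studio:  {s2_vs_google:.3f}")
```

Under that assumption, a 27-point lead corresponds to being preferred in roughly 54% of direct comparisons, and a 67-point lead to roughly 60%. The 61% figure quoted above is an overall win rate across all opponents, so it is not directly comparable to either pairing.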
What makes it stand out:
- 80+ language support — the widest coverage of any open-source TTS model
- Voice cloning from approximately 3 seconds of reference audio
- Inline emotion tags for expressive control
- Sits at rank 11 overall, bested only by closed-source systems
The catch: The license is complex. The code uses a Fish Audio Research License, and the terms for the weights vary by model version, so commercial use requires careful reading. At roughly 4.4B parameters, it also requires significant GPU memory for self-hosting.
Best for: Applications where audio quality is the single priority and you can navigate the licensing.
```python
# Minimal usage pattern; check the Fish Speech repository for the exact class and argument names
from fish_speech import FishSpeech

model = FishSpeech.from_pretrained("fishaudio/s2-pro")
wav = model.generate("Your text here", reference_audio="voice_sample.wav")
```
2. Chatterbox — Best Voice Cloning (MIT)
Chatterbox by Resemble AI is the model that made people stop saying “ElevenLabs is unbeatable.” Resemble-published blind evaluations report a 63.75% preference rate for Chatterbox over ElevenLabs. It is notable for controllable emotion exaggeration — a single parameter (0.25-2.0) that dials expressiveness from monotone to dramatic.
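One way to get a feel for the exaggeration control is to render the same sentence at several settings across the stated 0.25-2.0 range. The sketch below assumes a model object exposing the `generate(text, audio_prompt_path=..., exaggeration=...)` call used in this section's usage snippet. `exaggeration_steps` and `render_sweep` are hypothetical helper names, not part of the Chatterbox package.

```python
from typing import Any, List, Tuple

EXAGGERATION_MIN = 0.25  # lower bound quoted in the text
EXAGGERATION_MAX = 2.0   # upper bound quoted in the text


def exaggeration_steps(n: int) -> List[float]:
    """Return n evenly spaced exaggeration values covering the full range."""
    if n < 2:
        raise ValueError("need at least two steps to span the range")
    span = EXAGGERATION_MAX - EXAGGERATION_MIN
    return [round(EXAGGERATION_MIN + span * i / (n - 1), 3) for i in range(n)]


def render_sweep(model: Any, text: str, reference_wav: str, n: int = 4) -> List[Tuple[float, Any]]:
    """Generate one clip per exaggeration value with a caller-supplied model.

    Assumes model.generate accepts audio_prompt_path and exaggeration
    keyword arguments, as in the usage snippet for this section.
    """
    return [
        (level, model.generate(text, audio_prompt_path=reference_wav, exaggeration=level))
        for level in exaggeration_steps(n)
    ]
```

Listening to the resulting clips side by side is usually a quicker way to pick a setting than adjusting the value one run at a time.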
Three variants:
| Variant | Params | Languages | Key Feature |
|---|---|---|---|
| Original | 500M | EN | Emotion exaggeration + CFG tuning |
| Multilingual | 500M | 23 | Zero-shot cloning across languages |
| Turbo | 350M | EN | Paralinguistic tags, sub-200ms latency |
Architecture highlights:
- Modified Llama 3 backbone (500M params) for autoregressive speech token generation
- Alignment-informed inference that detects and suppresses hallucinations in real time
- Conditional flow matching decoder for token-to-mel conversion (distilled to 1 step in Turbo)
- PerTh perceptual watermarking on every output
- MIT license — no restrictions on commercial use
Best for: Voice cloning projects, production deployments needing emotion control, and anyone who wants ElevenLabs-quality output without the API bill.
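```python
import torchaudio as ta
from chatterbox.tts import ChatterboxTTS

model = ChatterboxTTS.from_pretrained(device="cuda")
wav = model.generate(
    "This is an open-source voice clone.",
    audio_prompt_path="reference.wav",  # short clip of the voice to clone
    exaggeration=0.7,  # 0.25-2.0; higher values sound more dramatic
)
ta.save("output.wav", wav, model.sr)  # write the clip at the model's sample rate
```
3. Kokoro-82M — The Efficiency Champion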
Kokoro is the model that redefined what “good enough” means. At 82 million parameters, roughly a quarter the size of Chatterbox Turbo and under 3% the size of Orpheus, it achieves a Mean Opinion Score (MOS) of 4.5 and a 17% Character Error Rate (CER) on the Tricky TTS benchmark, beating every proprietary model tested in that evaluation.
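For readers unfamiliar with the metric, Character Error Rate is typically computed by transcribing the synthesized audio with a speech recognizer and dividing the character-level edit distance between that transcript and the input text by the length of the input text. The sketch below assumes that standard definition. How the Tricky TTS benchmark normalizes text or which recognizer it uses is not covered here, and the example strings are made up.

```python
def levenshtein(a: str, b: str) -> int:
    """Character-level edit distance (insertions, deletions, substitutions)."""
    prev = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        curr = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def cer(reference: str, hypothesis: str) -> float:
    """Edit distance divided by the reference length."""
    if not reference:
        raise ValueError("reference must be non-empty")
    return levenshtein(reference, hypothesis) / len(reference)


# Hypothetical example: the recognizer heard "2025" where the input said "2026".
print(f"{cer('the year 2026', 'the year 2025'):.3f}")
```

Here one substituted character in a 13-character reference gives a CER of about 7.7%. Lower is better, so a 17% score on deliberately tricky text means roughly one character in six was transcribed differently from the input.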
The numbers that matter:
- 210x real-time on an RTX 4090
- 36x real-time on a free T4 Colab GPU
- 5x real-time on CPU
- 54 built-in voices (no cloning needed)
- Apache 2.0 license
- Supports English, Japanese, Chinese, Korean, French, German, Italian, Portuguese, Spanish, Hindi, Russian
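The speed multiples in the list above translate directly into wall-clock time: synthesis time is audio duration divided by the multiple. The sketch below works that out for one hour of audio. It assumes the quoted multiples hold steadily over long inputs, which may not be the case in practice, and `synthesis_seconds` is a hypothetical helper name.

```python
# Speed multiples quoted above: seconds of audio produced per second of compute.
SPEED_MULTIPLE = {
    "RTX 4090": 210,
    "T4 (Colab)": 36,
    "CPU": 5,
}


def synthesis_seconds(audio_seconds: float, speed_multiple: float) -> float:
    """Wall-clock seconds needed to synthesize the given amount of audio."""
    return audio_seconds / speed_multiple


ONE_HOUR = 3600  # seconds of finished audio
for hardware, multiple in SPEED_MULTIPLE.items():
    print(f"{hardware}: {synthesis_seconds(ONE_HOUR, multiple):.1f} s per hour of audio")
```

At these rates an hour of speech takes about 17 seconds on the RTX 4090, 100 seconds on the T4, and 12 minutes on CPU.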
Architecture: Built on StyleTTS 2 and ISTFTNet — no diffusion, no autoregressive loop, just a single forward pass. This is why it is fast enough to run on a laptop without a GPU.
The tradeoff: No voice cloning. You get 54 preset voices. For many use cases that is enough, but if you need custom voices, Chatterbox or Qwen3-TTS are better choices.
Best for: Desktop apps, CPU deployments, real-time applications, and anyone who needs production-quality speech with minimal hardware.
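```python
import soundfile as sf
from kokoro import KPipeline

pipeline = KPipeline(lang_code="a")  # "a" selects American English
generator = pipeline("Hello, this is Kokoro speaking.", voice="af_heart")
# The pipeline yields one audio segment at a time; write each to its own
# file so later segments do not overwrite earlier ones.
for i, (_, _, audio) in enumerate(generator):
    sf.write(f"output_{i}.wav", audio, 24000)
```
4. Qwen3-TTS — The Multilingual Newcomer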
Qwen3-TTS by Alibaba was released in early 2026 and immediately became a serious contender. The 600M-parameter variant ships under Apache 2.0 and reports low-latency streaming, support for 10 languages, and zero-shot voice cloning from a short reference clip.
Key specs:
- 0.6B and 1.7B model sizes
- 10 languages: Chinese, English, Japanese, Korean, German, French, Russian, Portuguese, Spanish, Italian
- Reported low time-to-first-audio for streaming
- Instruction-based emotion control
- Apache 2.0 on both code and weights
Where it shines: Multilingual voice cloning with a permissive open-release license. Qwen3-TTS is appealing for commercial multilingual deployments, but teams should still verify the current repository, model card, and license terms before shipping.
Best for: Multilingual applications, streaming voice agents, and teams that need a permissive license.
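```python
# Minimal usage pattern; check the model card for the exact class and argument names
from qwen_tts import Qwen3TTS

model = Qwen3TTS.from_pretrained("Qwen/Qwen3-TTS-0.6B")
wav = model.generate(
    "Bonjour, ceci est un test multilingue.",
    reference_audio="sample.wav",  # short clip of the voice to clone
    language="fr",
)
```
5. Orpheus — Most Expressive (Llama 3.2)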
Orpheus by Canopy Labs is built on Meta’s Llama 3.2 architecture and uses SNAC audio tokens for high-fidelity output. It scores 4.2 MOS and 21% CER on the Tricky TTS benchmark, with particularly strong prosody and emotional range.
What makes it different:
- Native emotion tags for controlling delivery
- Llama 3.2 backbone — benefits from the broader Llama ecosystem
- Smooth, realistic audio with good handling of paralinguistic content
- 3B parameter model
The tradeoff: It requires a GPU (8-12GB VRAM) and “Built with Llama” attribution, and it has no built-in voice cloning. Fine-tuning is possible but requires expertise.
Best for: Narrative content, audiobooks, and applications where emotional delivery matters more than voice flexibility.
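```python
# Minimal usage pattern; check the Orpheus repository for the exact class and argument names
from orpheus import OrpheusTTS

model = OrpheusTTS.from_pretrained("orpheus-3b")
wav = model.generate(
    "She hesitated, then whispered: <whisper>I shouldn't be here.</whisper>",
    emotion="anxious",
)
```
Honorable Mentions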
These models did not make the top five but are worth knowing about:
- CosyVoice 3.0 (Alibaba) — Apache 2.0, 0.5B, excellent cross-lingual voice cloning, 9 languages. Solid choice if Qwen3-TTS does not fit your language requirements.
- Voxtral TTS (Mistral) — 4B params, 90ms latency, 9 languages, 3-second cloning. High quality but uses a Creative Commons license that may not suit commercial products.
- Dia2 (Nari Labs) — Apache 2.0, built for multi-speaker dialogue with speaker tags. 1B-2B variants. Best for conversational audio.
- Coqui XTTS-v2 — The most battle-tested open-weight voice cloning model. 17 languages, short-reference cloning, Coqui Public Model License. Slower than newer models but reliable and well-documented.
- Piper — MIT license, runs on a Raspberry Pi, 30+ languages. Low quality (3.3 MOS) but unmatched for edge deployment.
How to Choose
| If you need… | Pick this |
|---|---|
| Highest raw audio quality | Fish Audio S2 Pro |
| Best voice cloning + open license | Chatterbox (MIT) |
| Runs on a laptop, no GPU needed | Kokoro |
| Clean multilingual + streaming | Qwen3-TTS (Apache 2.0) |
| Emotional narration | Orpheus |
| Edge / embedded / Raspberry Pi | Piper |
| Multi-speaker dialogue | Dia2 |
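The same selection logic can be expressed as a small filter over the Quick Comparison table. The sketch below is one possible encoding, not a definitive rule set: it uses the lower end of each VRAM range as the minimum requirement, treats MIT and Apache 2.0 as the "permissive" group, and uses the hypothetical names `Model`, `MODELS`, and `shortlist`.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Model:
    name: str
    min_vram_gb: float   # lower end of the VRAM range in the comparison table
    voice_cloning: bool
    license: str


# Values copied from the Quick Comparison table.
MODELS = [
    Model("Fish Audio S2 Pro", 8, True, "Research"),
    Model("Chatterbox Turbo", 4, True, "MIT"),
    Model("Kokoro-82M", 2, False, "Apache 2.0"),
    Model("Qwen3-TTS", 4, True, "Apache 2.0"),
    Model("Orpheus", 8, False, "Llama 3.2"),
]

# Assumption: these two licenses count as permissive for this filter.
PERMISSIVE = {"MIT", "Apache 2.0"}


def shortlist(vram_gb: float, need_cloning: bool = False,
              need_permissive: bool = False) -> List[str]:
    """Return the names of models that satisfy every requested constraint."""
    return [
        m.name
        for m in MODELS
        if m.min_vram_gb <= vram_gb
        and (m.voice_cloning or not need_cloning)
        and (m.license in PERMISSIVE or not need_permissive)
    ]


print(shortlist(vram_gb=4, need_cloning=True, need_permissive=True))
print(shortlist(vram_gb=3))
```

With a 4GB budget, cloning required, and a permissive license required, the filter leaves Chatterbox Turbo and Qwen3-TTS. With a 3GB budget and no other constraints, only Kokoro-82M remains. Quality rankings are deliberately left out, so the shortlist still needs to be weighed against the listening results discussed earlier.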
The Bottom Line
Open-source TTS in 2026 has reached production maturity. The choice is no longer “open source vs commercial quality” but “which open source model fits my specific constraint.”
If you can afford the GPU and navigate the license, Fish Audio S2 Pro delivers the best quality. If you want voice cloning with zero commercial restrictions, Chatterbox is the winner. If you need something that runs on a customer’s laptop today, Kokoro is the only honest answer.
And if you are a Mac creator who cares about privacy, offline access, and quick revision workflows, that is the product layer Spokio focuses on. Spokio uses Chatterbox Turbo for local voice generation and wraps it in a Mac-native workflow so users do not have to manage model setup themselves.
Elo scores sourced from the TTS Arena leaderboard (May 2026). Quality benchmarks from Trelis Research’s Tricky TTS evaluation and independent model comparisons.
