
Voxtral TTS: Mistral's Open-Weight Speech Model for Voice Cloning

Mistral's Voxtral TTS is an open-weight speech model with zero-shot voice cloning, support for 9 languages, streaming with roughly 90 ms time-to-first-audio, and a CC BY-NC license.

Published on May 22, 2026 · 7 min read

In March 2026, Mistral introduced Voxtral TTS, an open-weight text-to-speech model with zero-shot voice cloning.

In Mistral’s arXiv paper, native-speaker human evaluations preferred Voxtral TTS over ElevenLabs Flash v2.5 for multilingual voice cloning, with a reported 68.4% win rate. That is an important benchmark, but it is still one evaluation setup, not a universal quality guarantee.

Here is what Voxtral is, why it matters, and what it means for the TTS ecosystem.

What is Voxtral TTS?

Voxtral is a text-to-speech model developed by Mistral AI, the French company behind the Mistral and Mixtral LLM series. Mistral’s model card lists Voxtral TTS as an Open v26.03 model, and the arXiv paper was submitted on March 26, 2026. The released weights use a CC BY-NC license.

Spec                    Value
Parameters              4B
Languages               English, French, German, Spanish, Dutch, Portuguese, Italian, Hindi, Arabic
Voice cloning           3 seconds of reference audio
Latency                 90 ms time-to-first-audio
License                 CC BY-NC
Hardware (minimum)      3 GB RAM
Hardware (recommended)  16 GB GPU
API                     Mistral-hosted, pricing listed as $16 per million characters
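
To put the listed API price in concrete terms, here is a minimal cost sketch. It assumes billing is a flat rate on the number of input characters, with whitespace and punctuation counted like any other character. That billing rule is an assumption rather than something the spec table states, so check the current pricing page before budgeting. The function name and the sample workload are hypothetical.

```python
# Rate copied from the spec table above; treat it as a snapshot, not a guarantee.
USD_PER_MILLION_CHARS = 16.0


def estimate_api_cost(num_chars: int, usd_per_million: float = USD_PER_MILLION_CHARS) -> float:
    """Estimate hosted-API cost in USD, assuming flat per-character billing."""
    if num_chars < 0:
        raise ValueError("num_chars must be non-negative")
    return num_chars / 1_000_000 * usd_per_million


if __name__ == "__main__":
    # Hypothetical workload: a 500,000-character manuscript.
    print(f"${estimate_api_cost(500_000):.2f}")  # prints "$8.00"
```

At this rate, a million characters costs $16, so a 500,000-character job comes to about $8 under the flat-rate assumption.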

Why it matters

Voxtral matters because it challenges the assumption that strong TTS quality always requires a closed cloud API.

Before Voxtral, the best open-weight TTS models were competitive but not dominant. Kokoro stood out for speed and efficiency, Chatterbox matched ElevenLabs in voice cloning, and Qwen3-TTS excelled at multilingual synthesis. But none had the kind of published head-to-head human-evaluation win that makes commercial users consider switching.

Voxtral’s reported win rate against ElevenLabs Flash v2.5 changes that calculation for at least one multilingual voice-cloning benchmark. Developers should still test it on their own voices, languages, and deployment constraints.
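
If you run your own A/B listening test, a win rate is only as trustworthy as the number of trials behind it. The sketch below computes a win rate from pairwise preference counts and adds a Wilson score interval to show the uncertainty. It assumes ties are dropped before counting, which may differ from how Mistral’s paper defines its win rate. The function names are illustrative.

```python
import math
from typing import Tuple


def win_rate(wins_a: int, wins_b: int) -> float:
    """Share of decided comparisons in which system A was preferred."""
    total = wins_a + wins_b
    if total == 0:
        raise ValueError("need at least one decided comparison")
    return wins_a / total


def wilson_interval(wins: int, total: int, z: float = 1.96) -> Tuple[float, float]:
    """Wilson score interval for a proportion (z=1.96 gives roughly 95% coverage)."""
    if total <= 0 or not 0 <= wins <= total:
        raise ValueError("require 0 <= wins <= total and total > 0")
    p = wins / total
    denom = 1 + z * z / total
    center = (p + z * z / (2 * total)) / denom
    half = z * math.sqrt(p * (1 - p) / total + z * z / (4 * total * total)) / denom
    return center - half, center + half


if __name__ == "__main__":
    # Hypothetical small test: listeners preferred A in 13 of 20 decided pairs.
    low, high = wilson_interval(13, 20)
    print(f"win rate {win_rate(13, 7):.2f}, interval {low:.2f} to {high:.2f}")
```

With only 20 decided comparisons, a 65% observed win rate still has a lower bound below 50%, so a small in-house test cannot confirm or contradict a published figure on its own.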

Voice cloning in 3 seconds

Voxtral’s voice cloning is designed to work from short reference audio: Mistral describes generation from as little as 3 seconds. For context, here are the approximate reference lengths cited for comparable systems:

  • Voxtral: 3 seconds
  • Qwen3-TTS: ~3 seconds
  • XTTS-v2: ~6 seconds
  • ElevenLabs Instant Cloning: ~60 seconds

This makes Voxtral especially useful for rapid prototyping, personalized voice applications, and any workflow where reference audio is scarce.
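
When reference audio is scarce, it helps to confirm that a clip clears the 3-second mark before sending it anywhere. This is a minimal sketch using only the Python standard library. It handles uncompressed PCM WAV files only, and whether your Voxtral runtime accepts WAV input is an assumption to verify. The function names and the threshold constant are illustrative.

```python
import wave

# Threshold taken from Mistral's "as little as 3 seconds" description.
MIN_REFERENCE_SECONDS = 3.0


def wav_duration_seconds(path: str) -> float:
    """Return the duration of a PCM WAV file in seconds."""
    with wave.open(path, "rb") as wav:
        return wav.getnframes() / wav.getframerate()


def is_long_enough(path: str, minimum: float = MIN_REFERENCE_SECONDS) -> bool:
    """True if the clip is at least `minimum` seconds long."""
    return wav_duration_seconds(path) >= minimum
```

Duration is only a floor. A 3-second clip full of background noise or silence may still clone poorly, so listen to the reference as well.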

Language coverage

Voxtral supports 9 languages: English, French, German, Spanish, Dutch, Portuguese, Italian, Hindi, and Arabic. Notably absent are Chinese, Japanese, and Korean. Coverage leans European, with Hindi and Arabic as the exceptions, which likely reflects Mistral’s geographic base.

For multilingual European content, Voxtral is worth evaluating seriously. For Asian-language content, compare it against models focused on those languages, such as Qwen-family or CosyVoice releases.
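
In a pipeline that handles several languages, this decision can be made explicit with a small lookup. The sketch below maps the nine listed languages to ISO 639-1 codes and splits a batch into supported and unsupported. The codes are a convenience for this example, and whether Voxtral’s own tooling takes language codes in this form is an assumption.

```python
from typing import Iterable, List, Tuple

# The nine languages listed above, keyed by ISO 639-1 code.
VOXTRAL_LANGUAGES = {
    "en": "English",
    "fr": "French",
    "de": "German",
    "es": "Spanish",
    "nl": "Dutch",
    "pt": "Portuguese",
    "it": "Italian",
    "hi": "Hindi",
    "ar": "Arabic",
}


def supports(lang_code: str) -> bool:
    """True if the base language of a tag such as 'pt-BR' is in the listed set."""
    base = lang_code.strip().lower().replace("_", "-").split("-")[0]
    return base in VOXTRAL_LANGUAGES


def split_by_support(codes: Iterable[str]) -> Tuple[List[str], List[str]]:
    """Split language codes into (supported, needs another model), preserving order."""
    supported: List[str] = []
    unsupported: List[str] = []
    for code in codes:
        (supported if supports(code) else unsupported).append(code)
    return supported, unsupported
```

For example, `split_by_support(["en", "zh", "hi"])` returns `(["en", "hi"], ["zh"])`, flagging Chinese as a language to evaluate with a different model.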

Running Voxtral locally

Voxtral’s local performance depends on the runtime, hardware, and model packaging. Treat any speed claims as deployment-specific and benchmark on your target Mac or GPU before promising realtime behavior.
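
One way to run that benchmark is to time the gap between starting a request and receiving the first audio chunk. The harness below works with any callable that returns an iterable of byte chunks. `fake_stream` is a hypothetical stand-in, not a Voxtral API, so swap in whatever streaming call your runtime actually exposes.

```python
import time
from typing import Callable, Iterable, Iterator, Tuple


def time_to_first_audio(open_stream: Callable[[], Iterable[bytes]]) -> Tuple[float, float]:
    """Return (seconds until first non-empty chunk, seconds until the stream ends)."""
    start = time.perf_counter()
    first_chunk_at = None
    for chunk in open_stream():
        if first_chunk_at is None and chunk:
            first_chunk_at = time.perf_counter() - start
    total = time.perf_counter() - start
    if first_chunk_at is None:
        raise RuntimeError("stream ended without producing audio")
    return first_chunk_at, total


def fake_stream(startup_delay: float = 0.02, chunks: int = 5) -> Iterator[bytes]:
    """Hypothetical stand-in for a real streaming TTS call."""
    time.sleep(startup_delay)  # simulates warm-up before the first chunk
    for _ in range(chunks):
        yield b"\x00" * 1024


if __name__ == "__main__":
    first, total = time_to_first_audio(fake_stream)
    print(f"first audio after {first * 1000:.1f} ms, full stream after {total * 1000:.1f} ms")
```

Run it several times and look at the spread, not a single number. The first call often includes model loading, and later calls may be much faster.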

The model is available as open weights, and Mistral also offers a hosted API for users who prefer not to self-host.

What Voxtral means for the TTS landscape

The release of Voxtral accelerates three ongoing trends:

  1. The gap between open-weight and commercial TTS is narrowing. Voxtral is another sign that open-weight models are becoming serious options for production evaluation.

  2. Voice cloning is becoming more common. Many major TTS releases now include some form of zero-shot or short-reference cloning. The question is no longer just “can it clone voices?” but “how well, how safely, and under what license?”

  3. License matters more than ever. Voxtral’s CC BY-NC license is not the same as a permissive commercial license. Developers building commercial products should verify their use case before integrating.

Where Spokio fits

Spokio is not a Voxtral or multi-model wrapper. It is an offline Mac text-to-speech app powered by Chatterbox Turbo.

For Mac users who want local English TTS without managing model downloads, GPU configuration, or Python environments, Spokio provides a native Mac workflow with local voice cloning from short samples, background processing, batch export, MP3/WAV/AIFF/M4A export, Apple Silicon and Intel support, and no cloud uploads for text, audio, or voice samples.
