Alibaba’s Tongyi Lab released CosyVoice 3 in May 2025 as the latest iteration of their open-source speech synthesis family. The technical report (arXiv:2505.17589) describes a model that scales training data from 10,000 to 1,000,000 hours, increases parameters from 0.5B to 1.5B, and introduces a supervised multi-task speech tokenizer plus a novel differentiable reward optimization (DiffRO) method for post-training.
This article breaks down the architecture, the tokenizer design, the three-stage pipeline, the DiffRO post-training method, and what makes CosyVoice 3 different from its predecessors.
Why CosyVoice 3 Matters
CosyVoice 3 is Alibaba’s answer to “in-the-wild” speech generation — handling diverse domains (e-commerce, navigation, finance, education), multiple languages, Chinese dialects, varied text formats, and emotional speech. It builds on CosyVoice 2’s streaming LLM + flow matching architecture but makes four fundamental upgrades:
- A new supervised multi-task speech tokenizer — replaces CosyVoice 2’s SenseVoice-based tokenizer with one derived from MinMo, trained on ASR, emotion recognition, language ID, audio event detection, and speaker analysis
- DiffRO (Differentiable Reward Optimization) — a post-training method that optimizes speech tokens directly via backprop instead of expensive RL loops
- Data scaling — 10K hours → 1M hours across 9 languages and 18+ Chinese dialects
- Model scaling — LM 0.5B → 1.5B, CFM renderer 100M → 300M with DiT backbone
| Metric | CosyVoice 2 (0.5B) | CosyVoice 3 (0.5B base) | CosyVoice 3 (0.5B + RL) |
|---|---|---|---|
| Seed-TTS Eval CER (ZH) | 1.45% | 1.21% | 0.81% |
| Seed-TTS Eval WER (EN) | 2.57% | 2.24% | 1.68% |
| Speaker Similarity (ZH) | 75.7% | 78.0% | 77.4% |
| Speaker Similarity (EN) | 65.9% | 71.8% | 69.5% |
The Three-Stage Pipeline
CosyVoice 3 follows a three-stage architecture that has become the dominant pattern in modern TTS:
Text → LLM Token Generator → DiT Flow Matching → HiFi-GAN Vocoder → Audio
Stage 1: LLM — Speech Token Generation
The language model is a Qwen2.5-0.5B backbone that generates discrete FSQ (Finite Scalar Quantization) speech tokens autoregressively.
Input: text tokens + speaker embedding + instruction tokens
→ Qwen2.5-0.5B (24 layers, 896 hidden dim, GQA 14/2)
→ Autoregressive prediction of FSQ speech tokens at 25 Hz
→ Output: discrete token sequence [t₁, t₂, ..., tₙ]
LLM architecture details:
| Parameter | Value |
|---|---|
| Backbone | Qwen2.5-0.5B |
| Layers | 24 |
| Hidden dimension | 896 |
| Query heads | 14 |
| Key/Value heads | 2 (GQA) |
| FSQ vocabulary | 6,561 |
| Quantization | 4-bit (inference) |
| Token rate | 25 Hz |
| Sampling | Top-k=25, Top-p=0.8, RAS enabled |
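Repetition Aware Sampling (RAS), introduced in VALL-E 2, checks how often the sampled token already occurs among the last 10 generated positions. When it repeats too often, the token is redrawn from the full distribution instead of the truncated top-k/top-p set. This prevents repetitive audio artifacts and is critical for output stability in long-form generation.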
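As a rough illustration of the sampling rule described above, here is a minimal pure-Python sketch. It is not the CosyVoice implementation: the function names and the `max_repeats` threshold are hypothetical, while the window of 10, top-k of 25 and top-p of 0.8 are taken from the table above.

```python
import random
from typing import List, Sequence, Tuple


def top_k_top_p_candidates(probs: Sequence[float], top_k: int = 25,
                           top_p: float = 0.8) -> List[Tuple[int, float]]:
    """Keep the top_k most likely tokens, then cut at cumulative mass top_p."""
    ranked = sorted(enumerate(probs), key=lambda kv: kv[1], reverse=True)[:top_k]
    kept, mass = [], 0.0
    for token_id, p in ranked:
        kept.append((token_id, p))
        mass += p
        if mass >= top_p:
            break
    return kept


def ras_sample(probs: Sequence[float], history: Sequence[int], rng: random.Random,
               window: int = 10, max_repeats: int = 1) -> int:
    """Sample from the truncated set; redraw from the full distribution if the
    chosen token already appears max_repeats times in the recent window.
    max_repeats is a hypothetical threshold chosen for illustration."""
    kept = top_k_top_p_candidates(probs)
    ids = [token_id for token_id, _ in kept]
    weights = [p for _, p in kept]
    token = rng.choices(ids, weights=weights, k=1)[0]
    recent = list(history[-window:])
    if recent.count(token) >= max_repeats:
        # Fallback: sample from the untruncated distribution to break the loop.
        token = rng.choices(range(len(probs)), weights=list(probs), k=1)[0]
    return token
```

With `probs = [0.9, 0.05, 0.05]` and an empty history the sketch always returns token 0, since the nucleus contains only that token. Once token 0 fills the recent window, the fallback lets the low-probability tokens through occasionally.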
Silent token filtering: CosyVoice 3 filters up to 5 consecutive silent FSQ tokens (11 specific token IDs are classified as silence), preventing long pauses in generated speech.
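One way to read the silent-token rule is as a cap on runs of silence: at most five consecutive silent tokens are kept and any further ones are dropped. The sketch below assumes that reading. The function name is hypothetical and the silent IDs in the usage note are placeholders, not the 11 real token IDs.

```python
from typing import Iterable, List, Set


def cap_silent_runs(tokens: Iterable[int], silent_ids: Set[int],
                    max_run: int = 5) -> List[int]:
    """Drop silent tokens once a run of consecutive silent tokens exceeds max_run."""
    out: List[int] = []
    run = 0
    for tok in tokens:
        if tok in silent_ids:
            run += 1
            if run > max_run:
                continue  # run is already at the cap: skip this token
        else:
            run = 0  # any non-silent token resets the run counter
        out.append(tok)
    return out
```

For example, with placeholder silent IDs `{1, 2}`, the sequence `[7, 1, 1, 2, 1, 2, 1, 1, 8, 1]` is reduced to `[7, 1, 1, 2, 1, 2, 8, 1]`.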
Stage 2: DiT Flow Matching — Mel-Spectrogram Synthesis
CosyVoice 3 replaces the U-Net-style flow matching encoder from CosyVoice 2 with a Diffusion Transformer (DiT) backbone. This is a key architectural change — DiT scales better with compute and eliminates the need for a separate text encoder and length regularization module.
Discrete speech tokens [t₁, t₂, ..., tₙ]
→ Token embeddings
→ DiT (22 layers, 1024 dim, 16 attention heads)
→ AdaLN conditioning (speaker embedding, CFG guidance)
→ Euler ODE solver (10 steps)
→ Mel-spectrogram (80-band)
DiT parameters:
| Parameter | Value |
|---|---|
| Layers | 22 |
| Dimension | 1024 |
| Attention heads | 16 |
| Conditioning | AdaLN (Adaptive Layer Norm) |
| Parameters | 300M |
| ODE solver | Euler, 10 steps |
| CFG rate | 0.7 |
The DiT architecture uses AdaLN (Adaptive Layer Normalization) for conditioning — speaker embeddings, style instructions, and classifier-free guidance scale are injected via learned affine transformations at each transformer block. This is more parameter-efficient than cross-attention conditioning.
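To make the sampling side of the table concrete, the following sketch integrates a velocity field with 10 Euler steps and a guidance rate of 0.7. The guidance formula used here, `(1 + w) * v_cond - w * v_uncond`, is one common way to combine conditional and unconditional predictions and is an assumption, as are the function and argument names. A real system would call the DiT for both velocity predictions.

```python
from typing import Callable, List

Vector = List[float]
VelocityFn = Callable[[Vector, float], Vector]


def euler_cfg_solve(x0: Vector, v_cond: VelocityFn, v_uncond: VelocityFn,
                    steps: int = 10, cfg_rate: float = 0.7) -> Vector:
    """Integrate dx/dt = v(x, t) from t=0 to t=1 with fixed-step Euler updates."""
    x = list(x0)
    dt = 1.0 / steps
    for i in range(steps):
        t = i * dt
        vc = v_cond(x, t)
        vu = v_uncond(x, t)
        # Classifier-free guidance: push the conditional velocity away from
        # the unconditional one (assumed form, see the note above).
        v = [(1.0 + cfg_rate) * c - cfg_rate * u for c, u in zip(vc, vu)]
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x


# Toy velocity field: a straight path from the start point to a fixed target.
def straight_line_to(target: Vector, start: Vector) -> VelocityFn:
    delta = [t - s for t, s in zip(target, start)]
    return lambda x, t: list(delta)
```

With a straight-line toy field, ten Euler steps land on the target up to floating-point error, which is why flow matching models can use so few solver steps when the learned paths are close to straight.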
The frame rate mismatch between speech tokens (25 Hz) and mel features (higher resolution) is handled by a simple interpolation operation — no more complicated text encoders or length regulators.
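The size of the mismatch can be estimated from figures given elsewhere in this article. Assuming the vocoder's 480× upsample ratio maps one mel frame to 480 samples at 24 kHz, the mel rate is 50 frames per second, twice the token rate. The sketch below uses nearest-neighbour repetition as the simplest possible interpolation; the interpolation mode used by the model itself is not specified here, and the function names are hypothetical.

```python
from typing import List, Sequence


def mel_frames_per_token(sample_rate: int = 24_000, vocoder_upsample: int = 480,
                         token_rate: int = 25) -> int:
    """Mel frames per speech token, derived from the vocoder upsample ratio."""
    mel_rate = sample_rate // vocoder_upsample  # 24000 / 480 = 50 frames/s
    return mel_rate // token_rate               # 50 / 25 = 2


def upsample_nearest(token_features: Sequence[Sequence[float]],
                     factor: int) -> List[List[float]]:
    """Repeat each token embedding `factor` times along the time axis."""
    return [list(feat) for feat in token_features for _ in range(factor)]
```

Under these assumptions each token embedding covers two mel frames, so a one-second utterance of 25 tokens becomes a 50-frame conditioning sequence.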
Stage 3: HiFi-GAN Vocoder — Waveform Generation
The final stage uses a Neural Source Filter (NSF) HiFi-GAN vocoder to convert mel-spectrograms to 24 kHz waveforms.
Mel-spectrogram (80-band)
→ NSF HiFi-GAN (8 harmonics, upsample ratio 480×)
→ Inverse STFT (n_fft=16, hop=4)
→ 24 kHz audio waveform
| Parameter | Value |
|---|---|
| Harmonics | 8 |
| Upsample ratio | 480× |
| ISTFT | n_fft=16, hop=4 |
| Output sample rate | 24 kHz |
The Supervised Multi-Task Speech Tokenizer
CosyVoice 3’s most important innovation is its speech tokenizer. Unlike CosyVoice 2 which inserted an FSQ module into SenseVoice-Large (an ASR model), CosyVoice 3 uses MinMo — a multimodal LLM trained on 1.4M+ hours of speech with SOTA performance on spoken dialogue, multilingual ASR, and emotion recognition.
Tokenizer Architecture
Input speech X
→ Voice Encoder 1 (12 Transformer blocks with RoPE)
→ FSQ module:
- Project to D-dimensional low-rank space
- Bounded round operation ROUND (quantize to [-K, K])
- Project back to original dimension
- Compute index from quantized values
→ Voice Encoder 2 + MinMo LLM (training only)
→ Multi-task supervision: ASR, LID, SER, AED, SA
The FSQ module is simpler than traditional RVQ (Residual Vector Quantization). It projects into a low-dimensional space, applies a bounded round operation for quantization, and computes a single index. This produces a 25 Hz token stream, i.e. 25 discrete tokens per second of audio.
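The quantize-and-index step can be sketched in a few lines. The vocabulary size of 6,561 quoted earlier equals 3^8, which is consistent with D = 8 dimensions and K = 1 (three levels per dimension); that reading is an inference from the arithmetic. The tanh bounding and the function names are likewise assumptions made for illustration.

```python
import math
from typing import List, Sequence


def fsq_quantize(latent: Sequence[float], k: int = 1) -> List[int]:
    """Bounded round: squash each value into (-k, k), then round to an integer."""
    return [int(round(k * math.tanh(z))) for z in latent]


def fsq_index(codes: Sequence[int], k: int = 1) -> int:
    """Read the quantized vector as a number in base (2k + 1)."""
    base = 2 * k + 1
    return sum((c + k) * base ** j for j, c in enumerate(codes))


codes = fsq_quantize([3.0, -3.0, 0.1, 0.0, 2.5, -0.2, -4.0, 1.8])
token_id = fsq_index(codes)
```

Because every dimension takes one of `2k + 1` values, the token ID is just a mixed-radix encoding of the rounded vector. No codebook lookup or nearest-neighbour search is needed, which is the main simplification over RVQ.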
Multi-Task Supervision
The tokenizer is trained on ~530,000 hours of speech with five supervision signals:
| Task | Label | Purpose |
|---|---|---|
| ASR | Text transcription | Capture phonetic content |
| Language ID | Language label | Enable multilingual tokens |
| Speech Emotion Recognition | Emotion label | Preserve paralinguistic cues |
| Audio Event Detection | Event labels | Detect non-speech events |
| Speaker Analysis | Speaker identity | Preserve voice characteristics |
The key insight: by supervising the tokenizer on multiple speech understanding tasks simultaneously, the discrete tokens learn to retain semantic, paralinguistic, and speaker information — not just acoustic reconstruction. This is what enables CosyVoice 3’s improved prosody naturalness compared to CosyVoice 2.
Comparison of tokenizer approaches:
| Approach | Base Model | Token Rate | Supervision |
|---|---|---|---|
| CosyVoice v1 | Custom encoder | Variable | Semantic only |
| CosyVoice 2 | SenseVoice-Large + FSQ | 25 Hz | ASR only |
| CosyVoice 3 | MinMo + FSQ | 25 Hz fixed | ASR + LID + SER + AED + SA |
DiffRO: Differentiable Reward Optimization
Post-training TTS models with reinforcement learning is difficult because the audio must go through the CFM model and vocoder before a reward can be computed — and those downstream models are computationally expensive. Worse, after processing, all audio samples sound similar, making it hard for a reward model to distinguish good from bad.
CosyVoice 3’s DiffRO avoids this entirely by operating directly on speech tokens.
How DiffRO Works
1. Train a Token2Text model (ASR-like) on speech token sequences
→ Given tokens, predict text posterior probabilities
2. For each training prompt:
a. LLM generates candidate speech token sequences
b. Token2Text computes reward = P(correct_text | tokens)
c. Gumbel-Softmax makes the token sampling differentiable
d. Backprop through the LLM to maximize reward
3. Add KL divergence on token logits to prevent drift
from the reference model
Why Gumbel-Softmax? Speech token sampling is normally a discrete operation (argmax or categorical sampling) and therefore not differentiable. Gumbel-Softmax replaces this with a continuous relaxation that can be annealed toward true categorical sampling during training, enabling standard backpropagation.
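The relaxation itself is short. The sketch below shows only the forward computation in plain Python, with a hypothetical function name. In an autograd framework the same expression is differentiable with respect to the logits, which is what would let a reward gradient flow back into the LLM.

```python
import math
import random
from typing import List, Sequence


def gumbel_softmax(logits: Sequence[float], tau: float,
                   rng: random.Random) -> List[float]:
    """Return a soft one-hot sample: softmax((logits + Gumbel noise) / tau)."""
    eps = 1e-20  # guards against log(0)
    noisy = []
    for logit in logits:
        u = rng.random()
        gumbel = -math.log(-math.log(u + eps) + eps)
        noisy.append((logit + gumbel) / tau)
    peak = max(noisy)  # subtract the max for numerical stability
    exps = [math.exp(n - peak) for n in noisy]
    total = sum(exps)
    return [e / total for e in exps]


logits = [2.0, 0.5, -1.0]
warm = gumbel_softmax(logits, tau=1.0, rng=random.Random(7))
cold = gumbel_softmax(logits, tau=0.01, rng=random.Random(7))
```

With the same noise, lowering `tau` keeps the winning token unchanged but sharpens the distribution toward a one-hot vector. That is the annealing behaviour described above.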
Multi-task rewards: DiffRO can incorporate multiple reward signals:
| Reward | Signal Source |
|---|---|
| Content accuracy | Token2Text posterior |
| Emotion accuracy | Emotion classifier on tokens |
| Audio quality | MOS prediction model |
| Instruction adherence | Style classifier |
The KL divergence term keeps the post-trained model from collapsing or drifting too far:
Loss = -Reward + β × KL(π_post || π_ref)
where β controls the strength of the KL penalty.
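A small numeric sketch of this objective follows. Averaging the KL term over token positions and the default `beta = 0.1` are assumptions made for illustration, and the function names are hypothetical. The source formula only fixes the overall shape of the loss.

```python
import math
from typing import Sequence


def kl_divergence(p: Sequence[float], q: Sequence[float]) -> float:
    """KL(p || q) for two categorical distributions over the same tokens."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)


def diffro_loss(reward: float,
                post_probs: Sequence[Sequence[float]],
                ref_probs: Sequence[Sequence[float]],
                beta: float = 0.1) -> float:
    """Loss = -reward + beta * mean per-position KL(post || ref)."""
    kl = sum(kl_divergence(p, q) for p, q in zip(post_probs, ref_probs))
    kl /= len(post_probs)
    return -reward + beta * kl


ref = [[0.7, 0.2, 0.1], [0.1, 0.8, 0.1]]
drifted = [[0.4, 0.5, 0.1], [0.1, 0.6, 0.3]]
loss_same = diffro_loss(0.9, ref, ref)         # KL term is zero
loss_drifted = diffro_loss(0.9, drifted, ref)  # KL term adds a penalty
```

When the post-trained distribution equals the reference, the loss reduces to the negated reward. Any drift adds a positive penalty scaled by `beta`.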
Data Scaling and Multilingual Coverage
CosyVoice 3’s training data expansion from 10K to 1M hours involved a sophisticated data pipeline:
Data Pipeline
Raw in-the-wild audio (web, podcasts, video)
→ ASR transcription (multiple models)
→ Pairwise WER filtering (< 15% disagreement)
→ Forced alignment for punctuation adjustment
(add comma if pause > 300ms, remove if < 50ms)
→ Volume normalization
→ Speech/text length ratio filtering (remove bottom 1%, top 5%)
→ Clean paired dataset
Language Coverage
| Category | Coverage |
|---|---|
| Primary languages | Chinese, English, Japanese, Korean, Russian, French, German, Spanish, Italian |
| Chinese dialects | 18+ (Guangdong/Cantonese, Minnan, Sichuan, Dongbei, Shanxi, Shanghai, Tianjin, Shandong, Ningxia, Gansu, etc.) |
| Text formats | Standard, normalized, inverse-normalized, mixed phoneme/character |
| Domains | E-commerce, navigation, finance, education, conversation, speech, singing |
Self-training: An early version of CosyVoice 3 was used to generate synthetic data for rare cases (unusual text formats, edge-case pronunciations), then those generations were filtered and added to the training set.
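The pairwise WER filter from the data pipeline above can be sketched as follows. The function names are hypothetical, and normalizing by the longer transcript is an assumption chosen so the score is symmetric. For Chinese text the same edit distance would be computed over characters rather than whitespace-separated words.

```python
from typing import List, Sequence


def edit_distance(ref: Sequence[str], hyp: Sequence[str]) -> int:
    """Levenshtein distance between two token sequences."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        cur = [i]
        for j, h in enumerate(hyp, start=1):
            cur.append(min(prev[j] + 1,              # deletion
                           cur[j - 1] + 1,           # insertion
                           prev[j - 1] + (r != h)))  # substitution or match
        prev = cur
    return prev[-1]


def pairwise_wer(a: str, b: str) -> float:
    """Word-level disagreement between two ASR transcripts of one utterance."""
    words_a, words_b = a.split(), b.split()
    return edit_distance(words_a, words_b) / max(len(words_a), len(words_b), 1)


def keep_utterance(transcripts: List[str], max_wer: float = 0.15) -> bool:
    """Keep the utterance only if every pair of ASR outputs agrees closely."""
    return all(pairwise_wer(a, b) < max_wer
               for i, a in enumerate(transcripts)
               for b in transcripts[i + 1:])
```

Two transcripts that differ in one word out of four score 0.25 and are rejected under the 15% threshold, while identical transcripts score 0 and pass.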
Model Scaling
| Component | CosyVoice 2 | CosyVoice 3 |
|---|---|---|
| LLM parameters | 0.5B | 0.5B base / 1.5B |
| CFM parameters | 100M (U-Net) | 300M (DiT) |
| Training data | ~10K hours | ~1M hours |
| Languages | ZH, EN | 9 languages, 18+ dialects |
Voice Cloning Mechanism
CosyVoice 3 supports zero-shot voice cloning by passing a reference audio sample alongside the text. The model encodes the reference through its own pipeline (the Soniqo macOS port uses a CAM++ speaker encoder for this step).
How Cloning Works
Reference audio (10-20 seconds)
→ Speaker encoder (e.g., CAM++ in Soniqo port)
→ 192-dim speaker embedding
→ Affine projection (192 → 80)
→ Conditions DiT flow matching via AdaLN
The speaker embedding conditions the DiT flow matching decoder, not the LLM. This means the LLM generates content-appropriate tokens, and the voice timbre is injected during the mel-spectrogram synthesis stage.
| Property (Soniqo port) | Value |
|---|---|
| Model | CAM++ (Context-Aware Masking++) |
| Embedding | 192 dimensions |
| Backend | CoreML (Neural Engine) |
| Size | ~14 MB |
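The 192 → 80 projection in the cloning steps above is a plain affine map. The sketch below uses random weights as a stand-in for the learned layer and L2-normalizes the embedding first. Both the normalization step and the function names are assumptions made for illustration.

```python
import math
import random
from typing import List, Sequence


def l2_normalize(vec: Sequence[float]) -> List[float]:
    """Scale a vector to unit length (left unchanged if it is all zeros)."""
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]


def affine_project(vec: Sequence[float], weight: Sequence[Sequence[float]],
                   bias: Sequence[float]) -> List[float]:
    """Compute W @ vec + b with W given as a list of rows."""
    return [sum(w * v for w, v in zip(row, vec)) + b
            for row, b in zip(weight, bias)]


rng = random.Random(0)
speaker_embedding = [rng.gauss(0.0, 1.0) for _ in range(192)]  # stand-in for CAM++ output
weight = [[rng.gauss(0.0, 0.05) for _ in range(192)] for _ in range(80)]  # stand-in weights
bias = [0.0] * 80

unit = l2_normalize(speaker_embedding)
condition = affine_project(unit, weight, bias)  # 80-dim vector fed to the DiT
```

The result has the same width as the 80-band mel-spectrogram, which is presumably why 80 was chosen as the conditioning dimension.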
Control Tokens
Internally, CosyVoice 3 uses <|fl_*|> tokens to switch between modes. The API methods (inference_zero_shot, inference_instruct2, etc.) emit these automatically — users never write them by hand:
| Token | Mode | API Method |
|---|---|---|
| `<\|fl_speaker_clone\|>` | Zero-shot voice cloning | inference_zero_shot() |
| `<\|fl_speaker_instruct\|>` | Instruction-only synthesis | inference_instruct2() with instruction text |
| `<\|fl_speaker_instruct2\|>` | Instruction + cloning combined | inference_instruct2() with --voice-sample |
| `<\|fl_save_speaker\|>` | Persist speaker embedding | add_zero_shot_spk() |
Cross-Lingual Voice Cloning
Because the tokenizer was trained on multiple languages, CosyVoice 3 can clone a voice from one language and synthesize speech in another — e.g., clone a Japanese speaker and generate English or Chinese audio with the same voice characteristics. For Japanese synthesis, the text must be transcribed to katakana first.
Instruction Control and Pronunciation Inpainting
Instruction Tags
CosyVoice 3 supports natural language instructions passed via the instruct_text parameter. The instruction is placed before the <|endofprompt|> separator:
instruct_text = "Speak cheerfully and quickly <|endofprompt|>"
→ LLM conditions on "cheerful" and "fast" prosody
→ DiT applies corresponding style conditioning
→ tts_text argument contains the actual text to speak
Alternatively, inline style tags like [laugh] and [breath] can be embedded directly in the text, supported in both inference_cross_lingual and inference_zero_shot modes.
Supported instruction dimensions:
- Language/dialect selection
- Emotion (happy, sad, angry, excited, gentle)
- Speaking rate (slow, fast)
- Volume (loud, soft)
- Style (broadcast, conversational, narrative)
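A small helper can keep the separator handling in one place. The function name is hypothetical and the default system prefix is simply copied from the Quick Start strings later in this article. Neither is part of the CosyVoice API.

```python
END_OF_PROMPT = "<|endofprompt|>"


def build_instruct_text(instruction: str,
                        system_prefix: str = "You are a helpful assistant.") -> str:
    """Place the instruction before the <|endofprompt|> separator."""
    instruction = instruction.strip()
    if not instruction.endswith((".", "!", "?", "。")):
        instruction += "."  # close the sentence before the separator
    return f"{system_prefix} {instruction}{END_OF_PROMPT}"


prompt = build_instruct_text("Speak cheerfully and fast")
```

The returned string has the same shape as the second argument to `inference_instruct2` in the Quick Start. The text to be spoken is passed separately.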
Pronunciation Inpainting
A production-oriented feature: CosyVoice 3 supports pronunciation inpainting, in which the target character is replaced by its Chinese Pinyin, written as a bracketed initial and final:
Input: "高管也通过电话、短信、微信等方式对报道[j][ǐ]予好评。"
Here [j][ǐ] takes the place of 给 in 给予 and forces the reading jǐ. English words can be replaced with CMU phonemes in the same way. This gives fine-grained control over rare words, proper nouns, and technical terms without retraining.
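Building such inputs by hand is error-prone, so a text preprocessor might do it programmatically. The helper below is a hypothetical sketch that matches the example input above, where the bracketed Pinyin takes the place of the character. It is not part of the CosyVoice API.

```python
def inpaint_pinyin(text: str, word: str, char_index: int,
                   initial: str, final: str) -> str:
    """Replace one character of the first occurrence of `word` with
    bracketed Pinyin, e.g. 给 -> [j][ǐ]."""
    start = text.index(word)  # raises ValueError if the word is absent
    pos = start + char_index
    return text[:pos] + f"[{initial}][{final}]" + text[pos + 1:]


fixed = inpaint_pinyin("对报道给予好评。", "给予", 0, "j", "ǐ")
```

Locating the character through its containing word (给予) rather than on its own avoids overriding other occurrences of a polyphonic character that should keep their default reading.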
Bi-Streaming Inference
CosyVoice 3 supports both text-in streaming and audio-out streaming, achieving 150ms first-chunk latency.
User types text character by character
→ LLM generates speech tokens incrementally
→ DiT flow matching processes in chunks
→ HiFi-GAN outputs audio frames as they're ready
→ Audio plays back with ~150ms startup delay
The streaming architecture uses:
- KV cache: cached transformer key/value states avoid recomputation
- SDPA (Scaled Dot-Product Attention): efficient attention for incremental decoding
- Chunk-aware flow matching: processes audio in overlapping windows
- Lookahead mechanism: the pre_lookahead_len parameter controls the quality/latency tradeoff
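The interaction between chunking and lookahead can be illustrated with a small generator. This is an assumption about how such a scheme might behave, not the library's implementation: the function name, `chunk_len` and `lookahead` are hypothetical stand-ins, with `lookahead` playing the role that `pre_lookahead_len` is described as having.

```python
from typing import Iterable, Iterator, List, Tuple


def stream_chunks(token_iter: Iterable[int], chunk_len: int = 25,
                  lookahead: int = 3) -> Iterator[Tuple[List[int], List[int]]]:
    """Yield (chunk, future_context) pairs as tokens arrive.

    A chunk is released only once `lookahead` extra tokens are buffered, so a
    larger lookahead gives the decoder more context but delays the first audio.
    """
    buffer: List[int] = []
    for tok in token_iter:
        buffer.append(tok)
        if len(buffer) >= chunk_len + lookahead:
            yield buffer[:chunk_len], buffer[chunk_len:chunk_len + lookahead]
            buffer = buffer[chunk_len:]  # keep lookahead tokens for the next chunk
    while buffer:  # end of stream: flush whatever is left
        yield buffer[:chunk_len], buffer[chunk_len:chunk_len + lookahead]
        buffer = buffer[chunk_len:]


chunks = list(stream_chunks(range(10), chunk_len=4, lookahead=2))
```

In this sketch the first chunk cannot be emitted until `chunk_len + lookahead` tokens exist, so both parameters add directly to the startup delay. At 25 Hz each buffered token corresponds to 40 ms of audio.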
Deployment Backends
| Backend | Target | Speedup | Platform | CV3 Support |
|---|---|---|---|---|
| TensorRT | Flow decoder ODE solver | ~4× | GPU | ✅ |
| vLLM | LLM module | 2-3× batch | GPU | ✅ |
| TensorRT-LLM | Full pipeline | ~4× LLM | GPU | ✅ |
Note: TorchScript JIT is not available for CosyVoice 3 due to DiT architecture incompatibility.
Benchmark Results
Seed-TTS Eval
| Model | CER (ZH) ↓ | Speaker Sim (ZH) ↑ | WER (EN) ↓ | Speaker Sim (EN) ↑ |
|---|---|---|---|---|
| Human | 1.26 | 75.5 | 2.14 | 73.4 |
| Seed-TTS | 1.12 | 79.6 | 2.25 | 76.2 |
| CosyVoice 2 (0.5B) | 1.45 | 75.7 | 2.57 | 65.9 |
| Fun-CosyVoice3 (0.5B) | 1.21 | 78.0 | 2.24 | 71.8 |
| Fun-CosyVoice3 + RL | 0.81 | 77.4 | 1.68 | 69.5 |
Hard Test Set
| Model | CER ↓ | Speaker Sim ↑ |
|---|---|---|
| CosyVoice 2 | 6.83 | 72.4 |
| CosyVoice 3 (0.5B) | 6.71 | 75.8 |
| CosyVoice 3 + RL | 5.44 | 75.0 |
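The hard test set includes challenging in-the-wild samples with background noise, varied recording conditions, and non-standard speech. CosyVoice 3’s improved tokenizer and DiT backbone help here, though unevenly: the base 0.5B model improves CER only slightly over CosyVoice 2 (6.83 → 6.71) while gaining 3.4 points of speaker similarity, and DiffRO post-training then brings CER down to 5.44 at a small cost in similarity (75.0).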
Comparison with CosyVoice 2
| Feature | CosyVoice 2 | CosyVoice 3 |
|---|---|---|
| Speech tokenizer | SenseVoice + FSQ | MinMo + FSQ (multi-task supervised) |
| Token rate | 25 Hz | 25 Hz |
| CFM backbone | U-Net | DiT (Transformer) |
| LM parameters | 0.5B | 0.5B / 1.5B |
| CFM parameters | 100M | 300M |
| Training data | ~10K hours | ~1M hours |
| Languages | ZH, EN | 9 languages + 18+ dialects |
| Post-training | None | DiffRO (differentiable RL) |
| Pronunciation control | Limited | Pinyin + CMU phoneme inpainting |
| Silent token filtering | No | Yes (11 tokens) |
| vLLM support | Yes | Yes |
| Streaming latency | ~150ms | ~150ms |
| License | Apache 2.0 | Apache 2.0 |
Deployment Requirements
CosyVoice 3 is designed for GPU inference but the 0.5B variant can run on consumer hardware.
| Setup | Minimum | Recommended |
|---|---|---|
| GPU VRAM | 4GB (0.5B 4-bit) | 8GB+ |
| RAM | 8GB | 16GB |
| Storage | 5GB (model weights) | 10GB+ |
| Runtime | PyTorch + CUDA | vLLM or TensorRT |
| CPU inference | Possible (slow) | Not recommended |
Quick Start
import sys
sys.path.append('third_party/Matcha-TTS')  # Matcha-TTS is vendored as a submodule
from cosyvoice.cli.cosyvoice import AutoModel
import torchaudio

cosyvoice = AutoModel(model_dir='pretrained_models/Fun-CosyVoice3-0.5B')

# Zero-shot voice cloning (prompt text with <|endofprompt|> separator)
for i, j in enumerate(cosyvoice.inference_zero_shot(
        'Hello, this is a cloned voice.',
        'You are a helpful assistant.<|endofprompt|>希望你以后能够做的比我还好呦。',
        './asset/zero_shot_prompt.wav', stream=False)):
    torchaudio.save('zero_shot_{}.wav'.format(i), j['tts_speech'], cosyvoice.sample_rate)

# Instruction-controlled synthesis with voice cloning
for i, j in enumerate(cosyvoice.inference_instruct2(
        'Welcome to the show.',
        'You are a helpful assistant. Speak cheerfully and fast.<|endofprompt|>',
        './asset/zero_shot_prompt.wav', stream=False)):
    torchaudio.save('instruct_{}.wav'.format(i), j['tts_speech'], cosyvoice.sample_rate)

# Pronunciation inpainting (hotfix)
for i, j in enumerate(cosyvoice.inference_zero_shot(
        '报道[j][ǐ]予好评。',
        'You are a helpful assistant.<|endofprompt|>希望你以后能够做的比我还好呦。',
        './asset/zero_shot_prompt.wav', stream=False)):
    torchaudio.save('hotfix_{}.wav'.format(i), j['tts_speech'], cosyvoice.sample_rate)
Speech-Swift (macOS / Apple Silicon)
The model is available through soniqo/speech-swift as a Swift package with MLX and CoreML backends:
brew install soniqo/tap/speech
speech speak "Hello world" --voice-sample reference.wav --cosy-instruct "cheerful"
On an M2 Max, the 4-bit quantized model runs with RTF ~0.5 (faster than real-time).
What CosyVoice 3 Means for TTS
- Data scaling works for speech synthesis. CosyVoice 3 suggests that scaling laws apply to TTS: the jump from 10K to 1M hours produced consistent quality improvements.
- Speech tokenizers are the bottleneck. The shift from ASR-only tokenizer supervision (CosyVoice 2) to multi-task supervision (CosyVoice 3) directly improved prosody naturalness. Future TTS models will likely use increasingly sophisticated tokenizer training.
- Differentiable RL is a pragmatic innovation. DiffRO avoids the computational overhead of traditional RL for TTS (running CFM + vocoder for every sample) by operating on tokens. This approach should generalize to other discrete-token speech models.
- Chinese dialect support is a differentiator. CosyVoice 3’s 18+ Chinese dialect coverage addresses a real market need that most Western TTS systems ignore entirely.
- The gap with top commercial models is narrowing. On Seed-TTS Eval, CosyVoice 3 + RL edges past Seed-TTS (a proprietary ByteDance model) on content consistency (CER 0.81 vs 1.12, WER 1.68 vs 2.25), though speaker similarity still lags. Open-source TTS continues to close the gap.
References
- CosyVoice 3 Technical Report (arXiv 2505.17589)
- GitHub Repository — FunAudioLLM/CosyVoice
- HuggingFace Model — Fun-CosyVoice3-0.5B-2512
- CosyVoice 3 Demo Page
- CosyVoice: A Scalable Multilingual Zero-Shot TTS (v1, arXiv 2407.05407)
- CosyVoice 2: Scalable Streaming Speech Synthesis (arXiv 2412.10117)
- Soniqo Docs — CosyVoice 3 on Apple Silicon
- MinMo: A Multimodal LLM for Seamless Voice Interaction (arXiv 2501.06282)
