CosyVoice 3: Multilingual Zero-Shot TTS

Alibaba’s Tongyi Lab released CosyVoice 3 in May 2025 as the latest iteration of their open-source speech synthesis family. The technical report (arXiv:2505.17589) describes a model that scales training data from 10,000 to 1,000,000 hours, increases parameters from 0.5B to 1.5B, and introduces a supervised multi-task speech tokenizer plus a novel differentiable reward optimization (DiffRO) method for post-training.

This article breaks down the architecture, the tokenizer design, the three-stage pipeline, the DiffRO post-training method, and what makes CosyVoice 3 different from its predecessors.

Why CosyVoice 3 Matters

CosyVoice 3 is Alibaba’s answer to “in-the-wild” speech generation — handling diverse domains (e-commerce, navigation, finance, education), multiple languages, Chinese dialects, varied text formats, and emotional speech. It builds on CosyVoice 2’s streaming LLM + flow matching architecture but makes four fundamental upgrades:

A new supervised multi-task speech tokenizer — replaces CosyVoice 2’s SenseVoice-based tokenizer with one derived from MinMo, trained on ASR, emotion recognition, language ID, audio event detection, and speaker analysis
DiffRO (Differentiable Reward Optimization) — a post-training method that optimizes speech tokens directly via backprop instead of expensive RL loops
Data scaling — 10K hours → 1M hours across 9 languages and 18+ Chinese dialects
Model scaling — LM 0.5B → 1.5B, CFM renderer 100M → 300M with DiT backbone

Metric	CosyVoice 2 (0.5B)	CosyVoice 3 (0.5B base)	CosyVoice 3 (0.5B + RL)
Seed-TTS Eval CER (ZH)	1.45%	1.21%	0.81%
Seed-TTS Eval WER (EN)	2.57%	2.24%	1.68%
Speaker Similarity (ZH)	75.7%	78.0%	77.4%
Speaker Similarity (EN)	65.9%	71.8%	69.5%

The Three-Stage Pipeline

CosyVoice 3 follows a three-stage architecture that has become the dominant pattern in modern TTS:

Text → LLM Token Generator → DiT Flow Matching → HiFi-GAN Vocoder → Audio

Stage 1: LLM — Speech Token Generation

The language model is a Qwen2.5-0.5B backbone that generates discrete FSQ (Finite Scalar Quantization) speech tokens autoregressively.

Input: text tokens + speaker embedding + instruction tokens
  → Qwen2.5-0.5B (24 layers, 896 hidden dim, GQA 14/2)
  → Autoregressive prediction of FSQ speech tokens at 25 Hz
  → Output: discrete token sequence [t₁, t₂, ..., tₙ]

LLM architecture details:

Parameter	Value
Backbone	Qwen2.5-0.5B
Layers	24
Hidden dimension	896
Query heads	14
Key/Value heads	2 (GQA)
FSQ vocabulary	6,561
Quantization	4-bit (inference)
Token rate	25 Hz
Sampling	Top-k=25, Top-p=0.8, RAS enabled

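Repetition Aware Sampling (RAS), introduced in VALL-E 2, checks how often the freshly sampled token already occurs in a recent window (the last 10 generated positions here). When the token repeats too often, it is resampled from the full distribution instead of the truncated top-k/top-p set. This curbs the looping artifacts that destabilize long-form generation.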
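A minimal sketch of repetition-aware sampling in plain Python, assuming the VALL-E 2 formulation: draw a candidate with top-k/top-p sampling, count its occurrences in the last `win_size` tokens, and fall back to sampling from the full distribution when the count reaches a threshold. The threshold `tau_r=0.1` and the helper names are illustrative assumptions, not values taken from the CosyVoice 3 code.

```python
import random

def nucleus_sample(probs, top_p=0.8, top_k=25, rng=random):
    """Sample a token id from the top-p set, capped at top_k entries."""
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, mass = [], 0.0
    for i in ranked:
        if mass >= top_p or len(kept) >= top_k:
            break
        kept.append(i)
        mass += probs[i]
    return rng.choices(kept, weights=[probs[i] for i in kept], k=1)[0]

def ras_sample(probs, history, win_size=10, tau_r=0.1, rng=random):
    """Repetition-aware sampling; tau_r is an assumed threshold."""
    candidate = nucleus_sample(probs, rng=rng)
    repeats = history[-win_size:].count(candidate)
    if repeats >= win_size * tau_r:
        # Too repetitive: resample from the full, untruncated distribution.
        candidate = rng.choices(range(len(probs)), weights=probs, k=1)[0]
    return candidate
```

With an empty history the function behaves like ordinary nucleus sampling; the fallback only triggers once the candidate has already appeared in the recent window.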

Silent token filtering: CosyVoice 3 filters up to 5 consecutive silent FSQ tokens (11 specific token IDs are classified as silence), preventing long pauses in generated speech.
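One way to read this rule is as a cap on runs of silence: once five silent tokens in a row have been emitted, further silent tokens are dropped until a non-silent token appears. The sketch below assumes that reading. The silent token IDs are placeholders, since the actual 11 IDs are not listed here.

```python
def cap_silence_runs(tokens, silent_ids, max_run=5):
    """Drop silent tokens beyond max_run consecutive occurrences."""
    kept, run = [], 0
    for tok in tokens:
        if tok in silent_ids:
            run += 1
            if run > max_run:
                continue  # skip the excess silence
        else:
            run = 0  # any non-silent token resets the counter
        kept.append(tok)
    return kept

# Placeholder IDs for illustration only.
SILENT_IDS = {1, 2}
filtered = cap_silence_runs([7, 1, 1, 1, 1, 1, 1, 1, 8], SILENT_IDS)
```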

Stage 2: DiT Flow Matching — Mel-Spectrogram Synthesis

CosyVoice 3 replaces the U-Net-style flow matching encoder from CosyVoice 2 with a Diffusion Transformer (DiT) backbone. This is a key architectural change — DiT scales better with compute and eliminates the need for a separate text encoder and length regularization module.

Discrete speech tokens [t₁, t₂, ..., tₙ]
  → Token embeddings
  → DiT (22 layers, 1024 dim, 16 attention heads)
  → AdaLN conditioning (speaker embedding, CFG guidance)
  → Euler ODE solver (10 steps)
  → Mel-spectrogram (80-band)

DiT parameters:

Parameter	Value
Layers	22
Dimension	1024
Attention heads	16
Conditioning	AdaLN (Adaptive Layer Norm)
Parameters	300M
ODE solver	Euler, 10 steps
CFG rate	0.7
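To make the solver settings concrete, here is a toy sketch of a fixed-step Euler integration with classifier-free guidance. It assumes the common guidance form `(1 + w) * v_cond - w * v_uncond`; the velocity field is a stand-in for the DiT, and all function names are hypothetical.

```python
def cfg_velocity(v_cond, v_uncond, cfg_rate=0.7):
    """Blend conditional and unconditional velocities (assumed CFG form)."""
    return [(1.0 + cfg_rate) * c - cfg_rate * u for c, u in zip(v_cond, v_uncond)]

def euler_solve(x0, velocity_fn, n_steps=10):
    """Integrate dx/dt = velocity_fn(x, t) from t=0 to t=1 with Euler steps."""
    x, dt = list(x0), 1.0 / n_steps
    for step in range(n_steps):
        t = step * dt
        v = velocity_fn(x, t)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x

# Toy usage: a straight-line field that carries the start point to a target.
# In a real model, velocity_fn would run the DiT with and without
# conditioning and merge the two outputs with cfg_velocity.
target = [1.0, -2.0]

def straight_line_field(x, t):
    return [(tg - xi) / (1.0 - t) for tg, xi in zip(target, x)]

mel = euler_solve([0.3, 0.3], straight_line_field)
```

Ten steps means ten evaluations of the velocity network per chunk (twice that if the conditional and unconditional branches are run separately), which is why the step count is a direct latency knob.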

The DiT architecture uses AdaLN (Adaptive Layer Normalization) for conditioning — speaker embeddings, style instructions, and classifier-free guidance scale are injected via learned affine transformations at each transformer block. This is more parameter-efficient than cross-attention conditioning.
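As a rough illustration of AdaLN, the sketch below normalizes a feature vector and then applies a scale and shift predicted from a conditioning vector. It uses plain Python lists and the common `norm(x) * (1 + scale) + shift` form; the weight shapes and names are assumptions for the example, not the model's actual layers.

```python
import math

def layer_norm(x, eps=1e-5):
    """Normalize a vector to zero mean and unit variance."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]

def linear(vec, weight, bias):
    """Dense layer: weight has one row per output unit."""
    return [sum(w * v for w, v in zip(row, vec)) + b for row, b in zip(weight, bias)]

def ada_ln(x, cond, w_scale, b_scale, w_shift, b_shift):
    """Adaptive LayerNorm: scale and shift are functions of the condition."""
    scale = linear(cond, w_scale, b_scale)
    shift = linear(cond, w_shift, b_shift)
    return [n * (1.0 + s) + sh for n, s, sh in zip(layer_norm(x), scale, shift)]
```

The conditioning path costs only two small projections per block, compared with the extra attention layers that cross-attention conditioning would add.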

The frame rate mismatch between speech tokens (25 Hz) and mel features (higher resolution) is handled by a simple interpolation operation — no more complicated text encoders or length regulators.
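The interpolation step can be pictured as repeating each token feature to match the mel frame rate. The sketch below assumes a 50 Hz mel rate (consistent with a 24 kHz output and a 480× vocoder upsample ratio) and nearest-neighbor repetition; the actual interpolation mode is not specified here.

```python
def upsample_nearest(token_features, token_rate=25, frame_rate=50):
    """Repeat each token feature so the sequence matches the mel frame rate."""
    if frame_rate % token_rate != 0:
        raise ValueError("frame_rate must be a multiple of token_rate")
    repeat = frame_rate // token_rate
    return [feat for feat in token_features for _ in range(repeat)]

frames = upsample_nearest(["t1", "t2", "t3"])
```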

Stage 3: HiFi-GAN Vocoder — Waveform Generation

The final stage uses a Neural Source Filter (NSF) HiFi-GAN vocoder to convert mel-spectrograms to 24 kHz waveforms.

Mel-spectrogram (80-band)
  → NSF HiFi-GAN (8 harmonics, upsample ratio 480×)
  → Inverse STFT (n_fft=16, hop=4)
  → 24 kHz audio waveform

Parameter	Value
Harmonics	8
Upsample ratio	480×
ISTFT	n_fft=16, hop=4
Output sample rate	24 kHz

The Supervised Multi-Task Speech Tokenizer

CosyVoice 3’s most important innovation is its speech tokenizer. Unlike CosyVoice 2, which inserted an FSQ module into SenseVoice-Large (an ASR model), CosyVoice 3 builds its tokenizer on MinMo, a multimodal LLM trained on 1.4M+ hours of speech with SOTA performance on spoken dialogue, multilingual ASR, and emotion recognition.

Tokenizer Architecture

Input speech X
  → Voice Encoder 1 (12 Transformer blocks with RoPE)
  → FSQ module:
      - Project to D-dimensional low-rank space
      - Bounded round operation ROUND (quantize to [-K, K])
      - Project back to original dimension
      - Compute index from quantized values
  → Voice Encoder 2 + MinMo LLM (training only)
  → Multi-task supervision: ASR, LID, SER, AED, SA

The FSQ module is simpler than traditional RVQ (Residual Vector Quantization). It projects into a low-dimensional space, applies a bounded round operation for quantization, and computes a single index. This produces a 25 Hz token stream — 25 discrete tokens per second of audio.
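A small sketch of the quantize-and-index step, leaving out the learned projections. It assumes D=8 dimensions and K=1 (three levels per dimension), chosen because 3^8 = 6,561 matches the vocabulary size listed earlier. Using tanh as the bounding function is also an assumption.

```python
import math

def fsq_quantize(z, K=1):
    """Bounded round: squash each value into (-K, K), then round to an integer."""
    return [int(round(K * math.tanh(v))) for v in z]

def fsq_index(q, K=1):
    """Read the quantized vector as digits of a base-(2K+1) number."""
    base = 2 * K + 1
    return sum((qi + K) * base ** i for i, qi in enumerate(q))

codebook_size = (2 * 1 + 1) ** 8
token_id = fsq_index(fsq_quantize([-5.0, 0.1, 5.0, 0.0, 0.0, 2.0, -2.0, 0.3]))
```

Because the index is computed arithmetically, there is no codebook to learn or look up, which is the main simplification relative to RVQ.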

Multi-Task Supervision

The tokenizer is trained on ~530,000 hours of speech with five supervision signals:

Task	Label	Purpose
ASR	Text transcription	Capture phonetic content
Language ID	Language label	Enable multilingual tokens
Speech Emotion Recognition	Emotion label	Preserve paralinguistic cues
Audio Event Detection	Event labels	Detect non-speech events
Speaker Analysis	Speaker identity	Preserve voice characteristics

The key insight: by supervising the tokenizer on multiple speech understanding tasks simultaneously, the discrete tokens learn to retain semantic, paralinguistic, and speaker information — not just acoustic reconstruction. This is what enables CosyVoice 3’s improved prosody naturalness compared to CosyVoice 2.

Comparison of tokenizer approaches:

Approach	Base Model	Token Rate	Supervision
CosyVoice v1	Custom encoder	Variable	Semantic only
CosyVoice 2	SenseVoice-Large + FSQ	25 Hz	ASR only
CosyVoice 3	MinMo + FSQ	25 Hz fixed	ASR + LID + SER + AED + SA

DiffRO: Differentiable Reward Optimization

Post-training TTS models with reinforcement learning is difficult because the generated tokens must pass through the CFM model and vocoder before a reward can be computed on audio, and both downstream models are computationally expensive. Worse, the synthesized samples for a given input tend to sound very similar, which makes it hard for a reward model to separate good outputs from bad ones.

CosyVoice 3’s DiffRO avoids this entirely by operating directly on speech tokens.

How DiffRO Works

1. Train a Token2Text model (ASR-like) on speech token sequences
   → Given tokens, predict text posterior probabilities

2. For each training prompt:
   a. LLM generates candidate speech token sequences
   b. Token2Text computes reward = P(correct_text | tokens)
   c. Gumbel-Softmax makes the token sampling differentiable
   d. Backprop through the LLM to maximize reward

3. Add KL divergence on token logits to prevent drift
   from the reference model

Why Gumbel-Softmax? Speech token sampling is normally a discrete operation (argmax or categorical sampling) — not differentiable. Gumbel-Softmax replaces this with a continuous relaxation that can be annealed toward true categorical sampling during training, enabling standard backpropagation.
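For intuition, here is a plain-Python sketch of the Gumbel-Softmax relaxation: add Gumbel noise to the logits, divide by a temperature, and apply softmax. Lower temperatures push the output toward a one-hot vector. In practice this would run inside an autodiff framework; the function name and defaults are illustrative.

```python
import math
import random

def gumbel_softmax(logits, tau=1.0, rng=random):
    """Return a soft, noise-perturbed one-hot vector over the logits."""
    # Gumbel(0, 1) noise; clamp u away from 0 to keep the logs finite.
    gumbels = [-math.log(-math.log(max(rng.random(), 1e-12))) for _ in logits]
    noisy = [(l + g) / tau for l, g in zip(logits, gumbels)]
    peak = max(noisy)  # subtract the max for numerical stability
    exps = [math.exp(v - peak) for v in noisy]
    total = sum(exps)
    return [e / total for e in exps]

soft_token = gumbel_softmax([2.0, 0.5, -1.0], tau=0.5, rng=random.Random(0))
```

The soft vector can be multiplied against the token embedding table, so the Token2Text reward remains a differentiable function of the LLM logits.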

Multi-task rewards: DiffRO can incorporate multiple reward signals:

Reward	Signal Source
Content accuracy	Token2Text posterior
Emotion accuracy	Emotion classifier on tokens
Audio quality	MOS prediction model
Instruction adherence	Style classifier

The KL divergence term keeps the post-trained model from collapsing or drifting too far:

Loss = -Reward + β × KL(π_post || π_ref)

Where β controls the strength of the KL penalty.
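The objective can be written out for categorical token distributions as in the sketch below. Averaging the KL term over token positions and the value `beta=0.1` are assumptions made for the example.

```python
import math

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two categorical distributions over the same vocabulary."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q) if pi > 0)

def diffro_loss(reward, post_probs, ref_probs, beta=0.1):
    """Loss = -reward + beta * mean per-position KL(post || ref)."""
    kl = sum(kl_divergence(p, q) for p, q in zip(post_probs, ref_probs)) / len(post_probs)
    return -reward + beta * kl

ref = [[0.5, 0.5], [0.9, 0.1]]
post = [[0.6, 0.4], [0.8, 0.2]]
loss = diffro_loss(reward=0.7, post_probs=post, ref_probs=ref)
```

When the post-trained distribution equals the reference, the KL term vanishes and the loss reduces to the negative reward.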

Data Scaling and Multilingual Coverage

CosyVoice 3’s training data expansion from 10K to 1M hours involved a sophisticated data pipeline:

Data Pipeline

Raw in-the-wild audio (web, podcasts, video)
  → ASR transcription (multiple models)
  → Pairwise WER filtering (< 15% disagreement)
  → Forced alignment for punctuation adjustment
      (add comma if pause > 300ms, remove if < 50ms)
  → Volume normalization
  → Speech/text length ratio filtering (remove bottom 1%, top 5%)
  → Clean paired dataset
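Two of these steps are easy to sketch in plain Python: the pairwise WER agreement check between two ASR transcripts and the pause-based comma adjustment. The thresholds come from the pipeline above; the function names, the symmetric WER check, and the input format (one pause duration per word) are assumptions.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two token lists."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        cur = [i]
        for j, h in enumerate(hyp, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (r != h)))
        prev = cur
    return prev[-1]

def wer(ref, hyp):
    """Word error rate of hyp against ref."""
    ref_words, hyp_words = ref.split(), hyp.split()
    return edit_distance(ref_words, hyp_words) / max(len(ref_words), 1)

def transcripts_agree(a, b, max_wer=0.15):
    """Keep an utterance only if two ASR outputs disagree by less than max_wer."""
    return wer(a, b) < max_wer and wer(b, a) < max_wer

def adjust_commas(words, pauses_ms, add_above=300, drop_below=50):
    """Add a comma after long pauses and remove commas at very short pauses."""
    out = []
    for word, pause in zip(words, pauses_ms):
        if pause > add_above and word[-1].isalnum():
            word += ","
        elif pause < drop_below and word.endswith(","):
            word = word[:-1]
        out.append(word)
    return out

fixed = adjust_commas(["hello", "world,", "again"], [400, 10, 100])
```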

Language Coverage

Category	Coverage
Primary languages	Chinese, English, Japanese, Korean, Russian, French, German, Spanish, Italian
Chinese dialects	18+ (Guangdong/Cantonese, Minnan, Sichuan, Dongbei, Shanxi, Shanghai, Tianjin, Shandong, Ningxia, Gansu, etc.)
Text formats	Standard, normalized, inverse-normalized, mixed phoneme/character
Domains	E-commerce, navigation, finance, education, conversation, speech, singing

Self-training: An early version of CosyVoice 3 was used to generate synthetic data for rare cases (unusual text formats, edge-case pronunciations), then those generations were filtered and added to the training set.

Model Scaling

Component	CosyVoice 2	CosyVoice 3
LLM parameters	0.5B	0.5B base / 1.5B
CFM parameters	100M (U-Net)	300M (DiT)
Training data	~10K hours	~1M hours
Languages	ZH, EN	9 languages, 18+ dialects

Voice Cloning Mechanism

CosyVoice 3 supports zero-shot voice cloning by passing a reference audio sample alongside the text. The model encodes the reference through its own pipeline (the Soniqo macOS port uses a CAM++ speaker encoder for this step).

How Cloning Works

Reference audio (10-20 seconds)
  → Speaker encoder (e.g., CAM++ in Soniqo port)
  → 192-dim speaker embedding
  → Affine projection (192 → 80)
  → Conditions DiT flow matching via AdaLN

The speaker embedding conditions the DiT flow matching decoder, not the LLM. This means the LLM generates content-appropriate tokens, and the voice timbre is injected during the mel-spectrogram synthesis stage.
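The projection step amounts to a single affine map from the 192-dim speaker vector to the 80-dim conditioning space. The sketch below uses random weights and an L2 normalization step, both of which are assumptions for illustration; a trained model would supply learned parameters.

```python
import math
import random

def l2_normalize(vec):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]

def affine(vec, weight, bias):
    """Affine projection: weight has one row per output dimension."""
    return [sum(w * v for w, v in zip(row, vec)) + b for row, b in zip(weight, bias)]

rng = random.Random(0)
speaker_embedding = [rng.gauss(0.0, 1.0) for _ in range(192)]    # stand-in for encoder output
weight = [[rng.gauss(0.0, 0.05) for _ in range(192)] for _ in range(80)]  # random, not learned
bias = [0.0] * 80

speaker_condition = affine(l2_normalize(speaker_embedding), weight, bias)
```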

Property (Soniqo port)	Value
Model	CAM++ (Context-Aware Masking++)
Embedding	192 dimensions
Backend	CoreML (Neural Engine)
Size	~14 MB

Control Tokens

Internally, CosyVoice 3 uses <|fl_*|> tokens to switch between modes. The API methods (inference_zero_shot, inference_instruct2, etc.) emit these automatically — users never write them by hand:

Token	Mode	API Method
<|fl_speaker_clone|>	Zero-shot voice cloning	inference_zero_shot()
<|fl_speaker_instruct|>	Instruction-only synthesis	inference_instruct2() with instruction text
<|fl_speaker_instruct2|>	Instruction + cloning combined	inference_instruct2() with --voice-sample
<|fl_save_speaker|>	Persist speaker embedding	add_zero_shot_spk()

Cross-Lingual Voice Cloning

Because the tokenizer was trained on multiple languages, CosyVoice 3 can clone a voice from one language and synthesize speech in another — e.g., clone a Japanese speaker and generate English or Chinese audio with the same voice characteristics. For Japanese synthesis, the text must be transcribed to katakana first.

Instruction Control and Pronunciation Inpainting

Instruction Tags

CosyVoice 3 supports natural language instructions passed via the instruct_text parameter. The instruction is placed before the <|endofprompt|> separator:

instruct_text = "Speak cheerfully and quickly <|endofprompt|>"
     → LLM conditions on "cheerful" and "fast" prosody
     → DiT applies corresponding style conditioning
     → tts_text argument contains the actual text to speak
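A small helper can keep the separator handling in one place. The function below is hypothetical (it is not part of the CosyVoice API) and assumes the system-style prefix used in the Quick Start examples.

```python
END_OF_PROMPT = "<|endofprompt|>"

def build_instruct_text(instruction, system_prefix="You are a helpful assistant."):
    """Return an instruction string that ends with exactly one separator."""
    instruction = instruction.replace(END_OF_PROMPT, "").strip()
    parts = [p for p in (system_prefix.strip(), instruction) if p]
    return " ".join(parts) + END_OF_PROMPT

instruct_text = build_instruct_text("Speak cheerfully and fast.")
```

The resulting string would then be passed as the instruction argument, with the text to be spoken supplied separately.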

Alternatively, inline style tags like [laugh] and [breath] can be embedded directly in the text, supported in both inference_cross_lingual and inference_zero_shot modes.

Supported instruction dimensions:

Language/dialect selection
Emotion (happy, sad, angry, excited, gentle)
Speaking rate (slow, fast)
Volume (loud, soft)
Style (broadcast, conversational, narrative)

Pronunciation Inpainting

A production-oriented feature: CosyVoice 3 supports pronunciation inpainting, where a character in the input text is replaced by its Chinese Pinyin written in square brackets:

Input: "高管也通过电话、短信、微信等方式对报道[j][ǐ]予好评。"

Here [j][ǐ] stands in for 给 in 给予, forcing the reading jǐ rather than the more common gěi. English CMU phonemes can be used similarly. This gives fine-grained control over rare words, proper nouns, and technical terms without retraining.
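In the example sentence, [j][ǐ] occupies the position where 给 would appear in 给予. A hypothetical helper for producing such inputs might look like the sketch below; splitting the syllable into a bracketed initial and final mirrors the example, and whether other formats are accepted is not stated here.

```python
def inpaint_pinyin(text, target, initial, final, occurrence=0):
    """Replace one occurrence of `target` with bracketed pinyin parts."""
    pos = -1
    for _ in range(occurrence + 1):
        pos = text.find(target, pos + 1)
        if pos == -1:
            raise ValueError(f"{target!r} not found often enough in text")
    return text[:pos] + f"[{initial}][{final}]" + text[pos + len(target):]

tts_text = inpaint_pinyin("报道给予好评。", "给", "j", "ǐ")
```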

Bi-Streaming Inference

CosyVoice 3 supports both text-in streaming and audio-out streaming, achieving 150ms first-chunk latency.

User types text character by character
  → LLM generates speech tokens incrementally
  → DiT flow matching processes in chunks
  → HiFi-GAN outputs audio frames as they're ready
  → Audio plays back with ~150ms startup delay

The streaming architecture uses:

KV cache — cached transformer key/value states avoid recomputation
SDPA (Scaled Dot-Product Attention) — efficient attention for incremental decoding
Chunk-aware flow matching — processes audio in overlapping windows
Lookahead mechanism — pre_lookahead_len parameter controls quality/latency tradeoff
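The chunking and lookahead idea can be sketched as a generator that releases a chunk of speech tokens once enough future tokens have arrived to serve as context. The chunk size and lookahead length below are made-up values, and the function is a simplification that ignores KV caching and overlap-add.

```python
def stream_chunks(token_iter, chunk_size=25, lookahead=3):
    """Yield (chunk, lookahead_context) pairs as tokens arrive."""
    buffer = []
    for tok in token_iter:
        buffer.append(tok)
        while len(buffer) >= chunk_size + lookahead:
            yield buffer[:chunk_size], buffer[chunk_size:chunk_size + lookahead]
            buffer = buffer[chunk_size:]
    # End of stream: flush what is left, with whatever lookahead remains.
    while buffer:
        yield buffer[:chunk_size], buffer[chunk_size:chunk_size + lookahead]
        buffer = buffer[chunk_size:]

chunks = list(stream_chunks(range(10), chunk_size=4, lookahead=2))
```

A larger lookahead gives the flow matching model more future context per chunk, at the cost of waiting for more tokens before the first audio can be produced.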

Deployment Backends

Backend	Target	Speedup	Platform	CV3 Support
TensorRT	Flow decoder ODE solver	~4×	GPU	✅
vLLM	LLM module	2-3× batch	GPU	✅
TensorRT-LLM	Full pipeline	~4× LLM	GPU	✅

Note: TorchScript JIT is not available for CosyVoice 3 due to DiT architecture incompatibility.

Benchmark Results

Seed-TTS Eval

Model	CER (ZH) ↓	Speaker Sim (ZH) ↑	WER (EN) ↓	Speaker Sim (EN) ↑
Human	1.26	75.5	2.14	73.4
Seed-TTS	1.12	79.6	2.25	76.2
CosyVoice 2 (0.5B)	1.45	75.7	2.57	65.9
Fun-CosyVoice3 (0.5B)	1.21	78.0	2.24	71.8
Fun-CosyVoice3 + RL	0.81	77.4	1.68	69.5
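To put the error rates in relative terms, the short calculation below compares CosyVoice 2 with the RL post-trained CosyVoice 3 using the numbers from the table. It works out to roughly a 44% relative CER reduction on Chinese and about 35% relative WER reduction on English.

```python
def relative_reduction(before, after):
    """Fractional drop from `before` to `after`."""
    return (before - after) / before

cer_zh_gain = relative_reduction(1.45, 0.81)  # CosyVoice 2 vs CosyVoice 3 + RL
wer_en_gain = relative_reduction(2.57, 1.68)
print(f"CER (ZH): {cer_zh_gain:.1%}, WER (EN): {wer_en_gain:.1%}")
```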

Hard Test Set

Model	CER ↓	Speaker Sim ↑
CosyVoice 2	6.83	72.4
CosyVoice 3 (0.5B)	6.71	75.8
CosyVoice 3 + RL	5.44	75.0

The hard test set includes challenging in-the-wild samples with background noise, varied recording conditions, and non-standard speech. Against CosyVoice 2, the base CosyVoice 3 model gains mainly in speaker similarity (72.4 → 75.8) with a small CER improvement (6.83 → 6.71); the larger CER drop, to 5.44, comes from RL post-training.

Comparison with CosyVoice 2

Feature	CosyVoice 2	CosyVoice 3
Speech tokenizer	SenseVoice + FSQ	MinMo + FSQ (multi-task supervised)
Token rate	25 Hz	25 Hz fixed
CFM backbone	U-Net	DiT (Transformer)
LM parameters	0.5B	0.5B / 1.5B
CFM parameters	100M	300M
Training data	~10K hours	~1M hours
Languages	ZH, EN	9 languages + 18+ dialects
Post-training	None	DiffRO (differentiable RL)
Pronunciation control	Limited	Pinyin + CMU phoneme inpainting
Silent token filtering	No	Yes (11 tokens)
vLLM support	Yes	Yes
Streaming latency	~150ms	~150ms
License	Apache 2.0	Apache 2.0

Deployment Requirements

CosyVoice 3 is designed for GPU inference but the 0.5B variant can run on consumer hardware.

Setup	Minimum	Recommended
GPU VRAM	4GB (0.5B 4-bit)	8GB+
RAM	8GB	16GB
Storage	5GB (model weights)	10GB+
Runtime	PyTorch + CUDA	vLLM or TensorRT
CPU inference	Possible (slow)	Not recommended

Quick Start

import sys
sys.path.append('third_party/Matcha-TTS')
from cosyvoice.cli.cosyvoice import AutoModel
import torchaudio

cosyvoice = AutoModel(model_dir='pretrained_models/Fun-CosyVoice3-0.5B')

# Zero-shot voice cloning (prompt text with <|endofprompt|> separator)
for i, j in enumerate(cosyvoice.inference_zero_shot(
    'Hello, this is a cloned voice.',
    'You are a helpful assistant.<|endofprompt|>希望你以后能够做的比我还好呦。',
    './asset/zero_shot_prompt.wav', stream=False
)):
    torchaudio.save('zero_shot_{}.wav'.format(i), j['tts_speech'], cosyvoice.sample_rate)

# Instruction-controlled synthesis with voice cloning
for i, j in enumerate(cosyvoice.inference_instruct2(
    'Welcome to the show.',
    'You are a helpful assistant. Speak cheerfully and fast.<|endofprompt|>',
    './asset/zero_shot_prompt.wav', stream=False
)):
    torchaudio.save('instruct_{}.wav'.format(i), j['tts_speech'], cosyvoice.sample_rate)

# Pronunciation inpainting (hotfix)
for i, j in enumerate(cosyvoice.inference_zero_shot(
    '报道[j][ǐ]予好评。',
    'You are a helpful assistant.<|endofprompt|>希望你以后能够做的比我还好呦。',
    './asset/zero_shot_prompt.wav', stream=False
)):
    torchaudio.save('hotfix_{}.wav'.format(i), j['tts_speech'], cosyvoice.sample_rate)

Speech-Swift (macOS / Apple Silicon)

The model is available through soniqo/speech-swift as a Swift package with MLX and CoreML backends:

brew install soniqo/tap/speech
speech speak "Hello world" --voice-sample reference.wav --cosy-instruct "cheerful"

On an M2 Max, the 4-bit quantized model runs with RTF ~0.5 (faster than real-time).

What CosyVoice 3 Means for TTS

Data scaling works for speech synthesis. CosyVoice 3 provides strong evidence that scaling laws apply to TTS — the jump from 10K to 1M hours produced consistent quality improvements.
Speech tokenizers are the bottleneck. The shift from ASR-only tokenizer supervision (CosyVoice 2) to multi-task supervision (CosyVoice 3) directly improved prosody naturalness. Future TTS models will likely use increasingly sophisticated tokenizer training.
Differentiable RL is a pragmatic innovation. DiffRO avoids the computational overhead of traditional RL for TTS (running CFM + vocoder for every sample) by operating on tokens. This approach should generalize to other discrete-token speech models.
Chinese dialect support is a differentiator. CosyVoice 3’s 18+ Chinese dialect coverage addresses a real market need that most Western TTS systems ignore entirely.
The gap with top commercial models is narrowing. On Seed-TTS Eval, CosyVoice 3 + RL posts lower CER and WER than Seed-TTS (a proprietary ByteDance model), though speaker similarity still lags. Open-source TTS continues to close the gap.