
CosyVoice 3: Multilingual Zero-Shot TTS

A deep technical exploration of Alibaba's CosyVoice 3: the supervised multi-task MinMo+FSQ speech tokenizer at 25 Hz, the three-stage LLM → DiT flow matching → HiFi-GAN pipeline, DiffRO differentiable reward optimization, 1M-hour data scaling across 9 languages and 18+ Chinese dialects, and zero-shot voice cloning.

Updated on May 31, 2026

Alibaba’s Tongyi Lab released CosyVoice 3 in May 2025 as the latest iteration of their open-source speech synthesis family. The technical report (arXiv:2505.17589) describes a model that scales training data from 10,000 to 1,000,000 hours, increases parameters from 0.5B to 1.5B, and introduces a supervised multi-task speech tokenizer plus a novel differentiable reward optimization (DiffRO) method for post-training.

This article breaks down the architecture, the tokenizer design, the three-stage pipeline, the DiffRO post-training method, and what makes CosyVoice 3 different from its predecessors.


Why CosyVoice 3 Matters

CosyVoice 3 is Alibaba’s answer to “in-the-wild” speech generation — handling diverse domains (e-commerce, navigation, finance, education), multiple languages, Chinese dialects, varied text formats, and emotional speech. It builds on CosyVoice 2’s streaming LLM + flow matching architecture but makes four fundamental upgrades:

  1. A new supervised multi-task speech tokenizer — replaces CosyVoice 2’s SenseVoice-based tokenizer with one derived from MinMo, trained on ASR, emotion recognition, language ID, audio event detection, and speaker analysis
  2. DiffRO (Differentiable Reward Optimization) — a post-training method that optimizes speech tokens directly via backprop instead of expensive RL loops
  3. Data scaling — 10K hours → 1M hours across 9 languages and 18+ Chinese dialects
  4. Model scaling — LM 0.5B → 1.5B, CFM renderer 100M → 300M with DiT backbone

| Metric | CosyVoice 2 (0.5B) | CosyVoice 3 (0.5B base) | CosyVoice 3 (0.5B + RL) |
| --- | --- | --- | --- |
| Seed-TTS Eval CER (ZH) | 1.45% | 1.21% | 0.81% |
| Seed-TTS Eval WER (EN) | 2.57% | 2.24% | 1.68% |
| Speaker Similarity (ZH) | 75.7% | 78.0% | 77.4% |
| Speaker Similarity (EN) | 65.9% | 71.8% | 69.5% |

The Three-Stage Pipeline

CosyVoice 3 follows a three-stage architecture that has become the dominant pattern in modern TTS:

Text → LLM Token Generator → DiT Flow Matching → HiFi-GAN Vocoder → Audio

Stage 1: LLM — Speech Token Generation

The language model is a Qwen2.5-0.5B backbone that generates discrete FSQ (Finite Scalar Quantization) speech tokens autoregressively.

Input: text tokens + speaker embedding + instruction tokens
  → Qwen2.5-0.5B (24 layers, 896 hidden dim, GQA 14/2)
  → Autoregressive prediction of FSQ speech tokens at 25 Hz
  → Output: discrete token sequence [t₁, t₂, ..., tₙ]

LLM architecture details:

| Parameter | Value |
| --- | --- |
| Backbone | Qwen2.5-0.5B |
| Layers | 24 |
| Hidden dimension | 896 |
| Query heads | 14 |
| Key/Value heads | 2 (GQA) |
| FSQ vocabulary | 6,561 |
| Quantization | 4-bit (inference) |
| Token rate | 25 Hz |
| Sampling | Top-k=25, Top-p=0.8, RAS enabled |

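Repetition Aware Sampling (RAS), introduced in VALL-E 2, checks how often the newly sampled token already occurs in the last 10 generated positions. When it repeats too often, the token is resampled from the full distribution instead of the truncated top-k/top-p one. This curbs repetitive audio artifacts and is critical for output stability in long-form generation.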
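The sampling settings in the table can be sketched in a few lines of plain Python. This is an illustration of the repetition-aware idea from VALL-E 2, not the CosyVoice implementation: the function names and the repetition threshold `tau` are assumptions, while top-k=25, top-p=0.8 and the 10-token window come from the text above.

```python
import random
from typing import List, Sequence


def nucleus_sample(probs: Sequence[float], top_k: int, top_p: float,
                   rng: random.Random) -> int:
    """Sample from the highest-ranked tokens, capped at top_k and cut off
    once their cumulative probability reaches top_p."""
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:top_k]
    kept: List[int] = []
    cumulative = 0.0
    for token in ranked:
        kept.append(token)
        cumulative += probs[token]
        if cumulative >= top_p:
            break
    return rng.choices(kept, weights=[probs[t] for t in kept], k=1)[0]


def ras_sample(probs: Sequence[float], history: Sequence[int], rng: random.Random,
               top_k: int = 25, top_p: float = 0.8,
               window: int = 10, tau: float = 0.1) -> int:
    """Repetition-aware sampling; `tau` is an assumed threshold."""
    token = nucleus_sample(probs, top_k, top_p, rng)
    repeats = sum(1 for t in history[-window:] if t == token)
    if repeats >= window * tau:
        # The candidate already repeats too often in the recent window:
        # resample from the full, untruncated distribution instead.
        token = rng.choices(range(len(probs)), weights=probs, k=1)[0]
    return token
```

With a sharply peaked distribution and an empty history, the sketch behaves like ordinary nucleus sampling. Once the dominant token fills the window, lower-probability tokens get a chance again.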

Silent token filtering: CosyVoice 3 filters up to 5 consecutive silent FSQ tokens (11 specific token IDs are classified as silence), preventing long pauses in generated speech.
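One reading of this rule is that a run of silent tokens is capped at five and any further silent tokens in the same run are dropped. The sketch below implements that reading. The silent token IDs in the usage line are placeholders, since the article does not list the 11 real IDs, and the function name is hypothetical.

```python
from typing import Iterable, List, Set


def cap_silence_runs(tokens: Iterable[int], silent_ids: Set[int],
                     max_run: int = 5) -> List[int]:
    """Keep at most `max_run` consecutive silent tokens; drop the rest of the run."""
    kept: List[int] = []
    run = 0
    for token in tokens:
        if token in silent_ids:
            run += 1
            if run > max_run:
                continue  # run is already at the cap, skip this silent token
        else:
            run = 0  # any non-silent token resets the run counter
        kept.append(token)
    return kept


# Placeholder IDs for illustration only; not the model's real silence tokens.
EXAMPLE_SILENT_IDS = {1, 2}
filtered = cap_silence_runs([7] + [1] * 8 + [9] + [2] * 3, EXAMPLE_SILENT_IDS)
```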

Stage 2: DiT Flow Matching — Mel-Spectrogram Synthesis

CosyVoice 3 replaces the U-Net-style flow matching encoder from CosyVoice 2 with a Diffusion Transformer (DiT) backbone. This is a key architectural change — DiT scales better with compute and eliminates the need for a separate text encoder and length regularization module.

Discrete speech tokens [t₁, t₂, ..., tₙ]
  → Token embeddings
  → DiT (22 layers, 1024 dim, 16 attention heads)
  → AdaLN conditioning (speaker embedding, CFG guidance)
  → Euler ODE solver (10 steps)
  → Mel-spectrogram (80-band)

DiT parameters:

| Parameter | Value |
| --- | --- |
| Layers | 22 |
| Dimension | 1024 |
| Attention heads | 16 |
| Conditioning | AdaLN (Adaptive Layer Norm) |
| Parameters | 300M |
| ODE solver | Euler, 10 steps |
| CFG rate | 0.7 |

The DiT architecture uses AdaLN (Adaptive Layer Normalization) for conditioning — speaker embeddings, style instructions, and classifier-free guidance scale are injected via learned affine transformations at each transformer block. This is more parameter-efficient than cross-attention conditioning.
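To make the "Euler, 10 steps" and "CFG rate 0.7" entries concrete, here is a toy sketch of an Euler ODE solver with classifier-free guidance. The guidance formula `(1 + w) * v_cond - w * v_uncond` is one common formulation and is an assumption here, and the two velocity fields are stand-ins for the DiT, which in the real model predicts the velocity from the mel state, time step and conditioning.

```python
from typing import Callable, List

Velocity = Callable[[List[float], float], List[float]]


def euler_cfg_sample(x: List[float], v_cond: Velocity, v_uncond: Velocity,
                     steps: int = 10, cfg_rate: float = 0.7) -> List[float]:
    """Integrate dx/dt = v(x, t) from t=0 to t=1 with fixed-size Euler steps."""
    dt = 1.0 / steps
    for i in range(steps):
        t = i * dt
        vc, vu = v_cond(x, t), v_uncond(x, t)
        # Classifier-free guidance: push past the conditional velocity,
        # away from the unconditional one (assumed formulation).
        v = [(1 + cfg_rate) * c - cfg_rate * u for c, u in zip(vc, vu)]
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x


def toward(target: float) -> Velocity:
    """Toy velocity field that carries any point to `target` at t=1."""
    return lambda x, t: [(target - xi) / (1.0 - t) for xi in x]


sample = euler_cfg_sample([0.3, -2.0], toward(1.0), toward(0.0))
```

In this toy setup the guided sample lands near (1 + 0.7) · 1 − 0.7 · 0 = 1.7 rather than at the conditional target 1.0, which shows how the guidance rate extrapolates beyond the conditional prediction.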

The frame rate mismatch between speech tokens (25 Hz) and mel features (higher resolution) is handled by a simple interpolation operation — no more complicated text encoders or length regulators.
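A minimal sketch of that interpolation, assuming nearest-neighbour repetition and a 50 Hz mel frame rate. The 50 Hz figure is inferred from the vocoder numbers in this article (24,000 samples per second divided by a 480× upsample ratio); the interpolation mode and the function name are assumptions.

```python
from typing import List, Sequence, TypeVar

T = TypeVar("T")


def upsample_nearest(frames: Sequence[T], src_rate: int = 25,
                     dst_rate: int = 50) -> List[T]:
    """Stretch a frame sequence from src_rate to dst_rate by repeating frames."""
    n_out = len(frames) * dst_rate // src_rate
    last = len(frames) - 1
    # Each output frame copies the source frame covering the same instant.
    return [frames[min(i * src_rate // dst_rate, last)] for i in range(n_out)]


mel_rate_hz = 24_000 // 480  # sample rate / vocoder upsample ratio
stretched = upsample_nearest(["t1", "t2", "t3"], src_rate=25, dst_rate=mel_rate_hz)
```

In practice the frames would be token embedding vectors rather than strings, and a learned or linear interpolation could be used instead of plain repetition.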

Stage 3: HiFi-GAN Vocoder — Waveform Generation

The final stage uses a Neural Source Filter (NSF) HiFi-GAN vocoder to convert mel-spectrograms to 24 kHz waveforms.

Mel-spectrogram (80-band)
  → NSF HiFi-GAN (8 harmonics, upsample ratio 480×)
  → Inverse STFT (n_fft=16, hop=4)
  → 24 kHz audio waveform

| Parameter | Value |
| --- | --- |
| Harmonics | 8 |
| Upsample ratio | 480× |
| ISTFT | n_fft=16, hop=4 |
| Output sample rate | 24 kHz |

The Supervised Multi-Task Speech Tokenizer

CosyVoice 3’s most important innovation is its speech tokenizer. Unlike CosyVoice 2, which inserted an FSQ module into SenseVoice-Large (an ASR model), CosyVoice 3 builds its tokenizer on MinMo, a multimodal LLM trained on 1.4M+ hours of speech that reports state-of-the-art results on spoken dialogue, multilingual ASR, and emotion recognition.

Tokenizer Architecture

Input speech X
  → Voice Encoder 1 (12 Transformer blocks with RoPE)
  → FSQ module:
      - Project to D-dimensional low-rank space
      - Bounded round operation ROUND (quantize to [-K, K])
      - Project back to original dimension
      - Compute index from quantized values
  → Voice Encoder 2 + MinMo LLM (training only)
  → Multi-task supervision: ASR, LID, SER, AED, SA

The FSQ module is simpler than traditional RVQ (Residual Vector Quantization). It projects into a low-dimensional space, applies a bounded round operation for quantization, and computes a single index. This produces a 25 Hz token stream — 25 discrete tokens per second of audio.
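The quantize-and-index step can be sketched as follows. The choice of D = 8 dimensions with K = 1 (three levels per dimension) is an assumption: it is consistent with the 6,561-entry FSQ vocabulary listed earlier, since 3^8 = 6,561, but the article does not state D or K. Using tanh as the bounding function is also an assumption, and the learned projections before and after quantization are omitted.

```python
import math
from typing import List, Sequence


def fsq_quantize(z: Sequence[float], k: int = 1) -> List[int]:
    """Bound each dimension to (-k, k) with tanh, then round to an integer level."""
    return [round(k * math.tanh(v)) for v in z]


def fsq_index(q: Sequence[int], k: int = 1) -> int:
    """Read the quantized vector as digits of a base-(2k+1) number."""
    base = 2 * k + 1
    return sum((qi + k) * base ** i for i, qi in enumerate(q))


latent = [3.0, -3.0, 0.1, 0.0, 2.5, -0.2, -4.0, 1.8]  # toy 8-dim projection
levels = fsq_quantize(latent)
token_id = fsq_index(levels)
vocabulary_size = (2 * 1 + 1) ** len(latent)
```

Because every dimension is quantized independently, no codebook lookup or commitment loss is needed, which is the simplification over RVQ that the paragraph above describes.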

Multi-Task Supervision

The tokenizer is trained on ~530,000 hours of speech with five supervision signals:

| Task | Label | Purpose |
| --- | --- | --- |
| ASR | Text transcription | Capture phonetic content |
| Language ID | Language label | Enable multilingual tokens |
| Speech Emotion Recognition | Emotion label | Preserve paralinguistic cues |
| Audio Event Detection | Event labels | Detect non-speech events |
| Speaker Analysis | Speaker identity | Preserve voice characteristics |

The key insight: by supervising the tokenizer on multiple speech understanding tasks simultaneously, the discrete tokens learn to retain semantic, paralinguistic, and speaker information — not just acoustic reconstruction. This is what enables CosyVoice 3’s improved prosody naturalness compared to CosyVoice 2.

Comparison of tokenizer approaches:

| Approach | Base Model | Token Rate | Supervision |
| --- | --- | --- | --- |
| CosyVoice v1 | Custom encoder | Variable | Semantic only |
| CosyVoice 2 | SenseVoice-Large + FSQ | Variable | ASR only |
| CosyVoice 3 | MinMo + FSQ | 25 Hz fixed | ASR + LID + SER + AED + SA |

DiffRO: Differentiable Reward Optimization

Post-training TTS models with reinforcement learning is difficult because the speech tokens must pass through the CFM model and the vocoder before a reward can be computed on audio, and those downstream models are computationally expensive. Worse, samples generated for the same text and speaker tend to sound very similar, which makes it hard for a reward model to separate good outputs from bad ones.

CosyVoice 3’s DiffRO avoids this entirely by operating directly on speech tokens.

How DiffRO Works

1. Train a Token2Text model (ASR-like) on speech token sequences
   → Given tokens, predict text posterior probabilities

2. For each training prompt:
   a. LLM generates candidate speech token sequences
   b. Token2Text computes reward = P(correct_text | tokens)
   c. Gumbel-Softmax makes the token sampling differentiable
   d. Backprop through the LLM to maximize reward

3. Add KL divergence on token logits to prevent drift
   from the reference model

Why Gumbel-Softmax? Speech token sampling is normally a discrete operation (argmax or categorical sampling) — not differentiable. Gumbel-Softmax replaces this with a continuous relaxation that can be annealed toward true categorical sampling during training, enabling standard backpropagation.
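A forward-pass sketch of the Gumbel-Softmax relaxation in plain Python. It only shows how a soft, nearly one-hot sample is produced; the gradient flow that DiffRO relies on requires an autograd framework, which is left out here. The function name and the temperatures are illustrative assumptions.

```python
import math
import random
from typing import List, Sequence


def gumbel_softmax(logits: Sequence[float], temperature: float,
                   rng: random.Random) -> List[float]:
    """Return a relaxed one-hot sample over the token vocabulary."""
    noisy = []
    for logit in logits:
        # Clamp the uniform draw away from 0 and 1 so both logs are finite.
        u = min(max(rng.random(), 1e-12), 1.0 - 1e-12)
        gumbel = -math.log(-math.log(u))
        noisy.append((logit + gumbel) / temperature)
    peak = max(noisy)
    exps = [math.exp(v - peak) for v in noisy]  # subtract max for stability
    total = sum(exps)
    return [e / total for e in exps]


token_logits = [1.0, 2.0, 0.5]
soft = gumbel_softmax(token_logits, temperature=5.0, rng=random.Random(0))
sharp = gumbel_softmax(token_logits, temperature=0.1, rng=random.Random(0))
```

With the same noise, lowering the temperature moves the output toward a one-hot vector on the same token, which is the annealing behaviour described above.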

Multi-task rewards: DiffRO can incorporate multiple reward signals:

| Reward Signal | Source |
| --- | --- |
| Content accuracy | Token2Text posterior |
| Emotion accuracy | Emotion classifier on tokens |
| Audio quality | MOS prediction model |
| Instruction adherence | Style classifier |

The KL divergence term keeps the post-trained model from collapsing or drifting too far:

Loss = -Reward + β × KL(π_post || π_ref)

Where β controls the strength of the KL penalty.
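As a worked example of the loss above, the following sketch evaluates it for categorical token distributions. Averaging the KL term over token positions and using a log-probability as the reward are assumptions made for illustration, as are the function names and the numbers.

```python
import math
from typing import Sequence


def kl_divergence(p: Sequence[float], q: Sequence[float]) -> float:
    """KL(p || q) for two categorical distributions over the same vocabulary."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def diffro_loss(reward: float,
                post_probs: Sequence[Sequence[float]],
                ref_probs: Sequence[Sequence[float]],
                beta: float) -> float:
    """Loss = -Reward + beta * KL(pi_post || pi_ref), KL averaged over positions."""
    kl = sum(kl_divergence(p, q) for p, q in zip(post_probs, ref_probs))
    kl /= len(post_probs)
    return -reward + beta * kl


reference = [[0.9, 0.1], [0.5, 0.5]]   # per-position token distributions
drifted = [[0.5, 0.5], [0.5, 0.5]]     # post-trained model moved at position 0
loss_same = diffro_loss(-0.2, reference, reference, beta=0.1)
loss_drift = diffro_loss(-0.2, drifted, reference, beta=0.1)
```

With identical distributions the KL term vanishes and the loss is just the negated reward. Any drift from the reference adds a positive penalty scaled by β.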


Data Scaling and Multilingual Coverage

CosyVoice 3’s training data expansion from 10K to 1M hours involved a sophisticated data pipeline:

Data Pipeline

Raw in-the-wild audio (web, podcasts, video)
  → ASR transcription (multiple models)
  → Pairwise WER filtering (< 15% disagreement)
  → Forced alignment for punctuation adjustment
      (add comma if pause > 300ms, remove if < 50ms)
  → Volume normalization
  → Speech/text length ratio filtering (remove bottom 1%, top 5%)
  → Clean paired dataset
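The pairwise WER filter in the pipeline can be sketched as below. The 15% threshold comes from the pipeline above. Normalizing by the longer transcript (there is no ground-truth reference between two ASR outputs) and the function names are assumptions.

```python
from typing import List, Sequence


def edit_distance(a: Sequence, b: Sequence) -> int:
    """Levenshtein distance between two sequences (words or characters)."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        cur = [i]
        for j, y in enumerate(b, 1):
            cur.append(min(prev[j] + 1,              # deletion
                           cur[j - 1] + 1,           # insertion
                           prev[j - 1] + (x != y)))  # substitution or match
        prev = cur
    return prev[-1]


def pairwise_wer(hyp_a: str, hyp_b: str) -> float:
    """Word-level disagreement between two ASR hypotheses."""
    a, b = hyp_a.split(), hyp_b.split()
    return edit_distance(a, b) / max(len(a), len(b), 1)


def keep_clip(transcripts: List[str], max_wer: float = 0.15) -> bool:
    """Keep a clip only if every pair of ASR transcripts agrees closely."""
    return all(pairwise_wer(x, y) < max_wer
               for i, x in enumerate(transcripts)
               for y in transcripts[i + 1:])
```

For example, two eight-word transcripts that differ in one word disagree by 12.5% and pass the filter, while a two-word difference (25%) causes the clip to be dropped.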

Language Coverage

| Category | Coverage |
| --- | --- |
| Primary languages | Chinese, English, Japanese, Korean, Russian, French, German, Spanish, Italian |
| Chinese dialects | 18+ (Guangdong/Cantonese, Minnan, Sichuan, Dongbei, Shanxi, Shanghai, Tianjin, Shandong, Ningxia, Gansu, etc.) |
| Text formats | Standard, normalized, inverse-normalized, mixed phoneme/character |
| Domains | E-commerce, navigation, finance, education, conversation, speech, singing |

Self-training: An early version of CosyVoice 3 was used to generate synthetic data for rare cases (unusual text formats, edge-case pronunciations), then those generations were filtered and added to the training set.

Model Scaling

| Component | CosyVoice 2 | CosyVoice 3 |
| --- | --- | --- |
| LLM parameters | 0.5B | 0.5B base / 1.5B |
| CFM parameters | 100M (U-Net) | 300M (DiT) |
| Training data | ~10K hours | ~1M hours |
| Languages | ZH, EN | 9 languages, 18+ dialects |

Voice Cloning Mechanism

CosyVoice 3 supports zero-shot voice cloning by passing a reference audio sample alongside the text. The model encodes the reference through its own pipeline (the Soniqo macOS port uses a CAM++ speaker encoder for this step).

How Cloning Works

Reference audio (10-20 seconds)
  → Speaker encoder (e.g., CAM++ in Soniqo port)
  → 192-dim speaker embedding
  → Affine projection (192 → 80)
  → Conditions DiT flow matching via AdaLN

The speaker embedding conditions the DiT flow matching decoder, not the LLM. This means the LLM generates content-appropriate tokens, and the voice timbre is injected during the mel-spectrogram synthesis stage.

| Property (Soniqo port) | Value |
| --- | --- |
| Model | CAM++ (Context-Aware Masking++) |
| Embedding | 192 dimensions |
| Backend | CoreML (Neural Engine) |
| Size | ~14 MB |

Control Tokens

Internally, CosyVoice 3 uses <|fl_*|> tokens to switch between modes. The API methods (inference_zero_shot, inference_instruct2, etc.) emit these automatically — users never write them by hand:

  • <|fl_speaker_clone|>: zero-shot voice cloning, emitted by inference_zero_shot()
  • <|fl_speaker_instruct|>: instruction-only synthesis, emitted by inference_instruct2() with instruction text
  • <|fl_speaker_instruct2|>: instruction + cloning combined, emitted by inference_instruct2() with --voice-sample
  • <|fl_save_speaker|>: persist speaker embedding, emitted by add_zero_shot_spk()

Cross-Lingual Voice Cloning

Because the tokenizer was trained on multiple languages, CosyVoice 3 can clone a voice from one language and synthesize speech in another — e.g., clone a Japanese speaker and generate English or Chinese audio with the same voice characteristics. For Japanese synthesis, the text must be transcribed to katakana first.


Instruction Control and Pronunciation Inpainting

Instruction Tags

CosyVoice 3 supports natural language instructions passed via the instruct_text parameter. The instruction is placed before the <|endofprompt|> separator:

instruct_text = "Speak cheerfully and quickly <|endofprompt|>"
     → LLM conditions on "cheerful" and "fast" prosody
     → DiT applies corresponding style conditioning
     → tts_text argument contains the actual text to speak

Alternatively, inline style tags like [laugh] and [breath] can be embedded directly in the text, supported in both inference_cross_lingual and inference_zero_shot modes.

Supported instruction dimensions:

  • Language/dialect selection
  • Emotion (happy, sad, angry, excited, gentle)
  • Speaking rate (slow, fast)
  • Volume (loud, soft)
  • Style (broadcast, conversational, narrative)

Pronunciation Inpainting

A production-oriented feature: CosyVoice 3 supports pronunciation inpainting by replacing the target character with its Chinese Pinyin, written as a bracketed initial and final:

Input: "高管也通过电话、短信、微信等方式对报道[j][ǐ]予好评。"

Here [j][ǐ] takes the place of 给 in 给予, forcing the reading jǐ rather than the more common gěi. English CMU phonemes can be used similarly. This gives fine-grained control over rare words, proper nouns, and technical terms without retraining.
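Judging from the example input, where 给 is absent from 给予, the bracketed Pinyin appears to take the place of the character whose reading is being fixed. Under that assumption, a small hypothetical helper (not part of the CosyVoice API) can build such strings. The caller supplies the initial and final.

```python
def inpaint_pinyin(text: str, target: str, initial: str, final: str,
                   occurrence: int = 0) -> str:
    """Replace one occurrence of `target` with its bracketed Pinyin initial and final."""
    pos = -1
    for _ in range(occurrence + 1):
        pos = text.find(target, pos + 1)
        if pos == -1:
            raise ValueError(f"{target!r} does not occur {occurrence + 1} time(s)")
    return text[:pos] + f"[{initial}][{final}]" + text[pos + len(target):]


# Force the reading "jǐ" for 给 in 给予.
patched = inpaint_pinyin("报道给予好评。", "给", "j", "ǐ")
```

The `occurrence` argument selects which instance to patch when the same character appears more than once with different readings.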


Bi-Streaming Inference

CosyVoice 3 supports both text-in streaming and audio-out streaming, achieving 150ms first-chunk latency.

User types text character by character
  → LLM generates speech tokens incrementally
  → DiT flow matching processes in chunks
  → HiFi-GAN outputs audio frames as they're ready
  → Audio plays back with ~150ms startup delay

The streaming architecture uses:

  • KV cache: cached transformer key/value states avoid recomputation
  • SDPA (Scaled Dot-Product Attention): efficient attention for incremental decoding
  • Chunk-aware flow matching: processes audio in overlapping windows
  • Lookahead mechanism: the pre_lookahead_len parameter controls the quality/latency tradeoff
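The interplay between chunking and lookahead can be sketched with a generator that emits a chunk of speech tokens as soon as enough future context has arrived. This is an illustration only: the chunk size, the function name and the flushing behaviour are assumptions, and the real decoder operates on tensors rather than Python lists.

```python
from typing import Iterable, Iterator, List, Tuple


def chunk_with_lookahead(tokens: Iterable[int], chunk_size: int,
                         lookahead: int) -> Iterator[Tuple[List[int], List[int]]]:
    """Yield (chunk, lookahead_context) pairs from an incoming token stream."""
    buffer: List[int] = []
    for token in tokens:
        buffer.append(token)
        # Emit once the chunk and its future context are both available.
        if len(buffer) >= chunk_size + lookahead:
            yield buffer[:chunk_size], buffer[chunk_size:chunk_size + lookahead]
            buffer = buffer[chunk_size:]  # lookahead tokens start the next chunk
    while buffer:
        # End of stream: flush what is left with whatever context remains.
        yield buffer[:chunk_size], buffer[chunk_size:chunk_size + lookahead]
        buffer = buffer[chunk_size:]


chunks = list(chunk_with_lookahead(range(10), chunk_size=4, lookahead=2))
```

A larger lookahead gives each chunk more future context, which tends to help quality at chunk boundaries, but the first chunk cannot be emitted until `chunk_size + lookahead` tokens exist, which raises startup latency.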

Deployment Backends

| Backend | Target | Speedup | Platform |
| --- | --- | --- | --- |
| TensorRT | Flow decoder ODE solver | ~4× | GPU |
| vLLM | LLM module | 2-3× batch | GPU |
| TensorRT-LLM | Full pipeline | ~4× LLM | GPU |

Note: TorchScript JIT is not available for CosyVoice 3 due to DiT architecture incompatibility.


Benchmark Results

Seed-TTS Eval

| Model | CER (ZH) ↓ | Speaker Sim (ZH) ↑ | WER (EN) ↓ | Speaker Sim (EN) ↑ |
| --- | --- | --- | --- | --- |
| Human | 1.26 | 75.5 | 2.14 | 73.4 |
| Seed-TTS | 1.12 | 79.6 | 2.25 | 76.2 |
| CosyVoice 2 (0.5B) | 1.45 | 75.7 | 2.57 | 65.9 |
| Fun-CosyVoice3 (0.5B) | 1.21 | 78.0 | 2.24 | 71.8 |
| Fun-CosyVoice3 + RL | 0.81 | 77.4 | 1.68 | 69.5 |

Hard Test Set

| Model | CER ↓ | Speaker Sim ↑ |
| --- | --- | --- |
| CosyVoice 2 | 6.83 | 72.4 |
| CosyVoice 3 (0.5B) | 6.71 | 75.8 |
| CosyVoice 3 + RL | 5.44 | 75.0 |

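The hard test set includes challenging in-the-wild samples with background noise, varied recording conditions, and non-standard speech. With its improved tokenizer and DiT backbone, the base CosyVoice 3 model lowers CER only slightly (6.83 → 6.71) but raises speaker similarity by 3.4 points. DiffRO post-training then brings CER down to 5.44 at a small cost in similarity (75.0).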


Comparison with CosyVoice 2

| Feature | CosyVoice 2 | CosyVoice 3 |
| --- | --- | --- |
| Speech tokenizer | SenseVoice + FSQ | MinMo + FSQ (multi-task supervised) |
| Token rate | Variable | 25 Hz fixed |
| CFM backbone | U-Net | DiT (Transformer) |
| LM parameters | 0.5B | 0.5B / 1.5B |
| CFM parameters | 100M | 300M |
| Training data | ~10K hours | ~1M hours |
| Languages | ZH, EN | 9 languages + 18+ dialects |
| Post-training | None | DiffRO (differentiable RL) |
| Pronunciation control | Limited | Pinyin + CMU phoneme inpainting |
| Silent token filtering | No | Yes (11 tokens) |
| vLLM support | Yes | Yes |
| Streaming latency | ~150ms | ~150ms |
| License | Apache 2.0 | Apache 2.0 |

Deployment Requirements

CosyVoice 3 is designed for GPU inference but the 0.5B variant can run on consumer hardware.

| Setup | Minimum | Recommended |
| --- | --- | --- |
| GPU VRAM | 4GB (0.5B 4-bit) | 8GB+ |
| RAM | 8GB | 16GB |
| Storage | 5GB (model weights) | 10GB+ |
| Runtime | PyTorch + CUDA | vLLM or TensorRT |
| CPU inference | Possible (slow) | Not recommended |

Quick Start

import sys
sys.path.append('third_party/Matcha-TTS')
from cosyvoice.cli.cosyvoice import AutoModel
import torchaudio

cosyvoice = AutoModel(model_dir='pretrained_models/Fun-CosyVoice3-0.5B')

# Zero-shot voice cloning (prompt text with <|endofprompt|> separator)
for i, j in enumerate(cosyvoice.inference_zero_shot(
    'Hello, this is a cloned voice.',
    'You are a helpful assistant.<|endofprompt|>希望你以后能够做的比我还好呦。',
    './asset/zero_shot_prompt.wav', stream=False
)):
    torchaudio.save('zero_shot_{}.wav'.format(i), j['tts_speech'], cosyvoice.sample_rate)

# Instruction-controlled synthesis with voice cloning
for i, j in enumerate(cosyvoice.inference_instruct2(
    'Welcome to the show.',
    'You are a helpful assistant. Speak cheerfully and fast.<|endofprompt|>',
    './asset/zero_shot_prompt.wav', stream=False
)):
    torchaudio.save('instruct_{}.wav'.format(i), j['tts_speech'], cosyvoice.sample_rate)

# Pronunciation inpainting (hotfix)
for i, j in enumerate(cosyvoice.inference_zero_shot(
    '报道[j][ǐ]予好评。',
    'You are a helpful assistant.<|endofprompt|>希望你以后能够做的比我还好呦。',
    './asset/zero_shot_prompt.wav', stream=False
)):
    torchaudio.save('hotfix_{}.wav'.format(i), j['tts_speech'], cosyvoice.sample_rate)

Speech-Swift (macOS / Apple Silicon)

The model is available through soniqo/speech-swift as a Swift package with MLX and CoreML backends:

brew install soniqo/tap/speech
speech speak "Hello world" --voice-sample reference.wav --cosy-instruct "cheerful"

On an M2 Max, the 4-bit quantized model runs with RTF ~0.5 (faster than real-time).


What CosyVoice 3 Means for TTS

  1. Data scaling works for speech synthesis. CosyVoice 3 provides strong evidence that scaling laws apply to TTS — the jump from 10K to 1M hours produced consistent quality improvements.

  2. Speech tokenizers are the bottleneck. The shift from ASR-only tokenizer supervision (CosyVoice 2) to multi-task supervision (CosyVoice 3) directly improved prosody naturalness. Future TTS models will likely use increasingly sophisticated tokenizer training.

  3. Differentiable RL is a pragmatic innovation. DiffRO avoids the computational overhead of traditional RL for TTS (running CFM + vocoder for every sample) by operating on tokens. This approach should generalize to other discrete-token speech models.

  4. Chinese dialect support is a differentiator. CosyVoice 3’s 18+ Chinese dialect coverage addresses a real market need that most Western TTS systems ignore entirely.

  5. The gap with top commercial models is narrowing. On Seed-TTS Eval, CosyVoice 3 + RL posts lower CER and WER than Seed-TTS (a proprietary ByteDance model), though speaker similarity still lags (77.4 vs 79.6 on ZH, 69.5 vs 76.2 on EN). Open-source TTS continues to close the gap.

