Fish Audio S2 Pro: Benchmark-Leading Voice Cloning

Fish Audio released S2 Pro in early 2026, and it immediately claimed the top spot on the TTS Arena leaderboard with an Elo of 1339 — ahead of every commercial API including ElevenLabs, OpenAI TTS, and Google’s Studio model.

The architecture is documented in the Fish Audio S2 Technical Report. This article breaks down how it works, what makes it different, and what it means for the TTS landscape.

Why S2 Pro Matters

S2 Pro is the first open-weight model to consistently beat commercial APIs in blind evaluations. On independent TTS Arena matchups, it wins around 61% of its bouts against other models, including proprietary systems. On Seed-TTS Eval, it achieves the lowest Word Error Rate (WER) among the systems compared here: 0.54% on Chinese and 0.99% on English, beating Seed-TTS (1.12%/2.25%), MiniMax-Speech-02 (0.99%/1.90%), and Qwen3-TTS (0.77%/1.24%).

Metric	S2 Pro	Next Best
Seed-TTS Eval WER (ZH)	0.54%	0.77% (Qwen3-TTS)
Seed-TTS Eval WER (EN)	0.99%	1.24% (Qwen3-TTS)
Audio Turing Test	0.515	0.417 (Seed-TTS)
EmergentTTS-Eval Win Rate	81.88%	—

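These are not incremental improvements. Relative to the next-best system in the table, S2 Pro cuts WER by roughly 30% on Chinese (0.77% to 0.54%) and roughly 20% on English (1.24% to 0.99%), a step-change in open-source TTS quality.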

The Dual-Autoregressive (Dual-AR) Architecture

S2 Pro’s core innovation is its master-slave Dual-AR design — two transformers operating at different granularities, each handling a distinct aspect of speech generation.

Text tokens
     │
     ▼
┌─────────────────────┐
│   Slow AR (4B)      │  ← Predicts primary codebook (temporal structure)
│   Decoder-only      │     Content, prosody, pacing, emotion
│   Transformer       │
└─────────┬───────────┘
          │ Primary tokens (coarse)
          ▼
┌─────────────────────┐
│   Fast AR (400M)    │  ← Predicts 9 residual codebooks (acoustic detail)
│   4-layer           │     Timbre, breathiness, articulation, texture
│   Transformer       │
└─────────┬───────────┘
          │ Full 10-codebook token sequence
          ▼
┌─────────────────────┐
│   DAC Vocoder       │  ← Neural codec decoder → waveform
│   (Descript Audio   │
│    Codec)           │
└─────────────────────┘

Why Two Autoregressive Models?

Standard single-AR TTS models (like early GPT-based approaches) must predict all codebook levels at every time step. This creates a fundamental tension: the coarse semantic structure (what to say, with what prosody) and the fine acoustic detail (exact timbre, articulation nuances) compete for the same modeling capacity.

S2 Pro separates these concerns:

Slow AR (4B parameters): Operates along the time axis. At each frame (~21Hz frame rate), it predicts only the primary (first) codebook entry. This is the “what gets said and how” — content, pitch contour, pacing, emotional inflection. The 4B parameter scale gives it the capacity to model long-range temporal dependencies.
Fast AR (400M parameters, 4-layer transformer): Operates at each time step predicted by the Slow AR. It takes the primary codebook entry and generates the remaining 9 residual codebooks that reconstruct the full acoustic detail. This is the “what it sounds like” — timbre, breathiness, articulation precision, room characteristics.

The asymmetry is deliberate. The Slow AR needs substantial capacity because it must understand language, prosody, and speaker identity. The Fast AR is a conditional refinement network — given the primary token, predict the residual. It is 10x smaller and correspondingly faster.

Token representation:

Frame rate: ~21 Hz (one frame per ~47ms of audio)
Codebooks: 10 per frame (1 primary + 9 residual)

Frame 1: [c₁₁, c₁₂, c₁₃, ..., c₁₁₀]
Frame 2: [c₂₁, c₂₂, c₂₃, ..., c₂₁₀]
...
Frame N: [cN₁, cN₂, cN₃, ..., cN₁₀]

Slow AR predicts: c₁₁ → c₂₁ → ... → cN₁ (1 codebook per frame)
Fast AR predicts: [c₁₂...c₁₁₀] per frame (9 codebooks per frame)

Inference flow:

# Pseudocode for Dual-AR inference
# (slow_ar, fast_ar, sample, and dac_vocoder are placeholders, not a real API)
primary_tokens = []
for step in range(max_frames):
    logits = slow_ar.forward(text_tokens, primary_tokens, conditioning)
    next_primary = sample(logits)  # top-p, temperature
    if next_primary == EOS_TOKEN:  # stop once the model signals end of speech
        break
    primary_tokens.append(next_primary)

full_sequence = []
for primary in primary_tokens:
    frame = [primary]
    for _ in range(9):  # the Fast AR is autoregressive over the 9 residual codebooks
        residual_logits = fast_ar.forward(frame)
        frame.append(sample(residual_logits))
    full_sequence.append(frame)  # 1 primary + 9 residual entries

waveform = dac_vocoder.decode(full_sequence)  # DAC = Descript Audio Codec
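
To make the control flow concrete, here is a toy, runnable version of the same two-phase loop. `ToySlowAR` and `ToyFastAR` are hypothetical stand-ins that emit random logits. They are not Fish Audio's models, and the codebook size and EOS id are assumptions chosen only for illustration.

```python
import math
import random

CODEBOOK_SIZE = 1024   # assumed entries per codebook
NUM_RESIDUAL = 9       # 9 residual codebooks, as described above
EOS = CODEBOOK_SIZE    # assumed extra id that ends generation

rng = random.Random(0)

def sample(logits, temperature=1.0):
    """Temperature sampling from a list of logits."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    weights = [math.exp(x - peak) for x in scaled]
    return rng.choices(range(len(weights)), weights=weights, k=1)[0]

class ToySlowAR:
    """Stand-in for the Slow AR: one primary-token distribution per frame."""
    def forward(self, text_tokens, primary_tokens):
        logits = [rng.gauss(0.0, 1.0) for _ in range(CODEBOOK_SIZE + 1)]
        # Make EOS more likely as the output grows so the toy loop terminates.
        logits[EOS] += 0.5 * len(primary_tokens) - 5.0
        return logits

class ToyFastAR:
    """Stand-in for the Fast AR: next residual entry given the frame so far."""
    def forward(self, frame_so_far):
        return [rng.gauss(0.0, 1.0) for _ in range(CODEBOOK_SIZE)]

def generate(text_tokens, max_frames=50):
    slow, fast = ToySlowAR(), ToyFastAR()
    # Phase 1: the Slow AR walks the time axis, one primary token per frame.
    primary = []
    for _ in range(max_frames):
        token = sample(slow.forward(text_tokens, primary))
        if token == EOS:
            break
        primary.append(token)
    # Phase 2: the Fast AR fills in the residual codebooks for each frame.
    frames = []
    for p in primary:
        frame = [p]
        for _ in range(NUM_RESIDUAL):
            frame.append(sample(fast.forward(frame)))
        frames.append(frame)
    return frames

frames = generate(text_tokens=[1, 2, 3])
print(len(frames), "frames of", 1 + NUM_RESIDUAL, "codebook entries each")
```

The output has the shape the codec decoder expects in the article's description: one row per frame and ten codebook indices per row.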

The Multi-Stage Training Recipe

The S2 Pro technical report details a sophisticated three-stage training pipeline that would be impractical for most independent teams to replicate. Each stage targets a different capability.

Stage 1: Semantic Pre-Training

The Slow AR is pre-trained on a massive corpus of interleaved text and audio tokens. The training objective is standard autoregressive next-token prediction — predict the next primary codebook token given all previous tokens.

Training data: ~10 million hours of audio, sourced from:

Public speech datasets (LibriSpeech, Common Voice, VoxPopuli, etc.)
Web-crawled video transcripts (video captioning pipeline)
Podcast and audiobook sources
Synthetic data from caption generation

Key detail — the video captioning pipeline: Raw video is transcribed with a Whisper-based ASR model. Low-quality segments (noise, silence, overlapping speech) are filtered using a voice activity detection (VAD) and quality assessment model. Only segments with clear, single-speaker speech survive.

Stage 2: Speech Captioning and Instruction Tuning

This is the stage that sets S2 Pro apart from earlier models. The team trained a speech captioner — a multimodal model that takes audio as input and produces natural language descriptions of:

Speaker demographics (gender, approximate age, accent)
Speaking style (fast, slow, emphatic, whispered)
Emotional state (excited, sad, angry, neutral)
Acoustic environment (quiet room, echo, outdoor)
Voice quality (breathy, nasal, clear, raspy)

These captions are used as conditioning signals during training, enabling the model to associate specific text descriptions with corresponding acoustic features. This is what makes the [tag] system possible at inference time.

Data pipeline for captions:

Raw audio
  → Whisper ASR (transcript)
  → Speech captioner LLM (description)
  → Quality filter (voice quality score, SNR check)
  → Paired training example: (text, caption, audio tokens)
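
The last two steps of this pipeline can be sketched as a simple filter. The `Segment` fields, the threshold values, and the function name below are all hypothetical; the article does not state the real scoring models or cutoffs.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

@dataclass
class Segment:
    transcript: str          # from the ASR step
    caption: str             # from the speech captioner
    audio_tokens: List[int]  # codec tokens for the segment
    quality_score: float     # hypothetical 0..1 voice-quality score
    snr_db: float            # estimated signal-to-noise ratio in dB

# Hypothetical thresholds, chosen only for illustration.
MIN_QUALITY = 0.6
MIN_SNR_DB = 15.0

def to_training_example(seg: Segment) -> Optional[Tuple[str, str, List[int]]]:
    """Return (text, caption, audio tokens) if the segment passes the filter."""
    if not seg.transcript.strip() or not seg.caption.strip():
        return None
    if seg.quality_score < MIN_QUALITY or seg.snr_db < MIN_SNR_DB:
        return None
    return (seg.transcript, seg.caption, seg.audio_tokens)

segments = [
    Segment("hello there", "calm female voice, quiet room", [5, 9, 2], 0.9, 28.0),
    Segment("uh", "noisy street, muffled speech", [7, 7], 0.3, 4.0),
]
examples = [ex for ex in map(to_training_example, segments) if ex is not None]
print(len(examples))  # 1
```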

Stage 3: Reinforcement Learning Alignment (GRPO)

The final stage uses Group Relative Policy Optimization (GRPO) — a variant of RL that avoids the memory overhead of PPO-style value/critic networks by normalizing rewards within sampled groups. The same model suite used for data cleaning serves as the reward signal.

Multi-dimensional reward signals:

Reward Dimension	Signal Source	What It Measures
Semantic accuracy	Whisper ASR WER	Does the generated audio say the right words?
Instruction adherence	Speech captioner similarity	Does the audio match the `[tag]` description?
Acoustic preference	Learned preference model	Would a human rate this as natural?
Timbre similarity	Speaker verification model	Does the cloned voice match the reference?

The GRPO update works as follows:

For each prompt, sample N candidate outputs (N=8 in the paper)
Compute all reward signals for each candidate
Normalize rewards within the group (mean = 0, std = 1)
Update model parameters to favor higher-rewarded candidates

This is computationally expensive (N full generations per prompt, each scored by every reward model) but produces measurable quality improvements in blind evaluations.
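
The group-normalization step is the part that replaces a learned critic, and it is small enough to sketch. The equal weights and the weighted-sum combination in `total_reward` are assumptions; the article does not say how the four reward dimensions are merged.

```python
from statistics import mean, pstdev

# Hypothetical weights for the four reward dimensions listed above.
WEIGHTS = {"semantic": 1.0, "instruction": 1.0, "preference": 1.0, "timbre": 1.0}

def total_reward(signals):
    """Combine per-dimension rewards for one candidate (assumed weighted sum)."""
    return sum(WEIGHTS[name] * value for name, value in signals.items())

def group_relative_advantages(rewards, eps=1e-8):
    """Normalize rewards within one sampled group to mean 0 and std 1."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]

# Made-up total rewards for N=8 candidates sampled from one prompt.
group = [0.2, 0.9, 0.4, 0.7, 0.1, 0.8, 0.5, 0.4]
advantages = group_relative_advantages(group)
best = max(range(len(group)), key=lambda i: advantages[i])
print(best)  # 1
```

Candidates with positive advantage are pushed up and those with negative advantage are pushed down. When every candidate in a group scores the same, all advantages are zero and the group contributes no gradient.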

Fine-Grained Control via Natural Language Tags

S2 Pro’s most user-facing innovation is the [tag] system — natural language descriptions embedded inline in the text that control prosody, emotion, and speaking style at a sub-word level.

How it works internally:

During training, the speech captions are converted into a structured tag format. The model learns to associate tag sequences with specific acoustic token patterns. At inference time, tags are treated as additional text tokens in the input sequence:

Input: "Hello [excited] this is amazing news [whisper] but keep it quiet"
     → Tokenized as: [text_tokens] + [tag_tokens] + [text_tokens] + ...

The Slow AR conditions on both text and tag tokens, learning the mapping between tag descriptions and the resulting acoustic token patterns.
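
As a rough illustration of the input side, the snippet below splits a tagged string into alternating text and tag segments. It is a minimal sketch of the bracket syntax only; the real tokenizer and its handling of nested or malformed brackets are not described in the article, and `split_tags` is a hypothetical helper.

```python
import re

# Matches one [tag] with no nested brackets inside.
TAG_PATTERN = re.compile(r"\[([^\[\]]+)\]")

def split_tags(text):
    """Split text into ("text", ...) and ("tag", ...) segments, in order."""
    segments = []
    pos = 0
    for match in TAG_PATTERN.finditer(text):
        before = text[pos:match.start()].strip()
        if before:
            segments.append(("text", before))
        segments.append(("tag", match.group(1).strip()))
        pos = match.end()
    tail = text[pos:].strip()
    if tail:
        segments.append(("text", tail))
    return segments

parsed = split_tags("Hello [excited] this is amazing news [whisper] but keep it quiet")
for kind, value in parsed:
    print(kind, "->", value)
```

For the example input, this yields five segments: `Hello`, the `excited` tag, `this is amazing news`, the `whisper` tag, and `but keep it quiet`.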

Supported tag categories:

Category	Examples
Speaking style	`[whisper]`, `[shouting]`, `[singing]`, `[laughing tone]`, `[professional broadcast tone]`
Emotion	`[excited]`, `[sad]`, `[angry]`, `[surprised]`, `[delight]`
Vocal effects	`[breathy]`, `[vocal fry]`, `[nasal]`, `[clearing throat]`
Volume	`[volume up]`, `[volume down]`, `[low volume]`
Pacing	`[pause]`, `[short pause]`, `[emphasis]`, `[slow down]`
Non-speech	`[inhale]`, `[exhale]`, `[sigh]`, `[laughing]`, `[audience laughter]`
Free-form	`[whisper in small voice]`, `[like a news anchor]`, `[pitch up]`

15,000+ unique tags are supported, including free-form text. The system does not rely on a fixed tag vocabulary — because the speech captioner was trained to produce natural language descriptions, the model generalizes to novel descriptions at inference time.

Voice Cloning Mechanism

S2 Pro supports zero-shot voice cloning from short reference samples (10-30 seconds recommended). The mechanism uses in-context learning rather than a separate speaker encoder.

In-Context Cloning

Reference audio is encoded through the same RVQ codec pipeline used during training, producing a sequence of 10-codebook tokens. These tokens are prepended to the text tokens as an input prefix to the Slow AR:

Input sequence: [ref_audio_tokens] [text_tokens]
                    │                      │
                    ▼                      ▼
               Slow AR conditions     Generates target
               on reference voice     speech tokens

The Slow AR attends to the reference token patterns to reproduce the speaker’s timbre, prosody, and speaking style. Because the model was trained on millions of speakers, it generalizes to novel voices without any fine-tuning.

Reference processing:

# Conceptual S2 Pro voice cloning
ref_audio_16khz = load_audio("reference.wav", sample_rate=16000)
ref_tokens = codec_encoder(ref_audio_16khz)  # RVQ: 10 codebooks × N frames

# Concatenation-based in-context conditioning
input_tokens = [
    BOS_TOKEN,
    *ref_tokens.flatten(),      # Reference acoustic tokens
    SPEAKER_SEPARATOR_TOKEN,
    *text_tokenizer(target_text),
    EOS_TOKEN,
]

# Generate
primary_tokens = slow_ar.generate(input_tokens, max_new_tokens=M)
full_sequence = fast_ar.expand(primary_tokens)
waveform = dac_vocoder.decode(full_sequence)

Native Multi-Speaker Generation

A unique capability: S2 Pro can handle reference audio containing multiple speakers. The model processes each speaker’s features via <|speaker:i|> tokens and learns to attribute different voice characteristics to different speaker IDs in the output.

Input reference: "Speaker A says X, Speaker B says Y"
     → Model attributes separate voice characteristics to A and B in context (no separate speaker encoder)
     → Output with `<|speaker:0|>` and `<|speaker:1|>` tags
     → Generates alternating voices in a single inference pass

This is useful for dialogue generation, podcast scripting, and multi-character narration.
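
A dialogue script could be assembled for such a model roughly as follows. Only the `<|speaker:i|>` token format comes from the description above; the exact placement of the tokens relative to the text and the `build_dialogue` helper are assumptions.

```python
def build_dialogue(turns):
    """turns: list of (speaker_id, text) pairs. Returns one tagged input string."""
    parts = []
    for speaker_id, text in turns:
        if speaker_id < 0:
            raise ValueError("speaker_id must be non-negative")
        parts.append(f"<|speaker:{speaker_id}|>{text.strip()}")
    return "".join(parts)

script = build_dialogue([
    (0, "Did you see the results?"),
    (1, "I did. [excited] They look great!"),
])
print(script)
```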

The RVQ Codec and DAC Vocoder

RVQ Codec Architecture

S2 Pro uses a Residual Vector Quantization (RVQ) audio codec with 10 codebooks operating at ~21 Hz frame rate.

Input audio (16kHz, 16-bit PCM)
  → Encoder (Conv1D stack with downsampling) → latent vector z per frame
  → RVQ quantization:
      Level 1: nearest neighbor to z in codebook 1 → index c₁
      Residual: x₁ = z - codebook_1[c₁]
      Level 2: nearest neighbor to x₁ in codebook 2 → index c₂
      Residual: x₂ = x₁ - codebook_2[c₂]
      ... up to 10 levels
  → Output: 10 indices per frame [c₁, c₂, ..., c₁₀]
  → Frame rate: ~21 Hz (~47ms per frame)
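
The quantization loop itself is generic RVQ and can be sketched in a few lines of plain Python. The tiny one-dimensional codebooks are made up for illustration; a real codec learns its codebooks and works on high-dimensional encoder latents.

```python
def nearest(codebook, vec):
    """Index of the codebook entry closest to vec (squared Euclidean distance)."""
    return min(
        range(len(codebook)),
        key=lambda i: sum((c - v) ** 2 for c, v in zip(codebook[i], vec)),
    )

def rvq_encode(latent, codebooks):
    """Quantize one frame's latent: each level encodes the previous residual."""
    residual = list(latent)
    indices = []
    for codebook in codebooks:
        idx = nearest(codebook, residual)
        indices.append(idx)
        residual = [r - c for r, c in zip(residual, codebook[idx])]
    return indices, residual

def rvq_decode(indices, codebooks):
    """Reconstruct the latent by summing the selected entry from every level."""
    out = [0.0] * len(codebooks[0][0])
    for idx, codebook in zip(indices, codebooks):
        out = [o + c for o, c in zip(out, codebook[idx])]
    return out

# Two toy levels: a coarse codebook and a finer one for the residual.
codebooks = [
    [[0.0], [1.0], [2.0]],
    [[0.0], [0.25], [0.5]],
]
indices, leftover = rvq_encode([1.3], codebooks)
approx = rvq_decode(indices, codebooks)
print(indices, approx)  # [1, 1] [1.25]
```

Each extra level quantizes what the previous levels missed, which is why the first codebook carries the coarse structure and later codebooks carry progressively finer detail.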

Frame rate detail: With 10 codebooks at ~21 Hz, each second of audio is represented by ~210 discrete tokens. A 30-second reference clip yields approximately 6,300 tokens.
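
The same arithmetic as a small helper, using the article's approximate figures of 21 Hz and 10 codebooks. How many sequence positions the Slow AR actually consumes per frame depends on how the codebooks are embedded, which is not specified here, so treat the totals as token counts rather than context-length guarantees.

```python
FRAME_RATE_HZ = 21    # approximate frame rate from the article
NUM_CODEBOOKS = 10    # 1 primary + 9 residual

def token_budget(audio_seconds):
    """Frame and token counts for a clip of the given duration."""
    frames = round(audio_seconds * FRAME_RATE_HZ)
    return {
        "frames": frames,
        "primary_tokens": frames,
        "residual_tokens": frames * (NUM_CODEBOOKS - 1),
        "total_tokens": frames * NUM_CODEBOOKS,
    }

budget = token_budget(30)
print(budget["frames"], budget["total_tokens"])     # 630 6300
print(f"{1000 / FRAME_RATE_HZ:.1f} ms per frame")   # 47.6 ms per frame
```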

DAC (Descript Audio Codec) Decoder

The final stage uses a DAC (Descript Audio Codec) decoder. It takes the accumulated 10-codebook token sequence and reconstructs the 16kHz audio waveform. Unlike the earlier V1.5 model which used Firefly-GAN, S2 Pro uses the DAC codec for both encoding (audio-to-tokens) and decoding (tokens-to-audio). This is the same family of neural codec used by the broader audio generation ecosystem.

Token sequence [10 × N]  →  Token embedding lookup
  →  Conv1D decoder stack  →  16kHz waveform

Streaming Inference with SGLang

S2 Pro is structurally isomorphic to a standard language model (decoder-only transformer), which means it can leverage all of SGLang’s inference optimizations:

Technique	Benefit
Continuous Batching	Multiple requests processed concurrently on one GPU
Paged KV Cache	Efficient memory management for long generations
CUDA Graph	Reduced kernel launch overhead
RadixAttention Prefix Caching	Shared prompt prefixes cached across requests (critical for voice cloning, where the reference prompt is reused)

Performance on a single NVIDIA H200 (141GB HBM3e):

Metric	Value
Real-Time Factor (RTF)	0.195 (~5x faster than real-time)
Time-to-First-Audio (TTFA)	~100ms
Extreme throughput	3,000+ acoustic tokens/s at RTF < 0.5

What RTF 0.195 means: One second of audio is generated in 195ms of compute time. A 10-second voiceover is generated in under 2 seconds.
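
For planning purposes the relationship is just a multiplication. The helper names below are illustrative, and the numbers assume the reported RTF holds for your hardware and batch size.

```python
def generation_time_s(audio_seconds, rtf=0.195):
    """Compute time needed to synthesize the given amount of audio."""
    return audio_seconds * rtf

def speedup(rtf=0.195):
    """How many times faster than real-time a given RTF is."""
    return 1.0 / rtf

print(f"{generation_time_s(10):.2f} s")   # 1.95 s
print(f"{speedup():.1f}x real-time")      # 5.1x real-time
```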

Benchmark Analysis

Seed-TTS Eval

Model	WER (ZH)	WER (EN)
Fish Audio S2 Pro	0.54%	0.99%
Seed-TTS	1.12%	2.25%
MiniMax-Speech-02	0.99%	1.90%
Qwen3-TTS	0.77%	1.24%
CosyVoice-300M	2.24%	4.31%

WER (Word Error Rate) measures whether the generated audio is intelligible — low WER means the model says the right words. These numbers are from the Fish Audio S2 Technical Report and GitHub README.

Audio Turing Test

The Audio Turing Test measures how often human listeners mistake synthetic speech for human speech. S2 Pro’s score of 0.515 means listeners rated it as human-level or near-human-level in approximately half of the trials. This surpasses Seed-TTS (0.417) by 24% and MiniMax-Speech (0.387) by 33%.
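
The relative margins quoted above follow directly from the scores, as this short check shows.

```python
def relative_gain(score, baseline):
    """Relative improvement of score over baseline, as a fraction."""
    return (score - baseline) / baseline

gains = {
    name: relative_gain(0.515, baseline)
    for name, baseline in [("Seed-TTS", 0.417), ("MiniMax-Speech", 0.387)]
}
for name, gain in gains.items():
    print(f"{name}: {gain:.1%}")
# Seed-TTS: 23.5%
# MiniMax-Speech: 33.1%
```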

EmergentTTS-Eval

Category	Win Rate
Paralinguistics	91.61%
Questions	84.41%
Syntactic complexity	83.39%
Overall	81.88%

Multilingual Coverage

S2 Pro supports over 80 languages without phoneme conversion or language-specific preprocessing. It learns language identity from the text token stream itself.

Tier breakdown:

Tier	Languages	Quality Level
Tier 1	Japanese, English, Chinese	Best
Tier 2	Korean, Spanish, Portuguese, Arabic, Russian, French, German	Excellent
Tier 3+	Swedish, Italian, Turkish, Norwegian, Dutch, Welsh, Finnish, Polish, Estonian, Hindi, Latin, Urdu, Thai, Vietnamese, Javanese, Bengali, Yoruba, Czech, Swahili, Hebrew, Malay, Ukrainian, Indonesian, Kazakh, Bulgarian, Latvian, Myanmar, Filipino/Tagalog, Slovak, Nepali, Persian, Afrikaans, Greek, Tibetan, Croatian, Romanian, Shona, Maori, Yiddish, Amharic, Belarusian, Khmer, Icelandic, Azerbaijani, Sindhi, Breton, Albanian, Pashto, Mongolian, Haitian Creole, Malayalam, Serbian, Telugu, Georgian, Bosnian, Punjabi, Lithuanian, Kannada, Sinhala, Armenian, Marathi, Assamese, Gujarati, Fo, and 15+ more	Varies by language

Comparison with Major Competitors

Feature	S2 Pro	ElevenLabs	OpenAI TTS	Qwen3-TTS	Chatterbox-Turbo
Open weights	Yes (research license)	No	No	Yes	Yes (MIT)
Voice cloning	Zero-shot (10-30s)	Yes	No	Yes (3s)	Yes (5s+)
Emotion control	Free-form `[tag]`	Limited	No	Basic	0.25-2.0 scale
Languages	80+	29	Limited	Varies	English
Parameters	4B (Slow) + 400M (Fast)	Proprietary	Proprietary	~3.5B	350M-500M
RTF (streaming)	0.195	Cloud latency	Cloud latency	0.14 (M4 Pro)	~0.3 (GPU)
Runs offline on Mac	No	No	No	Yes (via MLX)	Yes
Multi-speaker gen	Yes (native)	No	No	No	No
Multi-turn dialog	Yes (native)	Yes	No	No	No
License for commercial use	No (Fish Audio Research License)	Yes (API)	Yes (API)	Check	Yes (MIT)

Deployment Requirements

S2 Pro is not a model you run on a laptop. The 4B parameter Slow AR requires a high-end GPU for real-time inference.

Setup	Minimum	Recommended
GPU VRAM	24GB	80GB+ (H200 ideal)
RAM	32GB	64GB
Storage	30GB (model weights)	50GB+ (with codec + tools)
Runtime	SGLang-Omni + CUDA	SGLang-Omni + CUDA + Flash Attention 3
CPU inference	Not practical	Not practical
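
A back-of-the-envelope estimate helps put the VRAM figures in context. The sketch assumes 16-bit weights (2 bytes per parameter), which the article does not state, and it counts weights only. KV cache, codec, activations, and runtime overhead all come on top.

```python
def weight_memory_gb(num_params, bytes_per_param=2):
    """Memory for model weights alone, in decimal gigabytes."""
    return num_params * bytes_per_param / 1e9

slow_ar_params = 4.0e9   # Slow AR, from the article
fast_ar_params = 0.4e9   # Fast AR, from the article

weights_gb = weight_memory_gb(slow_ar_params + fast_ar_params)
print(f"{weights_gb:.1f} GB for weights")  # 8.8 GB for weights
```

Under that assumption, a 24GB card leaves roughly 15GB for everything else, which is consistent with 24GB being listed as a minimum rather than a comfortable target.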

SGLang-Omni server setup:

# Install SGLang-Omni
pip install sglang-omni

# Download model weights
hf download fishaudio/s2-pro

# Start the server with a config file
sgl-omni serve \
  --model-path fishaudio/s2-pro \
  --config examples/configs/s2pro_tts.yaml \
  --port 8000

Voice cloning via API:

curl -X POST http://localhost:8000/v1/audio/speech \
  -H "Content-Type: application/json" \
  -d '{
    "input": "Text to synthesize",
    "references": [{
      "audio_path": "/path/to/reference.wav",
      "text": "Reference transcript"
    }]
  }' \
  --output output.wav

The server exposes an OpenAI-compatible /v1/audio/speech endpoint. Reference audio KV states are automatically cached by SGLang’s RadixAttention (86.4% average cache hit rate when reusing the same voice).
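
The same request can be issued from Python with only the standard library. The payload fields mirror the curl example above; the helper names, default port, and timeout are assumptions, and the actual send is left commented out because it needs a running server.

```python
import json
import urllib.request

def build_speech_request(text, ref_audio_path, ref_text,
                         base_url="http://localhost:8000"):
    """Build a POST request matching the curl example above."""
    payload = {
        "input": text,
        "references": [{"audio_path": ref_audio_path, "text": ref_text}],
    }
    return urllib.request.Request(
        url=f"{base_url}/v1/audio/speech",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )

def synthesize(request, out_path="output.wav", timeout=120):
    """Send the request and write the returned audio bytes to disk."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        audio = response.read()
    with open(out_path, "wb") as f:
        f.write(audio)
    return out_path

req = build_speech_request(
    "Text to synthesize", "/path/to/reference.wav", "Reference transcript"
)
print(req.get_method(), req.full_url)
# synthesize(req)  # uncomment once the server is running
```

Reusing the same `ref_audio_path` and `ref_text` across requests keeps the prompt prefix identical, which is the case the prefix cache is described as helping.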

What S2 Pro Means for the TTS Landscape

The quality gap between open and closed TTS has effectively closed. S2 Pro matches or beats every commercial API on blind evaluations. The practical difference now is not quality but infrastructure: commercial APIs offer zero-setup access; S2 Pro requires GPU infrastructure.
Instruction-following is the new frontier. S2 Pro’s [tag] system points toward a future where TTS is controlled through natural language, not sliders or dropdowns. Users describe what they want and the model delivers it.
Scale still wins. The S2 Pro team trained on millions of hours of data with a multi-stage pipeline costing tens of thousands of GPU-hours. This is not reproducible by individuals. The open-source community benefits from the release of weights, not from the ability to replicate training.
Privacy remains a differentiator for local TTS. S2 Pro needs a GPU with at least 24GB of VRAM, which in practice means a workstation or a cloud instance. For users who need offline, private TTS on consumer hardware such as a Mac, smaller models like Chatterbox-Turbo (350M params) or Qwen3-TTS via MLX remain the practical options. The tradeoff is quality vs privacy on hardware you already own, and that tradeoff has not gone away.