Fish Audio released S2 Pro in early 2026, and it immediately claimed the top spot on the TTS Arena leaderboard with an Elo of 1339 — ahead of every commercial API including ElevenLabs, OpenAI TTS, and Google’s Studio model.
The architecture is documented in the Fish Audio S2 Technical Report. This article breaks down how it works, what makes it different, and what it means for the TTS landscape.
Why S2 Pro Matters
S2 Pro is the first open-weight model to consistently beat commercial APIs in blind evaluations. On independent TTS Arena matchups, it wins around 61% of its bouts against other models, including proprietary systems. On Seed-TTS Eval, it reports the lowest Word Error Rate (WER) of the models compared here: 0.54% on Chinese and 0.99% on English, ahead of Seed-TTS (1.12%/2.25%), MiniMax-Speech-02 (0.99%/1.90%), and Qwen3-TTS (0.77%/1.24%).
| Metric | S2 Pro | Next Best |
|---|---|---|
| Seed-TTS Eval WER (ZH) | 0.54% | 0.77% (Qwen3-TTS) |
| Seed-TTS Eval WER (EN) | 0.99% | 1.24% (Qwen3-TTS) |
| Audio Turing Test | 0.515 | 0.417 (Seed-TTS) |
| EmergentTTS-Eval Win Rate | 81.88% | — |
These are not incremental improvements. S2 Pro represents a step-change in open-source TTS quality.
The Dual-Autoregressive (Dual-AR) Architecture
S2 Pro’s core innovation is its master-slave Dual-AR design — two transformers operating at different granularities, each handling a distinct aspect of speech generation.
Text tokens
          │
          ▼
┌─────────────────────┐
│  Slow AR (4B)       │ ← Predicts primary codebook (temporal structure)
│  Decoder-only       │   Content, prosody, pacing, emotion
│  Transformer        │
└─────────┬───────────┘
          │ Primary tokens (coarse)
          ▼
┌─────────────────────┐
│  Fast AR (400M)     │ ← Predicts 9 residual codebooks (acoustic detail)
│  4-layer            │   Timbre, breathiness, articulation, texture
│  Transformer        │
└─────────┬───────────┘
          │ Full 10-codebook token sequence
          ▼
┌─────────────────────┐
│  DAC Vocoder        │ ← Neural codec decoder → waveform
│  (Descript Audio    │
│  Codec)             │
└─────────────────────┘
Why Two Autoregressive Models?
Standard single-AR TTS models (like early GPT-based approaches) must predict all codebook levels at every time step. This creates a fundamental tension: the coarse semantic structure (what to say, with what prosody) and the fine acoustic detail (exact timbre, articulation nuances) compete for the same modeling capacity.
S2 Pro separates these concerns:
- Slow AR (4B parameters): Operates along the time axis. At each frame (~21Hz frame rate), it predicts only the primary (first) codebook entry. This is the “what gets said and how”: content, pitch contour, pacing, emotional inflection. The 4B parameter scale gives it the capacity to model long-range temporal dependencies.
- Fast AR (400M parameters, 4-layer transformer): Operates at each time step predicted by the Slow AR. It takes the primary codebook entry and generates the remaining 9 residual codebooks that reconstruct the full acoustic detail. This is the “what it sounds like”: timbre, breathiness, articulation precision, room characteristics.
The asymmetry is deliberate. The Slow AR needs substantial capacity because it must understand language, prosody, and speaker identity. The Fast AR is a conditional refinement network — given the primary token, predict the residual. It is 10x smaller and correspondingly faster.
Token representation:
Frame rate: ~21 Hz (one frame per ~47ms of audio)
Codebooks: 10 per frame (1 primary + 9 residual)
Frame 1: [c₁₁, c₁₂, c₁₃, ..., c₁₁₀]
Frame 2: [c₂₁, c₂₂, c₂₃, ..., c₂₁₀]
...
Frame N: [cN₁, cN₂, cN₃, ..., cN₁₀]
Slow AR predicts: c₁₁ → c₂₁ → ... → cN₁ (1 codebook per frame)
Fast AR predicts: [c₁₂...c₁₁₀] per frame (9 codebooks per frame)
Inference flow:
# Pseudocode for Dual-AR inference
primary_tokens = []
for step in range(max_frames):
    logits = slow_ar.forward(text_tokens, primary_tokens, conditioning)
    next_primary = sample(logits)  # top-p, temperature
    if next_primary == EOS_TOKEN:  # assumed end-of-speech token; stop early
        break
    primary_tokens.append(next_primary)

full_sequence = []
for step in range(len(primary_tokens)):
    residual_logits = fast_ar.forward(primary_tokens[step])
    residuals = sample(residual_logits)  # 9 codebook entries
    full_sequence.append([primary_tokens[step]] + residuals)

waveform = dac_vocoder.decode(full_sequence)  # DAC = Descript Audio Codec
The Multi-Stage Training Recipe
The S2 Pro technical report details a sophisticated three-stage training pipeline that would be impractical for most independent teams to replicate. Each stage targets a different capability.
Stage 1: Semantic Pre-Training
The Slow AR is pre-trained on a massive corpus of interleaved text and audio tokens. The training objective is standard autoregressive next-token prediction — predict the next primary codebook token given all previous tokens.
Training data: ~10 million hours of audio, sourced from:
- Public speech datasets (LibriSpeech, Common Voice, VoxPopuli, etc.)
- Web-crawled video transcripts (video captioning pipeline)
- Podcast and audiobook sources
- Synthetic data from caption generation
Key detail — the video captioning pipeline: Raw video is transcribed with a Whisper-based ASR model. Low-quality segments (noise, silence, overlapping speech) are filtered using a voice activity detection (VAD) and quality assessment model. Only segments with clear, single-speaker speech survive.
Stage 2: Speech Captioning and Instruction Tuning
This is the stage that sets S2 Pro apart from earlier models. The team trained a speech captioner — a multimodal model that takes audio as input and produces natural language descriptions of:
- Speaker demographics (gender, approximate age, accent)
- Speaking style (fast, slow, emphatic, whispered)
- Emotional state (excited, sad, angry, neutral)
- Acoustic environment (quiet room, echo, outdoor)
- Voice quality (breathy, nasal, clear, raspy)
These captions are used as conditioning signals during training, enabling the model to associate specific text descriptions with corresponding acoustic features. This is what makes the [tag] system possible at inference time.
Data pipeline for captions:
Raw audio
→ Whisper ASR (transcript)
→ Speech captioner LLM (description)
→ Quality filter (voice quality score, SNR check)
→ Paired training example: (text, caption, audio tokens)
Stage 3: Reinforcement Learning Alignment (GRPO)
The final stage uses Group Relative Policy Optimization (GRPO) — a variant of RL that avoids the memory overhead of PPO-style value/critic networks by normalizing rewards within sampled groups. The same model suite used for data cleaning serves as the reward signal.
Multi-dimensional reward signals:
| Reward Dimension | Signal Source | What It Measures |
|---|---|---|
| Semantic accuracy | Whisper ASR WER | Does the generated audio say the right words? |
| Instruction adherence | Speech captioner similarity | Does the audio match the [tag] description? |
| Acoustic preference | Learned preference model | Would a human rate this as natural? |
| Timbre similarity | Speaker verification model | Does the cloned voice match the reference? |
The GRPO update works as follows:
- For each prompt, sample N candidate outputs (N=8 in the paper)
- Compute all reward signals for each candidate
- Normalize rewards within the group (mean = 0, std = 1)
- Update model parameters to favor higher-rewarded candidates
This is computationally expensive (N sampled generations per prompt, each scored by every reward model) but produces measurable quality improvements in blind evaluations.
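The group normalization step in the list above can be sketched in a few lines. This is a minimal illustration, not the report's implementation: the function name and the reward values are hypothetical, and a real GRPO update would feed these advantages into a policy-gradient loss.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize rewards within one sampled group to zero mean and unit std."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std over the group
    # eps guards against division by zero when all candidates score the same.
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical combined rewards for N=8 candidates sampled from one prompt.
rewards = [0.62, 0.71, 0.55, 0.80, 0.66, 0.74, 0.58, 0.69]
advantages = group_relative_advantages(rewards)

# Candidates above the group mean get positive advantages, the rest negative.
best_candidate = advantages.index(max(advantages))
```

Because the baseline is the group mean, no separate value network is needed, which is the memory saving the section describes.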
Fine-Grained Control via Natural Language Tags
S2 Pro’s most user-facing innovation is the [tag] system — natural language descriptions embedded inline in the text that control prosody, emotion, and speaking style at a sub-word level.
How it works internally:
During training, the speech captions are converted into a structured tag format. The model learns to associate tag sequences with specific acoustic token patterns. At inference time, tags are treated as additional text tokens in the input sequence:
Input: "Hello [excited] this is amazing news [whisper] but keep it quiet"
→ Tokenized as: [text_tokens] + [tag_tokens] + [text_tokens] + ...
The Slow AR conditions on both text and tag tokens, learning the mapping between tag descriptions and the resulting acoustic token patterns.
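To make the interleaving concrete, here is a small sketch that splits a prompt into ordered text and tag segments. It is only an illustration: the regular expression and the function name are assumptions, and the actual model most likely passes the whole string, brackets included, through its ordinary text tokenizer.

```python
import re

# Matches one bracketed tag such as [excited] or [whisper in small voice].
TAG_PATTERN = re.compile(r"\[([^\[\]]+)\]")


def split_text_and_tags(prompt):
    """Return ordered ("text", ...) and ("tag", ...) segments from a prompt."""
    segments = []
    cursor = 0
    for match in TAG_PATTERN.finditer(prompt):
        text = prompt[cursor:match.start()].strip()
        if text:
            segments.append(("text", text))
        segments.append(("tag", match.group(1).strip()))
        cursor = match.end()
    tail = prompt[cursor:].strip()
    if tail:
        segments.append(("text", tail))
    return segments


segments = split_text_and_tags(
    "Hello [excited] this is amazing news [whisper] but keep it quiet"
)
# segments alternates text and tag entries in the order they appear.
```

Since tags are free-form, the pattern accepts any bracketed phrase rather than checking against a fixed vocabulary.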
Supported tag categories:
| Category | Examples |
|---|---|
| Speaking style | [whisper], [shouting], [singing], [laughing tone], [professional broadcast tone] |
| Emotion | [excited], [sad], [angry], [surprised], [delight] |
| Vocal effects | [breathy], [vocal fry], [nasal], [clearing throat] |
| Volume | [volume up], [volume down], [low volume] |
| Pacing | [pause], [short pause], [emphasis], [slow down] |
| Non-speech | [inhale], [exhale], [sigh], [laughing], [audience laughter] |
| Free-form | [whisper in small voice], [like a news anchor], [pitch up] |
15,000+ unique tags are supported, including free-form text. The system does not rely on a fixed tag vocabulary — because the speech captioner was trained to produce natural language descriptions, the model generalizes to novel descriptions at inference time.
Voice Cloning Mechanism
S2 Pro supports zero-shot voice cloning from short reference samples (10-30 seconds recommended). The mechanism uses in-context learning rather than a separate speaker encoder.
In-Context Cloning
Reference audio is encoded through the same RVQ codec pipeline used during training, producing a sequence of 10-codebook tokens. These tokens are prepended to the text tokens as an input prefix to the Slow AR:
Input sequence: [ref_audio_tokens] [text_tokens]
                         │               │
                         ▼               ▼
               Slow AR conditions   Generates target
               on reference voice   speech tokens
The Slow AR attends to the reference token patterns to reproduce the speaker’s timbre, prosody, and speaking style. Because the model was trained on millions of speakers, it generalizes to novel voices without any fine-tuning.
Reference processing:
# Conceptual S2 Pro voice cloning
ref_audio_16khz = load_audio("reference.wav", sample_rate=16000)
ref_tokens = codec_encoder(ref_audio_16khz)  # RVQ: 10 codebooks × N frames

# Concatenation-based in-context conditioning
input_tokens = [
    BOS_TOKEN,
    *ref_tokens.flatten(),  # Reference acoustic tokens
    SPEAKER_SEPARATOR_TOKEN,
    *text_tokenizer(target_text),
    EOS_TOKEN,
]

# Generate
primary_tokens = slow_ar.generate(input_tokens, max_new_tokens=M)
full_sequence = fast_ar.expand(primary_tokens)
waveform = dac_vocoder.decode(full_sequence)
Native Multi-Speaker Generation
A unique capability: S2 Pro can handle reference audio containing multiple speakers. The model processes each speaker’s features via <|speaker:i|> tokens and learns to attribute different voice characteristics to different speaker IDs in the output.
Input reference: "Speaker A says X, Speaker B says Y"
→ Model extracts separate speaker embeddings for A and B
→ Output with `<|speaker:0|>` and `<|speaker:1|>` tags
→ Generates alternating voices in a single inference pass
This is useful for dialogue generation, podcast scripting, and multi-character narration.
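As a rough sketch of how a dialogue script might be assembled with the speaker tokens described above: the `<|speaker:i|>` token form comes from this section, but the helper name and the exact concatenation format are assumptions, not the documented prompt layout.

```python
def build_dialogue_prompt(turns):
    """Join (speaker_index, text) turns into one string with speaker tokens."""
    parts = []
    for speaker_index, text in turns:
        if speaker_index < 0:
            raise ValueError("speaker index must be non-negative")
        # Assumed layout: each turn is prefixed by its speaker token.
        parts.append(f"<|speaker:{speaker_index}|>{text.strip()}")
    return "".join(parts)


prompt = build_dialogue_prompt([
    (0, "Did you hear the news?"),
    (1, "[whisper] Not yet."),
])
```

Inline tags can sit inside a turn, so style control and speaker switching would combine in one input string.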
The RVQ Codec and DAC Vocoder
RVQ Codec Architecture
S2 Pro uses a Residual Vector Quantization (RVQ) audio codec with 10 codebooks operating at ~21 Hz frame rate.
Input audio (16kHz, 16-bit PCM)
→ Encoder (Conv1D stack with downsampling)
→ RVQ quantization:
Level 0: nearest neighbor in codebook 0 → index c₁
Residual: x₁ = input - codebook_0[c₁]
Level 1: nearest neighbor in codebook 1 for x₁ → index c₂
Residual: x₂ = x₁ - codebook_1[c₂]
... up to 10 levels
→ Output: 10 indices per frame [c₁, c₂, ..., c₁₀]
→ Frame rate: ~21 Hz (~47ms per frame)
Frame rate detail: With 10 codebooks at ~21 Hz, each second of audio is represented by ~210 discrete tokens. A 30-second reference clip yields approximately 6,300 tokens.
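The residual quantization loop and the token arithmetic above can be sketched in pure Python. This is a toy illustration with tiny hand-written 2-D codebooks and three levels, not the S2 Pro codec; all names are hypothetical, and a real codec learns its codebooks and runs on encoder features rather than raw values.

```python
def nearest_index(vector, codebook):
    """Index of the codebook entry closest to vector (squared L2 distance)."""
    def sq_dist(entry):
        return sum((v - e) ** 2 for v, e in zip(vector, entry))
    return min(range(len(codebook)), key=lambda i: sq_dist(codebook[i]))


def rvq_encode(vector, codebooks):
    """Quantize one frame: each level encodes the residual left by the last."""
    residual = list(vector)
    indices = []
    for codebook in codebooks:
        idx = nearest_index(residual, codebook)
        indices.append(idx)
        residual = [r - c for r, c in zip(residual, codebook[idx])]
    return indices, residual


def rvq_decode(indices, codebooks):
    """Reconstruct a frame by summing the chosen entry from every level."""
    out = [0.0] * len(codebooks[0][0])
    for idx, codebook in zip(indices, codebooks):
        out = [o + c for o, c in zip(out, codebook[idx])]
    return out


def token_count(seconds, frame_rate_hz=21, num_codebooks=10):
    """Approximate number of discrete tokens for a clip of given length."""
    return round(seconds * frame_rate_hz * num_codebooks)


# Toy codebooks: each level is a finer version of the previous one.
codebooks = [
    [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0], [0.0, -1.0]],
    [[0.25, 0.0], [0.0, 0.25], [-0.25, 0.0], [0.0, -0.25]],
    [[0.0625, 0.0], [0.0, 0.0625], [-0.0625, 0.0], [0.0, -0.0625]],
]
frame = [0.8, 0.3]
indices, residual = rvq_encode(frame, codebooks)
reconstruction = rvq_decode(indices, codebooks)
reference_tokens = token_count(30)  # 30 s × 21 Hz × 10 codebooks
```

The decoder only needs the indices: summing the selected entries recovers the frame up to the final residual.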
DAC (Descript Audio Codec) Decoder
The final stage uses a DAC (Descript Audio Codec) decoder. It takes the accumulated 10-codebook token sequence and reconstructs the 16kHz audio waveform. Unlike the earlier V1.5 model which used Firefly-GAN, S2 Pro uses the DAC codec for both encoding (audio-to-tokens) and decoding (tokens-to-audio). This is the same family of neural codec used by the broader audio generation ecosystem.
Token sequence [10 × N] → Token embedding lookup
→ Conv1D decoder stack → 16kHz waveform
Streaming Inference with SGLang
S2 Pro is structurally isomorphic to a standard language model (decoder-only transformer), which means it can leverage all of SGLang’s inference optimizations:
| Technique | Benefit |
|---|---|
| Continuous Batching | Multiple requests processed concurrently on one GPU |
| Paged KV Cache | Efficient memory management for long generations |
| CUDA Graph | Reduced kernel launch overhead |
| RadixAttention Prefix Caching | Shared prompt prefixes cached across requests (critical for voice cloning, where the reference prompt is reused) |
Performance on a single NVIDIA H200 (141GB HBM3):
| Metric | Value |
|---|---|
| Real-Time Factor (RTF) | 0.195 (~5x faster than real-time) |
| Time-to-First-Audio (TTFA) | ~100ms |
| Extreme throughput | 3,000+ acoustic tokens/s at RTF < 0.5 |
What RTF 0.195 means: One second of audio is generated in 195ms of compute time. A 10-second voiceover is generated in under 2 seconds.
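The arithmetic behind the real-time factor is simple enough to check directly. A short sketch, with hypothetical helper names:

```python
def generation_seconds(audio_seconds, rtf):
    """Compute time needed to synthesize a clip at a given real-time factor."""
    return audio_seconds * rtf


def realtime_speedup(rtf):
    """How many times faster than real time a given RTF is."""
    return 1.0 / rtf


ten_second_clip = generation_seconds(10, 0.195)  # just under 2 seconds
speedup = realtime_speedup(0.195)                # a little over 5x
```

An RTF below 1.0 is the threshold for streaming: audio is produced faster than it is played back.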
Benchmark Analysis
Seed-TTS Eval
| Model | WER (ZH) | WER (EN) |
|---|---|---|
| Fish Audio S2 Pro | 0.54% | 0.99% |
| Seed-TTS | 1.12% | 2.25% |
| MiniMax-Speech-02 | 0.99% | 1.90% |
| Qwen3-TTS | 0.77% | 1.24% |
| CosyVoice-300M | 2.24% | 4.31% |
WER (Word Error Rate) measures whether the generated audio is intelligible — low WER means the model says the right words. These numbers are from the Fish Audio S2 Technical Report and GitHub README.
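For readers unfamiliar with the metric, the sketch below computes WER as word-level edit distance divided by reference length. It is the standard textbook formulation, not the Seed-TTS Eval scoring script, which may normalize text differently. It splits on whitespace, so Chinese text would need character-level splitting instead.

```python
def word_error_rate(reference, hypothesis):
    """WER = (substitutions + deletions + insertions) / reference word count."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the ref words seen so far
    # (excluding the current one) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (ref_word != hyp_word)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)


# One wrong word out of four reference words gives a WER of 0.25.
example = word_error_rate("the quick brown fox", "the quick brow fox")
```

In a TTS evaluation the hypothesis is the ASR transcript of the generated audio, so the score reflects both synthesis errors and any ASR mistakes.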
Audio Turing Test
The Audio Turing Test measures how often human listeners mistake synthetic speech for human speech. S2 Pro’s score of 0.515 means listeners rated it as human-level or near-human-level in approximately half of the trials. In relative terms, that is about 24% higher than Seed-TTS (0.417) and 33% higher than MiniMax-Speech (0.387).
EmergentTTS-Eval
| Category | Win Rate |
|---|---|
| Paralinguistics | 91.61% |
| Questions | 84.41% |
| Syntactic complexity | 83.39% |
| Overall | 81.88% |
Multilingual Coverage
S2 Pro supports over 80 languages without phonemes or language-specific preprocessing. It learns language identity from the text token stream itself.
Tier breakdown:
| Tier | Languages | Quality Level |
|---|---|---|
| Tier 1 | Japanese, English, Chinese | Best |
| Tier 2 | Korean, Spanish, Portuguese, Arabic, Russian, French, German | Excellent |
| Tier 3+ | Swedish, Italian, Turkish, Norwegian, Dutch, Welsh, Finnish, Polish, Estonian, Hindi, Latin, Urdu, Thai, Vietnamese, Javanese, Bengali, Yoruba, Czech, Swahili, Hebrew, Malay, Ukrainian, Indonesian, Kazakh, Bulgarian, Latvian, Myanmar, Filipino/Tagalog, Slovak, Nepali, Persian, Afrikaans, Greek, Tibetan, Croatian, Romanian, Shona, Maori, Yiddish, Amharic, Belarusian, Khmer, Icelandic, Azerbaijani, Sindhi, Breton, Albanian, Pashto, Mongolian, Haitian Creole, Malayalam, Serbian, Telugu, Georgian, Bosnian, Punjabi, Lithuanian, Kannada, Sinhala, Armenian, Marathi, Assamese, Gujarati, Fo, and 15+ more | Varies by language |
Comparison with Major Competitors
| Feature | S2 Pro | ElevenLabs | OpenAI TTS | Qwen3-TTS | Chatterbox-Turbo |
|---|---|---|---|---|---|
| Open weights | Yes (research license) | No | No | Yes | Yes (MIT) |
| Voice cloning | Zero-shot (10-30s) | Yes | No | Yes (3s) | Yes (5s+) |
| Emotion control | Free-form [tag] | Limited | No | Basic | 0.25-2.0 scale |
| Languages | 80+ | 29 | Limited | Varies | English |
| Parameters | 4B (Slow) + 400M (Fast) | Proprietary | Proprietary | ~3.5B | 350M-500M |
| RTF (streaming) | 0.195 | Cloud latency | Cloud latency | 0.14 (M4 Pro) | ~0.3 (GPU) |
| Runs offline on Mac | ❌ | ❌ | ❌ | ✅ (via MLX) | ✅ |
| Multi-speaker gen | ✅ Native | ❌ | ❌ | ❌ | ❌ |
| Multi-turn dialog | ✅ Native | ✅ | ❌ | ❌ | ❌ |
| License for commercial use | ❌ (Fish Audio Research License) | ✅ API | ✅ API | Check | ✅ MIT |
Deployment Requirements
S2 Pro is not a model you run on a laptop. The 4B parameter Slow AR requires a high-end GPU for real-time inference.
| Setup | Minimum | Recommended |
|---|---|---|
| GPU VRAM | 24GB | 80GB+ (H200 ideal) |
| RAM | 32GB | 64GB |
| Storage | 30GB (model weights) | 50GB+ (with codec + tools) |
| Runtime | SGLang-Omni + CUDA | SGLang-Omni + CUDA + Flash Attention 3 |
| CPU inference | Not practical | Not practical |
SGLang-Omni server setup:
# Install SGLang-Omni
pip install sglang-omni
# Download model weights
hf download fishaudio/s2-pro
# Start the server with a config file
sgl-omni serve \
  --model-path fishaudio/s2-pro \
  --config examples/configs/s2pro_tts.yaml \
  --port 8000
Voice cloning via API:
curl -X POST http://localhost:8000/v1/audio/speech \
  -H "Content-Type: application/json" \
  -d '{
    "input": "Text to synthesize",
    "references": [{
      "audio_path": "/path/to/reference.wav",
      "text": "Reference transcript"
    }]
  }' \
  --output output.wav
The server exposes an OpenAI-compatible /v1/audio/speech endpoint. Reference audio KV states are automatically cached by SGLang’s RadixAttention (86.4% average cache hit rate when reusing the same voice).
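The same request can be issued from Python using only the standard library. This is a sketch that mirrors the curl call above: the endpoint path and payload fields are taken from that example, while the function names, default base URL, and timeout are illustrative assumptions.

```python
import json
import urllib.request


def build_speech_request(text, reference_path, reference_text,
                         base_url="http://localhost:8000"):
    """Build a POST request matching the curl example's JSON payload."""
    payload = {
        "input": text,
        "references": [
            {"audio_path": reference_path, "text": reference_text},
        ],
    }
    return urllib.request.Request(
        url=f"{base_url}/v1/audio/speech",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def synthesize(request, output_path, timeout=120):
    """Send the request and write the returned audio bytes to a file."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        audio = response.read()
    with open(output_path, "wb") as f:
        f.write(audio)
    return len(audio)


request = build_speech_request(
    "Text to synthesize", "/path/to/reference.wav", "Reference transcript"
)
# With a server running locally: synthesize(request, "output.wav")
```

Reusing the same reference path and transcript across calls keeps the prompt prefix identical, which is the condition under which prefix caching can help.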
What S2 Pro Means for the TTS Landscape
- The quality gap between open and closed TTS has effectively closed. S2 Pro matches or beats every commercial API on blind evaluations. The practical difference now is not quality but infrastructure: commercial APIs offer zero-setup access; S2 Pro requires GPU infrastructure.
- Instruction-following is the new frontier. S2 Pro’s [tag] system points toward a future where TTS is controlled through natural language, not sliders or dropdowns. Users describe what they want and the model delivers it.
- Scale still wins. The S2 Pro team trained on millions of hours of data with a multi-stage pipeline costing tens of thousands of GPU-hours. This is not reproducible by individuals. The open-source community benefits from the release of weights, not from the ability to replicate training.
- Privacy remains a differentiator for local TTS. S2 Pro requires cloud-grade GPUs. For users who need offline, private TTS on consumer hardware, smaller models such as Chatterbox-Turbo (350M params, runs on Mac) remain the practical option. The tradeoff is quality vs privacy, and that tradeoff has not gone away.
