Qwen3-TTS and Fish Audio S2 Pro represent two different answers to the same question: how do you build the best open-weight text-to-speech system? Both are at the top of the TTS leaderboards in 2026, but they take fundamentally different architectural paths to get there.
This comparison covers their architectures, tokenizer design, voice cloning pipelines, streaming capabilities, multilingual support, training data, inference performance, and licensing — then answers which one you should use and why.
Architecture philosophy
The fundamental difference is how each model decomposes the speech generation problem.
Qwen3-TTS uses a dual-track language model that processes text and speech tokens in parallel within a single architecture. Its 12Hz tokenizer uses a 16-layer residual vector quantization (RVQ) scheme where the first codebook encodes semantic content (guided by a WavLM teacher) and the remaining 15 layers capture acoustic detail. A lightweight causal ConvNet decodes codes directly to waveform — no vocoder, no flow matching, no separate modules.
Fish Audio S2 Pro uses a master-slave dual-autoregressive (Dual-AR) design: a 4B parameter Slow AR predicts the primary codebook at ~21Hz (temporal structure, content, prosody), a 400M parameter Fast AR predicts the remaining 9 residual codebooks per frame (acoustic detail, timbre), and a DAC vocoder reconstructs the waveform.
| Aspect | Qwen3-TTS | Fish Audio S2 Pro |
|---|---|---|
| Paradigm | Dual-track LM (parallel text+speech) | Dual-AR master-slave (Slow + Fast AR) |
| Speech representation | 12Hz 16-layer RVQ multi-codebook | ~21Hz 10-layer RVQ multi-codebook |
| Decoder | Causal ConvNet (direct waveform) | DAC (Descript Audio Codec) vocoder |
| Architecture complexity | 1 unified model | 2 autoregressive LMs + codec |
| Parameter count | 600M / 1.7B | 4B (Slow) + 400M (Fast) |
Qwen3-TTS keeps the entire pipeline in a single model with a lightweight decoder. S2 Pro separates temporal and acoustic modeling into two transformers of very different sizes.
Tokenizer design
Qwen3-TTS’s 12Hz tokenizer operates at 12.5 frames per second (one frame per 80ms). Each frame encodes 16 codebook indices: the first is semantically supervised via WavLM, and the remaining 15 are RVQ residual layers. This hierarchical design means a single frame encodes both “what was said” and “how it sounded” in a dense representation.
S2 Pro’s RVQ codec operates at ~21 Hz (one frame per ~47ms). Each frame encodes 10 codebook indices. Unlike Qwen3-TTS, there is no explicit semantic supervision — the primary codebook learns temporal structure implicitly through autoregressive training on a massive text+speech corpus.
| Tokenizer metric | Qwen3-TTS (12Hz) | S2 Pro |
|---|---|---|
| Frame rate | 12.5 Hz (80ms) | ~21 Hz (~47ms) |
| Codebooks per frame | 16 | 10 |
| Semantic supervision | WavLM teacher (codebook 0) | Implicit (AR training) |
| Reported PESQ | 3.68 (narrowband) | Not directly reported |
| Reported STOI | 0.96 | Not directly reported |
The practical implications:
- S2 Pro’s higher frame rate (21 vs 12.5 Hz) gives finer temporal resolution, which likely contributes to its top TTS Arena ranking.
- Qwen3-TTS’s 16 codebooks per frame pack more information into each time step, reducing the number of frame-level autoregressive steps needed per second (12.5 vs ~21).
- Qwen3-TTS’s causal ConvNet decoder is simpler and faster than S2 Pro’s DAC vocoder pipeline.
Edge: Tie. Different design philosophies, both effective in their respective quality benchmarks.
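The frame-rate and codebook figures above can be turned into per-second numbers with a little arithmetic. The sketch below is only an illustration: `CodecConfig` is a hypothetical helper, not part of either project, and S2 Pro's "~21 Hz" is rounded to exactly 21.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CodecConfig:
    """Hypothetical container for the tokenizer figures quoted in the table."""
    name: str
    frame_rate_hz: float  # frames per second
    codebooks: int        # codebook indices per frame

    @property
    def frame_ms(self) -> float:
        # Duration of audio covered by one frame, in milliseconds.
        return 1000.0 / self.frame_rate_hz

    @property
    def indices_per_second(self) -> float:
        # Total codebook indices emitted per second of audio.
        return self.frame_rate_hz * self.codebooks


QWEN3_TTS = CodecConfig("Qwen3-TTS (12Hz)", 12.5, 16)
S2_PRO = CodecConfig("S2 Pro", 21.0, 10)  # assumption: "~21 Hz" taken as 21

for cfg in (QWEN3_TTS, S2_PRO):
    print(f"{cfg.name}: {cfg.frame_ms:.1f} ms/frame, "
          f"{cfg.indices_per_second:.0f} codebook indices/s")
```

Under these assumptions the total index rate is similar for both (200 vs 210 per second). The difference is in how that budget is split: fewer, denser frames for Qwen3-TTS and more, lighter frames for S2 Pro.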
Voice cloning
Qwen3-TTS supports zero-shot voice cloning from approximately 3 seconds of reference audio. The reference is encoded through the Qwen-TTS-Tokenizer, a learnable speaker encoder extracts an embedding, and that embedding conditions the dual-track LM during generation.
S2 Pro uses in-context voice cloning rather than a separate speaker encoder. Reference audio is encoded through the same RVQ codec and prepended as a token prefix to the Slow AR input. This means the model conditions directly on the reference token patterns rather than a compressed embedding. Recommended reference length is 10-30 seconds.
| Cloning aspect | Qwen3-TTS | S2 Pro |
|---|---|---|
| Method | Learnable speaker encoder | In-context token prefix |
| Reference length | ~3 seconds (minimum) | 10-30 seconds (recommended) |
| Output sample rate | 24 kHz | 44.1 kHz |
The 3-second threshold gives Qwen3-TTS an edge for quick cloning scenarios, but S2 Pro’s in-context approach means longer references (which carry more speaker information) are naturally handled without architectural changes.
Edge: Qwen3-TTS for short-reference scenarios; S2 Pro for high-fidelity cloning from longer samples.
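To get a feel for what an in-context prefix costs, the reference lengths can be converted into codec frames. This is a rough sketch: `reference_frames` is a hypothetical helper, the frame rates come from the tokenizer section, and it assumes one Slow AR position per frame, which the source does not state explicitly.

```python
import math


def reference_frames(seconds: float, frame_rate_hz: float) -> int:
    """Whole codec frames needed to cover `seconds` of reference audio."""
    return math.ceil(seconds * frame_rate_hz)


S2_PRO_HZ = 21.0      # assumption: "~21 Hz" taken as 21
QWEN3_TTS_HZ = 12.5

# S2 Pro: recommended 10-30 s reference, prepended as a token prefix.
for seconds in (10, 30):
    frames = reference_frames(seconds, S2_PRO_HZ)
    print(f"S2 Pro, {seconds}s reference: {frames} prefix frames")

# Qwen3-TTS: ~3 s reference, encoded and then compressed into an embedding.
print(f"Qwen3-TTS, 3s reference: {reference_frames(3, QWEN3_TTS_HZ)} frames")
```

A longer prefix gives the model more speaker detail to attend to, at the price of a longer context on every generation that uses it.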
Voice Design vs natural language tags
Qwen3-TTS VoiceDesign accepts natural language descriptions like “a warm, middle-aged female voice with a gentle tone, suitable for bedtime stories” and generates speech matching that description — no reference audio needed. The voice description is treated as a system prompt that modulates the speaker embedding and prosody conditioning.
S2 Pro’s [tag] system offers fine-grained inline control over prosody, emotion, and speaking style at the sub-word level. Tags like [whisper], [excited], [slow down] are embedded in the text stream and condition the Slow AR. Over 15,000 unique tags are supported, including free-form text.
These are different capabilities aimed at different use cases:
- Voice Design creates entire voices from description. You want a “documentary narrator” or “bedtime storyteller”: one description that sets the voice for the whole generation.
- S2 Pro’s tags control moment-to-moment delivery. You want one sentence whispered, the next shouted, with a pause and a sigh in between.
Edge: Tie. Voice Design and inline tags target different problems and both do what they do well.
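For a sense of what inline tagging looks like in a script, here is a small sketch that splits bracketed tags from the surrounding text. It is not Fish Audio's parser: the model consumes tags as part of the text stream, and `split_tagged_text` is a hypothetical helper that is only useful for inspecting or linting a script before sending it.

```python
import re
from typing import List, Tuple

# Matches one [tag]; nested brackets are deliberately not supported here.
TAG_PATTERN = re.compile(r"\[([^\[\]]+)\]")


def split_tagged_text(text: str) -> List[Tuple[str, str]]:
    """Split text into ("tag", value) and ("text", value) pieces, in order."""
    pieces: List[Tuple[str, str]] = []
    cursor = 0
    for match in TAG_PATTERN.finditer(text):
        plain = text[cursor:match.start()].strip()
        if plain:
            pieces.append(("text", plain))
        pieces.append(("tag", match.group(1).strip()))
        cursor = match.end()
    tail = text[cursor:].strip()
    if tail:
        pieces.append(("text", tail))
    return pieces


script = "[whisper] Don't wake the baby. [excited] She took her first steps today!"
for kind, value in split_tagged_text(script):
    print(f"{kind:>4}: {value}")
```

The same two sentences with Voice Design would carry a single description for the whole utterance, with no per-sentence switches.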
Emotion and prosody control
Qwen3-TTS supports instruction-driven emotion control through its CustomVoice variant. Prompts like “Speak in a very happy tone” or “Use a calm and soothing voice” are prepended in ChatML format and modulate the output. This is coarse-grained — the instruction applies to the entire utterance or large segments.
S2 Pro provides fine-grained emotion and style control through its [tag] system: [excited], [sad], [angry], [surprised], [delight], [laughing tone], [professional broadcast tone], and free-form tags like [whisper in small voice]. These can be changed mid-sentence, even mid-word.
Edge: S2 Pro. Sub-word-level inline control is strictly more expressive than utterance-level instructions.
Streaming and latency
Qwen3-TTS reports native streaming with 97ms first-packet latency. The dual-track architecture generates speech tokens as each text token arrives — no waiting for full text input. The causal ConvNet decodes each 80ms frame immediately.
S2 Pro is structurally a standard language model (decoder-only transformer), which means it can leverage SGLang’s inference optimizations: continuous batching, paged KV cache, CUDA graphs, and RadixAttention prefix caching. On an H200, it reports ~100ms time-to-first-audio and RTF of 0.195 (~5x real-time).
| Latency metric | Qwen3-TTS | S2 Pro |
|---|---|---|
| First-packet latency | 97ms (single user) | ~100ms (H200) |
| Streaming design | Native dual-track | SGLang-optimized LM |
| RTF | ~0.29 (0.6B) to ~0.31 (1.7B) | 0.195 (H200) |
| Consumer GPU streaming | Yes (vLLM-Omni, any GPU) | Yes (FP8: RTX 4090; FP16: H200-class) |
Both achieve 100ms-range first-packet latency, but on very different hardware tiers. Qwen3-TTS hits 97ms on a consumer RTX 4090. S2 Pro needs an H200 to reach ~100ms.
Edge: Qwen3-TTS. Lower hardware barrier for equivalent latency.
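Real-time factor (RTF) is generation time divided by audio duration, so the reported figures convert directly into speedups and wall-clock estimates. A minimal sketch, using only the RTF values from the table; the function names are illustrative, and the numbers come from different hardware, so they are not a like-for-like benchmark.

```python
def speedup_from_rtf(rtf: float) -> float:
    """How many seconds of audio are produced per second of compute."""
    if rtf <= 0:
        raise ValueError("RTF must be positive")
    return 1.0 / rtf


def generation_seconds(audio_seconds: float, rtf: float) -> float:
    """Estimated compute time needed to synthesize `audio_seconds` of audio."""
    return audio_seconds * rtf


REPORTED_RTF = {
    "Qwen3-TTS 0.6B": 0.29,
    "Qwen3-TTS 1.7B": 0.31,
    "S2 Pro (H200)": 0.195,
}

for name, rtf in REPORTED_RTF.items():
    print(f"{name}: {speedup_from_rtf(rtf):.2f}x real-time, "
          f"{generation_seconds(60, rtf):.1f}s per minute of audio")
```

Anything with RTF below 1.0 can keep up with playback, which is why first-packet latency, not RTF, is the number that matters most for interactive use.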
Multi-speaker and multi-turn dialogue
S2 Pro natively supports multi-speaker generation and multi-turn dialogue within a single inference pass. Reference audio containing multiple speakers is parsed via <|speaker:i|> tokens, and the model extracts separate speaker identities for each. Output text can switch between speakers mid-generation using speaker tags. This enables dialogue generation, podcast scripting, and multi-character narration without separate cloning or orchestration.
Qwen3-TTS does not support multi-speaker generation or multi-turn dialogue. Each generation produces a single voice. Multi-character scenarios require separate cloning passes and audio stitching.
| Capability | Qwen3-TTS | S2 Pro |
|---|---|---|
| Multi-speaker generation | ❌ | ✅ Native |
| Multi-turn dialogue | ❌ | ✅ Native |
Edge: S2 Pro. Multi-speaker support is a meaningful advantage for character-driven content.
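As an illustration of a speaker-tagged script, the sketch below formats dialogue turns with `<|speaker:i|>` tokens. Only the token form comes from the description above. The layout (one turn per line, zero-based indices) and the `build_dialogue_prompt` helper are assumptions, so check the S2 Pro documentation for the exact prompt format.

```python
from typing import Iterable, Tuple


def build_dialogue_prompt(turns: Iterable[Tuple[int, str]]) -> str:
    """Join (speaker_index, text) turns into one speaker-tagged script.

    Assumed layout: one turn per line, each prefixed with <|speaker:i|>.
    """
    lines = []
    for speaker_index, text in turns:
        if speaker_index < 0:
            raise ValueError("speaker index must be non-negative")
        lines.append(f"<|speaker:{speaker_index}|>{text.strip()}")
    return "\n".join(lines)


turns = [
    (0, "Did you listen to the new episode?"),
    (1, "Twice. The second half is even better."),
    (0, "No spoilers, please."),
]
print(build_dialogue_prompt(turns))
```

With Qwen3-TTS, the equivalent workflow would synthesize each turn separately with its own cloned voice and then concatenate the audio.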
Multilingual support
Qwen3-TTS supports 10 languages: Chinese, English, Japanese, Korean, German, French, Russian, Portuguese, Spanish, Italian — plus Chinese dialects.
S2 Pro supports over 80 languages with three quality tiers. The best quality is for Japanese, English, and Chinese; excellent quality for Korean, Spanish, Portuguese, Arabic, Russian, French, German; and broad coverage for everything from Swedish to Swahili to Maori.
S2 Pro learns language identity from the text token stream itself — no phonemes or language-specific preprocessing. This is a practical advantage for deployments that need to handle unexpected languages.
Edge: S2 Pro. 80+ languages with no language-specific preprocessing is a meaningful deployment advantage.
Training data
| Aspect | Qwen3-TTS | S2 Pro |
|---|---|---|
| Training data | 5M+ hours (10 languages) | ~10M hours (80+ languages) |
| Post-training | RLHF + rule-based reward | GRPO (multi-dimensional rewards) |
| Data pipeline | Filtered + continual pre-training | Video captioning + speech captioner + quality filtering |
S2 Pro’s training pipeline is the more sophisticated of the two. The speech captioner stage (Stage 2) generates natural language descriptions of speaker demographics, speaking style, emotion, and acoustic environment, which enables the [tag] system. The GRPO alignment stage uses four reward signals simultaneously: semantic accuracy (WER), instruction adherence, acoustic preference, and timbre similarity.
Edge: S2 Pro. More training data, more sophisticated post-training pipeline.
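The core idea of GRPO is to score several candidate outputs for the same prompt and normalize each reward against its own group. The sketch below shows that step with the four reward signals named above. The equal weights, the example scores, and the use of population standard deviation are assumptions for illustration, not details from the S2 Pro report.

```python
from statistics import mean, pstdev
from typing import Dict, List

# Hypothetical weights: the source names the four signals but not how they mix.
REWARD_WEIGHTS: Dict[str, float] = {
    "semantic_accuracy": 1.0,      # e.g. derived from WER
    "instruction_adherence": 1.0,
    "acoustic_preference": 1.0,
    "timbre_similarity": 1.0,
}


def combined_reward(scores: Dict[str, float]) -> float:
    """Weighted sum of the per-dimension scores for one candidate."""
    return sum(weight * scores[name] for name, weight in REWARD_WEIGHTS.items())


def group_relative_advantages(rewards: List[float], eps: float = 1e-8) -> List[float]:
    """Normalize rewards within one group: (r - mean) / (std + eps)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Three made-up candidates generated for the same prompt.
group = [
    {"semantic_accuracy": 0.9, "instruction_adherence": 0.4,
     "acoustic_preference": 0.6, "timbre_similarity": 0.8},
    {"semantic_accuracy": 0.7, "instruction_adherence": 0.9,
     "acoustic_preference": 0.7, "timbre_similarity": 0.7},
    {"semantic_accuracy": 0.5, "instruction_adherence": 0.3,
     "acoustic_preference": 0.4, "timbre_similarity": 0.6},
]
rewards = [combined_reward(scores) for scores in group]
print(group_relative_advantages(rewards))
```

Candidates that beat their group average get a positive advantage and are reinforced. No separate value model is needed, which is the main practical appeal of the method.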
Benchmark results
| Benchmark | Qwen3-TTS | S2 Pro |
|---|---|---|
| Seed-TTS Eval WER (ZH) | 0.77% | 0.54% |
| Seed-TTS Eval WER (EN) | 1.24% | 0.99% |
| TTS Arena Elo | — | 1339 (1st place) |
| Audio Turing Test | — | 0.515 |
| Speaker similarity (tokenizer) | 0.95 | Not directly comparable |
S2 Pro holds the top spot on TTS Arena and beats Qwen3-TTS on Seed-TTS Eval WER for both Chinese and English. These are direct comparisons from the S2 Pro technical report.
Edge: S2 Pro. Across published benchmarks, S2 Pro leads in intelligibility and human preference.
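For readers unfamiliar with the WER metric, it is the word-level edit distance between a reference transcript and the recognized output, divided by the reference length. The sketch below is a generic implementation, not the Seed-TTS Eval pipeline, and whitespace splitting only suits languages like English; for Chinese, character-level units are the usual choice.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference word count."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] is the edit distance between the reference words processed
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)


print(word_error_rate("the cat sat on the mat", "the cat sat on mat"))
```

At the error rates in the table, roughly one word in a hundred is wrong, so the gap between the two systems amounts to a few words per thousand.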
Hardware requirements
| Model | Min VRAM | Recommended | Optimal |
|---|---|---|---|
| Qwen3-TTS-0.6B | 2 GB | 4 GB | 8 GB |
| Qwen3-TTS-1.7B | 4 GB | 8 GB | 12 GB+ |
| S2 Pro (FP16) | 24 GB | 80 GB (A100) | H200-class |
| S2 Pro (FP8) | ~12 GB | 24 GB (RTX 4090) | 48 GB+ |
This is where the comparison diverges sharply. Qwen3-TTS runs on consumer laptops. S2 Pro requires at minimum a high-end desktop GPU.
Qwen3-TTS’s 0.6B variant runs in 2-5 GB VRAM and supports FlashAttention 2 and INT8 quantization. It can run on a Mac via MLX or a consumer NVIDIA GPU such as an RTX 3060.
S2 Pro at FP16 precision requires 24 GB minimum (RTX 4090). A community FP8 quantized variant drops this to approximately 12 GB and runs on Ada Lovelace or Blackwell GPUs. At full precision, the recommended production setup is an A100 (80 GB) or H200 (141 GB) with SGLang-Omni and FlashAttention 3. CPU inference is not practical at either precision level.
Edge: Qwen3-TTS. Roughly an order of magnitude lower minimum VRAM (2 GB vs 24 GB at FP16).
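A quick back-of-the-envelope check helps explain the gap: weight memory is roughly parameter count times bytes per parameter. The sketch below covers weights only and uses the parameter counts from the architecture table. The requirements in the hardware table are higher because KV cache, activations, the codec, and framework overhead all add to the total.

```python
# Bytes needed to store one parameter at each precision.
BYTES_PER_PARAM = {"fp16": 2, "fp8": 1, "int8": 1}


def weight_gb(params_billions: float, precision: str) -> float:
    """Approximate weight memory in GB (1 GB = 1e9 bytes), weights only."""
    return params_billions * BYTES_PER_PARAM[precision]


MODELS = {
    "Qwen3-TTS-0.6B": 0.6,
    "Qwen3-TTS-1.7B": 1.7,
    "S2 Pro (Slow + Fast AR)": 4.0 + 0.4,
}

for name, params in MODELS.items():
    print(f"{name}: ~{weight_gb(params, 'fp16'):.1f} GB at FP16, "
          f"~{weight_gb(params, 'fp8'):.1f} GB at 8-bit")
```

Even before runtime overhead, S2 Pro's weights are several times larger than the biggest Qwen3-TTS variant, which is consistent with the direction of the table above.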
Licensing
| Aspect | Qwen3-TTS | S2 Pro |
|---|---|---|
| Code license | Apache 2.0 | NOASSERTION (no standard OSS license) |
| Model weights | Apache 2.0 | Fish Audio Research License (non-commercial) |
| Commercial use | Yes (code + weights) | No (research license) |
This is the other clear differentiator. Qwen3-TTS’s Apache 2.0 license permits commercial use for both code and model weights. S2 Pro’s research license explicitly prohibits commercial deployment.
Teams building commercial products can use Qwen3-TTS under the Apache 2.0 terms. Commercial use of S2 Pro requires a separate agreement with Fish Audio.
Edge: Qwen3-TTS. Commercially permissive vs research-only.
The answer: Qwen3-TTS vs Fish Audio S2 Pro
Here is the decision matrix.
Choose Qwen3-TTS if:
- You need commercial use. S2 Pro’s research license makes it off-limits for commercial products without a separate agreement.
- You run on consumer hardware. Qwen3-TTS runs on a Mac or a laptop GPU. S2 Pro needs at least a 24 GB desktop GPU at FP16 (about 12 GB with the community FP8 build) and an H200-class card to reach its reported latency.
- You want Voice Design from text descriptions. Creating entirely new voices from text prompts without reference audio.
- You need short-reference voice cloning (3 seconds). Qwen3-TTS’s speaker encoder extracts a voice from very short samples.
Choose S2 Pro if:
- Quality is the only thing that matters. S2 Pro tops TTS Arena and beats Qwen3-TTS on Seed-TTS Eval WER.
- You need fine-grained prosody control. The [tag] system with 15,000+ tags is unmatched.
- You support 80+ languages. Qwen3-TTS covers 10. S2 Pro covers 80+ with no language-specific preprocessing.
- You have enterprise GPU infrastructure. The hardware requirements are steep but the quality ceiling is higher.
The nuanced answer:
For most practical applications, Qwen3-TTS is the better choice: it runs on hardware you already have, is permissively licensed for commercial use, and offers Voice Design and short-reference cloning that S2 Pro doesn’t match.
S2 Pro is the higher-quality system, but it is locked behind a research license and enterprise GPU requirements. If you are a research team evaluating the state of the art, S2 Pro is the one to study. If you are building a product, Qwen3-TTS is the one to deploy.
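The decision matrix can be condensed into a few lines of code. This is a simplified sketch of the reasoning in this section, not an official selection tool: the `Requirements` fields and the 12 GB cutoff (the approximate FP8 minimum quoted earlier) are choices made for illustration.

```python
from dataclasses import dataclass


@dataclass
class Requirements:
    """Hypothetical summary of a project's constraints."""
    commercial_use: bool
    vram_gb: float
    quality_first: bool = False
    needs_inline_tags: bool = False
    needs_multi_speaker: bool = False
    needs_languages_beyond_ten: bool = False


def recommend(req: Requirements) -> str:
    # Licensing comes first: S2 Pro's research license rules out
    # commercial products without a separate agreement.
    if req.commercial_use:
        return "Qwen3-TTS"
    # Below the approximate FP8 minimum, S2 Pro does not fit at all.
    if req.vram_gb < 12:
        return "Qwen3-TTS"
    # With license and hardware cleared, S2 Pro's strengths decide.
    if (req.quality_first or req.needs_inline_tags
            or req.needs_multi_speaker or req.needs_languages_beyond_ten):
        return "S2 Pro"
    return "Qwen3-TTS"


print(recommend(Requirements(commercial_use=True, vram_gb=80)))
print(recommend(Requirements(commercial_use=False, vram_gb=141, quality_first=True)))
```

The order of the checks mirrors the argument above: license and hardware are hard constraints, while quality and features only matter once those are satisfied.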
Where Spokio fits
Spokio is a native Mac text-to-speech app powered by Chatterbox Turbo. It focuses on offline voice generation, local voice cloning from short samples, and batch export without cloud uploads for text, audio, or voice samples. Neither Qwen3-TTS nor S2 Pro is currently packaged in Spokio, but the comparison illustrates the range of open-weight TTS options available in 2026.
