Qwen3-TTS vs Fish Audio S2 Pro: Technical Comparison

Qwen3-TTS and Fish Audio S2 Pro represent two different answers to the same question: how do you build the best open-weight text-to-speech system? Both rank among the strongest open-weight TTS systems of 2026, but they take fundamentally different architectural paths to get there.

This comparison covers their architectures, tokenizer design, voice cloning pipelines, streaming capabilities, multilingual support, training data, inference performance, and licensing — then answers which one you should use and why.

Architecture philosophy

The fundamental difference is how each model decomposes the speech generation problem.

Qwen3-TTS uses a dual-track language model that processes text and speech tokens in parallel within a single architecture. Its 12Hz tokenizer uses a 16-layer residual vector quantization (RVQ) scheme where the first codebook encodes semantic content (guided by a WavLM teacher) and the remaining 15 layers capture acoustic detail. A lightweight causal ConvNet decodes codes directly to waveform — no vocoder, no flow matching, no separate modules.

Fish Audio S2 Pro uses a master-slave dual-autoregressive (Dual-AR) design: a 4B parameter Slow AR predicts the primary codebook at ~21Hz (temporal structure, content, prosody), a 400M parameter Fast AR predicts the remaining 9 residual codebooks per frame (acoustic detail, timbre), and a DAC vocoder reconstructs the waveform.

Aspect	Qwen3-TTS	Fish Audio S2 Pro
Paradigm	Dual-track LM (parallel text+speech)	Dual-AR master-slave (Slow + Fast AR)
Speech representation	12Hz 16-layer RVQ multi-codebook	~21Hz 10-layer RVQ multi-codebook
Decoder	Causal ConvNet (direct waveform)	DAC (Descript Audio Codec) vocoder
Architecture complexity	1 unified model	2 autoregressive LMs + codec
Parameter count	600M / 1.7B	4B (Slow) + 400M (Fast)

Qwen3-TTS keeps the entire pipeline in a single model with a lightweight decoder. S2 Pro separates temporal and acoustic modeling into two transformers of very different sizes.

Tokenizer design

Qwen3-TTS’s 12Hz tokenizer operates at 12.5 frames per second (one frame per 80ms). Each frame encodes 16 codebook indices: the first is semantically supervised via WavLM, and the remaining 15 are RVQ residual layers. This hierarchical design means a single frame encodes both “what was said” and “how it sounded” in a dense representation.

S2 Pro’s RVQ codec operates at ~21 Hz (one frame per ~47ms). Each frame encodes 10 codebook indices. Unlike Qwen3-TTS, there is no explicit semantic supervision — the primary codebook learns temporal structure implicitly through autoregressive training on a massive text+speech corpus.
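Both tokenizers rely on residual vector quantization, so a toy version may help make "codebook indices per frame" concrete. The sketch below is a minimal pure-Python illustration of the general RVQ technique, not either model's codec: the two tiny codebooks and the 2-D input are hypothetical values chosen so the arithmetic is easy to follow.

```python
from typing import List, Sequence, Tuple

Vector = Tuple[float, ...]


def nearest(codebook: Sequence[Vector], target: Vector) -> int:
    """Return the index of the codeword closest to target (squared distance)."""
    return min(
        range(len(codebook)),
        key=lambda i: sum((c - t) ** 2 for c, t in zip(codebook[i], target)),
    )


def rvq_encode(codebooks: Sequence[Sequence[Vector]], x: Vector) -> List[int]:
    """Quantize x layer by layer; each layer encodes the previous layer's residual."""
    residual = x
    indices: List[int] = []
    for codebook in codebooks:
        idx = nearest(codebook, residual)
        indices.append(idx)
        residual = tuple(r - c for r, c in zip(residual, codebook[idx]))
    return indices


def rvq_decode(codebooks: Sequence[Sequence[Vector]], indices: Sequence[int]) -> Vector:
    """Reconstruct the vector by summing the selected codeword from every layer."""
    dim = len(codebooks[0][0])
    out = [0.0] * dim
    for codebook, idx in zip(codebooks, indices):
        for d in range(dim):
            out[d] += codebook[idx][d]
    return tuple(out)


# Hypothetical toy codebooks: one coarse layer, one fine residual layer.
TOY_CODEBOOKS = [
    [(0.0, 0.0), (1.0, 1.0), (-1.0, 1.0)],
    [(0.0, 0.0), (0.25, 0.0), (0.0, -0.25)],
]

codes = rvq_encode(TOY_CODEBOOKS, (1.2, 1.0))
approx = rvq_decode(TOY_CODEBOOKS, codes)
print(codes, approx)
```

In this toy, one frame becomes two indices. In the real tokenizers described above, a frame becomes 16 indices (Qwen3-TTS) or 10 indices (S2 Pro), with each extra layer refining what the earlier layers missed.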

Tokenizer metric	Qwen3-TTS (12Hz)	S2 Pro
Frame rate	12.5 Hz (80ms)	~21 Hz (~47ms)
Codebooks per frame	16	10
Semantic supervision	WavLM teacher (codebook 0)	Implicit (AR training)
Reported PESQ	3.68 (narrowband)	Not directly reported
Reported STOI	0.96	Not directly reported

The practical implications:

S2 Pro’s higher frame rate (21 vs 12.5 Hz) gives finer temporal resolution, which likely contributes to its top TTS Arena ranking.
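Qwen3-TTS’s 16 codebooks per frame pack more information into each time step, so the model takes fewer frame-level autoregressive steps per second (12.5 vs ~21), although the total code rate is similar (12.5 × 16 = 200 codes/s vs ~21 × 10 = 210 codes/s).
Qwen3-TTS’s causal ConvNet decoder is a simpler pipeline than S2 Pro’s separate DAC vocoder stage, and is likely cheaper to run.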
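The frame-rate and codebook figures from the table can be turned into per-second numbers with a few lines of arithmetic. This is only a back-of-the-envelope sketch: the `CodecSpec` class is a hypothetical helper, and S2 Pro's "~21 Hz" is assumed to be exactly 21.0 for the calculation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CodecSpec:
    """Frame rate and codebook count for one tokenizer (values from the table)."""
    name: str
    frame_rate_hz: float
    codebooks: int

    @property
    def frame_ms(self) -> float:
        """Duration of audio covered by one frame, in milliseconds."""
        return 1000.0 / self.frame_rate_hz

    @property
    def codes_per_second(self) -> float:
        """Total codebook indices produced per second of audio."""
        return self.frame_rate_hz * self.codebooks


QWEN3_TTS = CodecSpec("Qwen3-TTS (12Hz)", 12.5, 16)
S2_PRO = CodecSpec("S2 Pro", 21.0, 10)  # "~21 Hz" in the text; 21.0 is assumed

for spec in (QWEN3_TTS, S2_PRO):
    print(
        f"{spec.name}: {spec.frame_ms:.1f} ms/frame, "
        f"{spec.codes_per_second:.0f} codes/s"
    )
```

Under these assumptions the two designs emit a similar number of codes per second. The difference lies in how those codes are split between time steps and codebook depth.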

Edge: Tie. Different design philosophies, each effective on the quality benchmarks it reports.

Voice cloning

Qwen3-TTS supports zero-shot voice cloning from approximately 3 seconds of reference audio. The reference is encoded through the Qwen-TTS-Tokenizer, a learnable speaker encoder extracts an embedding, and that embedding conditions the dual-track LM during generation.

S2 Pro uses in-context voice cloning rather than a separate speaker encoder. Reference audio is encoded through the same RVQ codec and prepended as a token prefix to the Slow AR input. This means the model conditions directly on the reference token patterns rather than a compressed embedding. Recommended reference length is 10-30 seconds.

Cloning aspect	Qwen3-TTS	S2 Pro
Method	Learnable speaker encoder	In-context token prefix
Reference length	~3 seconds (minimum)	10-30 seconds (recommended)
Output sample rate	24 kHz	44.1 kHz

The 3-second threshold gives Qwen3-TTS an edge for quick cloning scenarios, but S2 Pro’s in-context approach means longer references (which carry more speaker information) are naturally handled without architectural changes.
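Because S2 Pro prepends the reference as tokens, a longer reference costs context length. The sketch below estimates that cost under two assumptions that are not confirmed by the source: the codec runs at exactly 21.0 Hz, and the Slow AR sees one position per codec frame. The function name is hypothetical.

```python
import math

S2_PRO_FRAME_RATE_HZ = 21.0  # "~21 Hz" in the text; exact value is assumed


def prefix_frames(reference_seconds: float,
                  frame_rate_hz: float = S2_PRO_FRAME_RATE_HZ) -> int:
    """Estimate how many codec frames a reference clip occupies as a prefix."""
    if reference_seconds < 0:
        raise ValueError("reference_seconds must be non-negative")
    return math.ceil(reference_seconds * frame_rate_hz)


for seconds in (10, 30):
    print(f"{seconds}s reference -> about {prefix_frames(seconds)} prefix frames")
```

A speaker-encoder design like Qwen3-TTS's avoids this growth, since the reference is compressed to a single embedding regardless of its length.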

Edge: Qwen3-TTS for short-reference scenarios; S2 Pro for high-fidelity cloning from longer samples.

Voice Design vs natural language tags

Qwen3-TTS VoiceDesign accepts natural language descriptions like “a warm, middle-aged female voice with a gentle tone, suitable for bedtime stories” and generates speech matching that description — no reference audio needed. The voice description is treated as a system prompt that modulates the speaker embedding and prosody conditioning.

S2 Pro’s [tag] system offers fine-grained inline control over prosody, emotion, and speaking style at the sub-word level. Tags like [whisper], [excited], [slow down] are embedded in the text stream and condition the Slow AR. Over 15,000 unique tags are supported, including free-form text.

These are different capabilities aimed at different use cases:

Voice Design creates entire voices from description. You want a “documentary narrator” or “bedtime storyteller”: one description that sets the voice for the whole generation.
S2 Pro’s tags control moment-to-moment delivery. You want one sentence whispered, the next shouted, with a pause and a sigh in between.

Edge: Tie. Voice Design and inline tags target different problems and both do what they do well.

Emotion and prosody control

Qwen3-TTS supports instruction-driven emotion control through its CustomVoice variant. Prompts like “Speak in a very happy tone” or “Use a calm and soothing voice” are prepended in ChatML format and modulate the output. This is coarse-grained — the instruction applies to the entire utterance or large segments.

S2 Pro provides fine-grained emotion and style control through its [tag] system: [excited], [sad], [angry], [surprised], [delight], [laughing tone], [professional broadcast tone], and free-form tags like [whisper in small voice]. These can be changed mid-sentence, even mid-word.
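To make the inline-tag idea concrete, here is a small client-side sketch that splits tagged text into (tag, span) pairs, for example to preview or validate a script before sending it to the model. It is not how S2 Pro itself processes tags. The function name is hypothetical, and the rule that a tag stays active until the next tag is an assumption made for illustration.

```python
import re
from typing import List, Optional, Tuple

# Matches one bracketed tag such as [whisper] or [laughing tone].
TAG_PATTERN = re.compile(r"\[([^\[\]]+)\]")


def split_tagged_text(text: str) -> List[Tuple[Optional[str], str]]:
    """Split text into (active_tag, span) pairs.

    Assumption: a tag applies to the text that follows it, until the next tag.
    Text before the first tag is paired with None.
    """
    segments: List[Tuple[Optional[str], str]] = []
    active_tag: Optional[str] = None
    cursor = 0
    for match in TAG_PATTERN.finditer(text):
        span = text[cursor:match.start()].strip()
        if span:
            segments.append((active_tag, span))
        active_tag = match.group(1).strip()
        cursor = match.end()
    tail = text[cursor:].strip()
    if tail:
        segments.append((active_tag, tail))
    return segments


script = "Hello there. [whisper] Keep this quiet. [excited] We won!"
for tag, span in split_tagged_text(script):
    print(tag, "->", span)
```

The contrast with an utterance-level instruction is visible in the output: one input string yields several spans, each with its own delivery.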

Edge: S2 Pro. Sub-word-level inline control is strictly more expressive than utterance-level instructions.

Streaming and latency

Qwen3-TTS reports native streaming with 97ms first-packet latency. The dual-track architecture generates speech tokens as each text token arrives — no waiting for full text input. The causal ConvNet decodes each 80ms frame immediately.

S2 Pro is structurally a standard language model (decoder-only transformer), which means it can leverage SGLang’s inference optimizations: continuous batching, paged KV cache, CUDA graphs, and RadixAttention prefix caching. On an H200, it reports ~100ms time-to-first-audio and a real-time factor (RTF) of 0.195, meaning audio is generated roughly 5x faster than real time.

Latency metric	Qwen3-TTS	S2 Pro
First-packet latency	97ms (single user)	~100ms (H200)
Streaming design	Native dual-track	SGLang-optimized LM
RTF	~0.29 (0.6B) to ~0.31 (1.7B)	0.195 (H200)
Consumer GPU streaming	Yes (vLLM-Omni, any GPU)	Yes (FP8: RTX 4090; FP16: H200-class)

Both achieve 100ms-range first-packet latency, but on very different hardware tiers. Qwen3-TTS hits 97ms on a consumer RTX 4090. S2 Pro needs an H200 to reach ~100ms.
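The RTF figures in the table are easier to compare once converted to a speed-up. RTF is synthesis time divided by the duration of the audio produced, so lower is faster. The helper names below are hypothetical; the RTF values are the ones quoted in the table.

```python
def real_time_factor(synthesis_seconds: float, audio_seconds: float) -> float:
    """RTF = time spent generating / duration of audio generated."""
    if audio_seconds <= 0:
        raise ValueError("audio_seconds must be positive")
    return synthesis_seconds / audio_seconds


def speedup(rtf: float) -> float:
    """How many seconds of audio are produced per second of compute."""
    if rtf <= 0:
        raise ValueError("rtf must be positive")
    return 1.0 / rtf


REPORTED_RTF = {
    "Qwen3-TTS 0.6B": 0.29,
    "Qwen3-TTS 1.7B": 0.31,
    "S2 Pro (H200)": 0.195,
}

for name, rtf in REPORTED_RTF.items():
    print(f"{name}: RTF {rtf} -> {speedup(rtf):.2f}x real time")
```

These numbers come from different hardware and serving stacks, so they are best read as each project's reported operating point rather than a like-for-like benchmark.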

Edge: Qwen3-TTS. Lower hardware barrier for equivalent latency.

Multi-speaker and multi-turn dialogue

S2 Pro natively supports multi-speaker generation and multi-turn dialogue within a single inference pass. Reference audio containing multiple speakers is parsed via <|speaker:i|> tokens, and the model extracts separate speaker identities for each. Output text can switch between speakers mid-generation using speaker tags. This enables dialogue generation, podcast scripting, and multi-character narration without separate cloning or orchestration.
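A multi-speaker script could be assembled along these lines. Only the `<|speaker:i|>` token shape comes from the description above. Everything else is an assumption for illustration, including the zero-based indices, the lack of a separator between turns, and the `build_dialogue` helper itself. Check the model's own documentation for the exact prompt layout.

```python
from typing import Sequence, Tuple


def build_dialogue(turns: Sequence[Tuple[int, str]]) -> str:
    """Join (speaker_index, line) turns into one speaker-tagged string.

    Assumed layout: each turn is prefixed with <|speaker:i|> and turns are
    concatenated directly. The real model may expect a different layout.
    """
    parts = []
    for speaker_index, line in turns:
        if speaker_index < 0:
            raise ValueError("speaker_index must be non-negative")
        parts.append(f"<|speaker:{speaker_index}|>{line.strip()}")
    return "".join(parts)


dialogue = build_dialogue([
    (0, "Did you hear the news?"),
    (1, "No, what happened?"),
    (0, "We shipped on time."),
])
print(dialogue)
```

With a single-voice model, the same dialogue would need one generation per turn plus audio stitching, which is the orchestration step this design avoids.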

Qwen3-TTS does not support multi-speaker generation or multi-turn dialogue. Each generation produces a single voice. Multi-character scenarios require separate cloning passes and audio stitching.

Capability	Qwen3-TTS	S2 Pro
Multi-speaker generation	❌	✅ Native
Multi-turn dialogue	❌	✅ Native

Edge: S2 Pro. Multi-speaker support is a meaningful advantage for character-driven content.

Multilingual support

Qwen3-TTS supports 10 languages: Chinese, English, Japanese, Korean, German, French, Russian, Portuguese, Spanish, Italian — plus Chinese dialects.

S2 Pro supports over 80 languages with three quality tiers. The best quality is for Japanese, English, and Chinese; excellent quality for Korean, Spanish, Portuguese, Arabic, Russian, French, German; and broad coverage for everything from Swedish to Swahili to Maori.

S2 Pro learns language identity from the text token stream itself — no phonemes or language-specific preprocessing. This is a practical advantage for deployments that need to handle unexpected languages.

Edge: S2 Pro. 80+ languages with no language-specific preprocessing is a meaningful deployment advantage.

Training data

Aspect	Qwen3-TTS	S2 Pro
Training data	5M+ hours (10 languages)	~10M hours (80+ languages)
Post-training	RLHF + rule-based reward	GRPO (multi-dimensional rewards)
Data pipeline	Filtered + continual pre-training	Video captioning + speech captioner + quality filtering

S2 Pro’s training pipeline is the more sophisticated of the two. The speech captioner stage (Stage 2) generates natural language descriptions of speaker demographics, speaking style, emotion, and acoustic environment, which enables the [tag] system. The GRPO alignment stage uses four reward signals simultaneously: semantic accuracy (WER), instruction adherence, acoustic preference, and timbre similarity.
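The core of GRPO is a group-relative advantage: several candidates are sampled for the same prompt, and each is scored against the group's mean and standard deviation. The sketch below shows that step with the four reward signals named above. The equal weights and the weighted-sum combination are assumptions for illustration, not details taken from the S2 Pro report.

```python
from statistics import mean, pstdev
from typing import Dict, List, Sequence

# Hypothetical equal weights; the actual weighting scheme is not given here.
REWARD_WEIGHTS: Dict[str, float] = {
    "semantic": 0.25,     # e.g. derived from WER
    "instruction": 0.25,  # tag / instruction adherence
    "acoustic": 0.25,     # acoustic preference
    "timbre": 0.25,       # timbre similarity to the reference
}


def combined_reward(scores: Dict[str, float]) -> float:
    """Collapse the four reward signals into one scalar (assumed weighted sum)."""
    return sum(REWARD_WEIGHTS[name] * scores[name] for name in REWARD_WEIGHTS)


def group_relative_advantages(rewards: Sequence[float], eps: float = 1e-8) -> List[float]:
    """Normalize each reward against its sampling group: (r - mean) / std."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Three hypothetical candidates sampled for the same prompt.
group = [
    {"semantic": 0.9, "instruction": 0.4, "acoustic": 0.6, "timbre": 0.7},
    {"semantic": 0.8, "instruction": 0.9, "acoustic": 0.7, "timbre": 0.8},
    {"semantic": 0.5, "instruction": 0.3, "acoustic": 0.4, "timbre": 0.6},
]
rewards = [combined_reward(s) for s in group]
advantages = group_relative_advantages(rewards)
print(advantages)
```

Candidates that beat their group average get a positive advantage and are reinforced; the rest are pushed down. No separate value network is needed, which is the usual motivation for GRPO.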

Edge: S2 Pro. More training data, more sophisticated post-training pipeline.

Benchmark results

Benchmark	Qwen3-TTS	S2 Pro
Seed-TTS Eval WER (ZH)	0.77%	0.54%
Seed-TTS Eval WER (EN)	1.24%	0.99%
TTS Arena Elo	—	1339 (1st place)
Audio Turing Test	—	0.515
Speaker similarity (tokenizer)	0.95	Not directly comparable

S2 Pro holds the top spot on TTS Arena and beats Qwen3-TTS on Seed-TTS Eval WER for both Chinese and English. The WER figures are direct comparisons from the S2 Pro technical report; no TTS Arena or Audio Turing Test score is listed for Qwen3-TTS.
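For readers unfamiliar with the metric, WER is the word-level edit distance between a reference transcript and the transcript recognized from the generated audio, divided by the number of reference words. The sketch below is a generic implementation, not the Seed-TTS Eval pipeline. It splits on whitespace, which suits English; Chinese evaluations usually score characters instead, so the tokenization step would differ.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference word count."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] = edit distance between the first i-1 ref words and first j hyp words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


print(word_error_rate("the cat sat on the mat", "the cat sat on mat"))
```

A WER around 1% means roughly one word error per hundred reference words, so the gaps in the table are small in absolute terms.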

Edge: S2 Pro. Across published benchmarks, S2 Pro leads in intelligibility and human preference.

Hardware requirements

Model	Min VRAM	Recommended	Optimal
Qwen3-TTS-0.6B	2 GB	4 GB	8 GB
Qwen3-TTS-1.7B	4 GB	8 GB	12 GB+
S2 Pro (FP16)	24 GB	80 GB (A100)	H200-class
S2 Pro (FP8)	~12 GB	24 GB (RTX 4090)	48 GB+

This is where the comparison diverges sharply. Qwen3-TTS runs on consumer laptops. S2 Pro requires at minimum a high-end desktop GPU.

Qwen3-TTS’s 0.6B variant runs in 2-5 GB VRAM and supports FlashAttention 2 and INT8 quantization. It can run on a Mac via MLX or a consumer NVIDIA GPU such as an RTX 3060.

S2 Pro at FP16 precision requires 24 GB minimum (RTX 4090). A community FP8 quantized variant drops this to approximately 12 GB while running on Ada Lovelace or Blackwell GPUs. At full precision, the recommended production setup is an A100 (80 GB) or H200 (141 GB) with SGLang-Omni and Flash Attention 3. CPU inference is not practical for either precision level.
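A rough lower bound on VRAM is the size of the weights alone: parameter count times bytes per parameter. The sketch below applies that to the parameter counts quoted earlier (4B + 400M for S2 Pro, 0.6B and 1.7B for Qwen3-TTS). It does not model the KV cache, activations, the codec, or runtime overhead, which is presumably why the figures in the table are noticeably higher. The helper name is hypothetical, and 1 GB is taken as 10^9 bytes.

```python
BYTES_PER_PARAM = {"fp16": 2, "fp8": 1, "int8": 1}


def weight_memory_gb(params_billions: float, precision: str) -> float:
    """Weights-only memory in GB (10^9 bytes); a lower bound on real usage."""
    if precision not in BYTES_PER_PARAM:
        raise ValueError(f"unknown precision: {precision}")
    return params_billions * BYTES_PER_PARAM[precision]


MODELS = {
    "Qwen3-TTS-0.6B": 0.6,
    "Qwen3-TTS-1.7B": 1.7,
    "S2 Pro (4B Slow AR + 0.4B Fast AR)": 4.4,
}

for name, params in MODELS.items():
    fp16 = weight_memory_gb(params, "fp16")
    fp8 = weight_memory_gb(params, "fp8")
    print(f"{name}: ~{fp16:.1f} GB at FP16, ~{fp8:.1f} GB at 8-bit (weights only)")
```

Even on this weights-only basis, S2 Pro is several times larger than either Qwen3-TTS variant, which is consistent with the gap in the table.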

Edge: Qwen3-TTS. Its minimum VRAM requirement is roughly 6-12x lower (2 GB vs ~12-24 GB), which puts it within reach of laptops.

Licensing

Aspect	Qwen3-TTS	S2 Pro
Code license	Apache 2.0	No standard OSS license (reported as NOASSERTION)
Model weights	Apache 2.0	Fish Audio Research License (non-commercial)
Commercial use	Yes (code + weights)	No (research license)

This is the other clear differentiator. Qwen3-TTS’s Apache 2.0 license permits commercial use for both code and model weights. S2 Pro’s research license explicitly prohibits commercial deployment.

Teams building commercial products can use Qwen3-TTS under the Apache 2.0 terms. Using S2 Pro commercially requires a separate agreement with Fish Audio.

Edge: Qwen3-TTS. Commercially permissive vs research-only.

The answer: Qwen3-TTS vs Fish Audio S2 Pro

Here is the decision matrix.

Choose Qwen3-TTS if:

You need commercial use. S2 Pro’s research license makes it off-limits for commercial products without a separate agreement.
You run on consumer hardware. Qwen3-TTS runs on a Mac or a laptop GPU. S2 Pro needs at least a high-end desktop GPU with FP8 weights, and A100- or H200-class hardware at FP16.
You want Voice Design from text descriptions. Creating entirely new voices from text prompts without reference audio.
You need short-reference voice cloning (3 seconds). Qwen3-TTS’s speaker encoder extracts a voice from very short samples.

Choose S2 Pro if:

Quality is the only thing that matters. S2 Pro tops TTS Arena and beats Qwen3-TTS on Seed-TTS Eval WER.
You need fine-grained prosody control. The [tag] system with 15,000+ tags is unmatched.
You need broad language coverage. Qwen3-TTS covers 10 languages. S2 Pro covers 80+ with no language-specific preprocessing.
You have enterprise GPU infrastructure. The hardware requirements are steep but the quality ceiling is higher.

The nuanced answer:

For most practical applications, Qwen3-TTS is the better choice: it runs on hardware you already have, is permissively licensed for commercial use, and offers Voice Design and short-reference cloning that S2 Pro doesn’t match.

S2 Pro is the higher-quality system, but it is locked behind a research license and enterprise GPU requirements. If you are a research team evaluating the state of the art, S2 Pro is the one to study. If you are building a product, Qwen3-TTS is the one to deploy.

Where Spokio fits

Spokio is a native Mac text-to-speech app powered by Chatterbox Turbo. It focuses on offline voice generation, local voice cloning from short samples, and batch export without cloud uploads for text, audio, or voice samples. Neither Qwen3-TTS nor S2 Pro is currently packaged in Spokio, but the comparison illustrates the range of open-weight TTS options available in 2026.