
Voice Cloning Models: How They Work — A Survey of Major Approaches

A technical survey of major voice cloning model approaches: XTTS-v2, OpenVoice, ElevenLabs, CosyVoice, GPT-SoVITS, Fish Speech, Orpheus, Qwen3-TTS, and Chatterbox. Covers speaker adaptation vs conditioning vs in-context learning, plus a taxonomy table for model evaluation.

Published on May 17, 2026 | 20 min read

Voice cloning is the ability to synthesize speech that sounds like a specific person, given a reference audio sample of that person speaking. It is distinct from multi-speaker TTS (which selects from a fixed set of pre-enrolled voices) and from voice conversion (which transforms one speaker’s voice into another’s while preserving content).

This article surveys several major voice cloning models — open-source and proprietary — covering how each approaches the problem architecturally, what tradeoffs they make, and where the field stands as of 2026.


The Three Core Approaches

Most voice cloning systems can be understood through one of three categories:

| Approach | How It Works | Reference Audio Needed | Quality | Use Case |
| --- | --- | --- | --- | --- |
| Speaker Adaptation | Fine-tune model weights on target voice | Often minutes or more | High consistency when tuned well | Production voices, consistent output |
| Speaker Conditioning | Extract embedding, condition generation | Often seconds to minutes | Strong for quick cloning | Quick cloning, instant use |
| In-Context Learning | Prompt model with reference tokens | Often short samples | Model-dependent | Zero-shot or few-shot workflows |

Speaker Adaptation

The model receives additional training on target-speaker audio. Gradients update the weights to specialize for that voice.

Base model weights → Fine-tune on target audio (100-1000 steps) → Specialized weights

Pros: Highest quality, most consistent across diverse text. Captures subtle prosodic patterns.

Cons: Takes minutes to hours of training. Requires GPU. Storage grows with every speaker, since each voice needs its own weights or adapter. Cannot switch speakers without loading a different set of weights.

Used by: ElevenLabs Professional Voice Cloning, GPT-SoVITS (few-shot mode), Orpheus fine-tuning, Coqui XTTS fine-tuning.
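
As a toy illustration of the adaptation loop above (not the training code of any system listed here), the sketch below fine-tunes a one-parameter "base model" on a few made-up target-speaker samples with plain gradient descent. The function name `fine_tune`, the data, and the hyperparameters are all assumptions chosen for the example.

```python
def fine_tune(base_weight, samples, steps=200, lr=0.05):
    """Return a weight specialized to `samples` via gradient descent.

    The toy model is y = w * x with mean squared error loss.
    """
    w = base_weight
    for _ in range(steps):
        # dL/dw for L = mean((w*x - y)^2) is mean(2 * (w*x - y) * x)
        grad = sum(2 * (w * x - y) * x for x, y in samples) / len(samples)
        w -= lr * grad
    return w

# Hypothetical target-speaker data generated by y = 1.5 * x
target_samples = [(0.5, 0.75), (1.0, 1.5), (2.0, 3.0)]
base_weight = 1.0                       # shared starting point, left unchanged
speaker_weight = fine_tune(base_weight, target_samples)
```

Each speaker ends up with a separate `speaker_weight`, which is why adaptation costs extra storage per voice and requires loading different weights to switch speakers.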

Speaker Conditioning

The model extracts a fixed-dimensional vector (embedding) from the reference audio and injects it as a conditioning signal at generation time. No weight updates.

Reference audio → Speaker encoder → d-vector / embedding → Conditions AR decoder
Text → AR decoder (conditioned) → Audio tokens → Vocoder → Waveform

Pros: Instant — no waiting. Single model handles unlimited speakers. Switch speakers between sentences.

Cons: Quality capped by encoder capacity. Less consistent across diverse text styles. Sensitive to reference audio quality.

Used by: XTTS-v2, OpenVoice, CosyVoice, Chatterbox, Qwen3-TTS.
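
A minimal sketch of the conditioning idea, assuming a stand-in "encoder" that simply mean-pools per-frame features: real speaker encoders are learned networks, and the feature values below are invented. The point it shows is that the embedding has a fixed size regardless of reference length and that no weights change at cloning time.

```python
import math

def speaker_embedding(frames):
    """Mean-pool per-frame feature vectors into one fixed-size, unit-norm vector."""
    dim = len(frames[0])
    mean = [sum(f[i] for f in frames) / len(frames) for i in range(dim)]
    norm = math.sqrt(sum(v * v for v in mean)) or 1.0
    return [v / norm for v in mean]

def cosine(a, b):
    """Cosine similarity for vectors that are already unit-norm."""
    return sum(x * y for x, y in zip(a, b))

# Hypothetical 3-dimensional frame features
short_ref = [[1.0, 0.0, 1.0], [1.0, 0.2, 0.8]]
long_ref = short_ref * 50                      # same voice, much longer clip
other_speaker = [[0.0, 1.0, 0.1], [0.1, 0.9, 0.0]]

emb_short = speaker_embedding(short_ref)
emb_long = speaker_embedding(long_ref)
emb_other = speaker_embedding(other_speaker)
```

The same `cosine` helper is a common way to score speaker similarity between a reference and generated audio, provided both go through the same encoder.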

In-Context Learning

The model is a language model trained on interleaved text and audio tokens. Cloning is done by including reference audio tokens in the prompt, similar to few-shot prompting in LLMs.

Prompt: [ref_audio_tokens] [ref_text] [gen_marker] [target_text]
→ LLM autoregressively generates target audio tokens

Pros: No explicit speaker encoder needed. Can leverage any number of reference examples. Emergent cross-lingual transfer.

Cons: Prompt length grows with reference. May overfit to reference prosody. Sensitive to prompt formatting.

Used by: Orpheus (pretrained mode), CosyVoice (zero-shot mode), Fish Speech.
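
The prompt layout above can be sketched as plain list concatenation. This is an illustration only: `GEN_MARKER` is a hypothetical separator, whitespace splitting stands in for a real text tokenizer, and the integer audio tokens are made up.

```python
GEN_MARKER = "<gen>"  # hypothetical separator token

def build_prompt(ref_audio_tokens, ref_text, target_text):
    """Assemble [ref_audio_tokens] [ref_text] [gen_marker] [target_text]."""
    return (
        list(ref_audio_tokens)
        + ref_text.split()
        + [GEN_MARKER]
        + target_text.split()
    )

short_prompt = build_prompt([101, 102, 103], "hello there", "good morning everyone")
long_prompt = build_prompt(list(range(100, 400)), "hello there", "good morning everyone")
```

Here `len(short_prompt)` is 9 and `len(long_prompt)` is 306: the prompt grows one-for-one with the reference tokens, which is the context-length cost noted in the cons.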


Model-by-Model Architecture Survey

1. XTTS-v2 (Coqui)

Snapshot: 2023-era release | License: verify current terms before commercial use | Scale: large open model

XTTS-v2 is a GPT2-based autoregressive model with a Perceiver-based speaker conditioning mechanism.

Architecture:

Reference audio (6s+) → Mel-spectrogram → Perceiver encoder → 32 latent vectors
Text → GPT2 backbone (latents prepended as a soft prompt)
→ Discrete audio tokens (VQ-VAE codes) → HiFi-GAN → 24kHz waveform

Key components:

| Component | Detail |
| --- | --- |
| Backbone | GPT2 (decoder-only transformer) |
| Speaker encoder | Perceiver: takes a mel-spectrogram, outputs 32 fixed latent vectors |
| Conditioning | Latent vectors prefixed to GPT2 input sequence (similar to a soft prompt) |
| Audio tokenizer | VQ-VAE with discrete codebook |
| Vocoder | HiFi-GAN |
| Languages | 17 (en, es, fr, de, it, pt, pl, tr, ru, nl, cs, ar, zh-cn, ja, hu, ko, hi) |
| Min reference | ~6 seconds |
| Streaming | ~150ms latency (pure PyTorch, consumer GPU) |

Perceiver mechanism detail: The Perceiver replaces the simpler encoder used in XTTS-v1. Instead of a single vector, it produces 32 latent vectors that capture different aspects of the speaker’s voice. These vectors are prepended to the GPT2’s input sequence as a “soft prompt.” This design:

  • Allows multiple reference audio clips (concatenated before encoding)
  • Enables speaker interpolation (blending two references)
  • Produces more consistent speaker identity than single-vector approaches
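
To make the "variable-length audio in, 32 vectors out" property concrete, here is a simplified single-head cross-attention pool in pure Python. It is a sketch under stated assumptions, not XTTS-v2's implementation: the queries are random rather than learned, there are no key/value projections, and `blend` is only one plausible way to interpolate two speakers.

```python
import math
import random

def softmax(scores):
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def perceiver_pool(latent_queries, frames):
    """Each latent query attends over all frames and yields one output vector."""
    dim = len(frames[0])
    scale = math.sqrt(dim)
    outputs = []
    for query in latent_queries:
        scores = [sum(q * f for q, f in zip(query, frame)) / scale for frame in frames]
        weights = softmax(scores)
        outputs.append([
            sum(w * frame[i] for w, frame in zip(weights, frames)) for i in range(dim)
        ])
    return outputs

def blend(latents_a, latents_b, alpha=0.5):
    """Linear interpolation between two speakers' latent sets."""
    return [
        [(1 - alpha) * a + alpha * b for a, b in zip(va, vb)]
        for va, vb in zip(latents_a, latents_b)
    ]

rng = random.Random(0)
DIM, NUM_LATENTS = 8, 32
# Random stand-ins; in a trained model the queries are learned parameters.
queries = [[rng.gauss(0, 1) for _ in range(DIM)] for _ in range(NUM_LATENTS)]
clip_short = [[rng.gauss(0, 1) for _ in range(DIM)] for _ in range(50)]
clip_long = [[rng.gauss(0, 1) for _ in range(DIM)] for _ in range(400)]

latents_short = perceiver_pool(queries, clip_short)   # 32 vectors
latents_long = perceiver_pool(queries, clip_long)     # still 32 vectors
mixed = blend(latents_short, latents_long, alpha=0.25)
```

Because the output count is set by the number of queries, concatenating several reference clips only lengthens `frames`; the prefix handed to the backbone stays the same size.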

XTTS-v2 pipeline:

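import torch
from TTS.api import TTS

# Use the GPU when one is available; XTTS-v2 also runs (slowly) on CPU.
device = "cuda" if torch.cuda.is_available() else "cpu"
tts = TTS("tts_models/multilingual/multi-dataset/xtts_v2").to(device)

# speaker_wav is the reference clip; language must be one of the supported codes.
tts.tts_to_file(
    text="Hello, this is a cloned voice.",
    speaker_wav="/path/to/reference.wav",
    language="en",
    file_path="output.wav",
)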

Limitations:

  • Non-commercial license (CPML)
  • Perceiver occasionally produces unstable speaker embeddings with very short (<3s) references
  • Coqui AI shut down in 2024 — no further development; community forks exist

2. OpenVoice V2 (MyShell / MIT)

Snapshot: V1/V2-era releases | License: verify current terms | Scale: smaller local model family

OpenVoice’s defining innovation is decoupling tone color from voice style. Most cloning systems couple these — the cloned voice inherits both the timbre and the prosody/accent/emotion of the reference. OpenVoice separates them.

Architecture:

Reference audio → Tone Color Encoder → tone color embedding
Target text → Base Speaker TTS → neutral mel-spectrogram
→ Tone Color Converter (VS network) → mel with target timbre
→ HiFi-GAN → waveform
Style params (emotion, accent, speed) → Style Encoder → style embedding

Decoupled design:

The OpenVoice pipeline runs in two stages:
1. A single-speaker TTS generates neutral speech from text
2. A tone color converter transplants the reference speaker's timbre
   onto the neutral speech, guided by the style embedding

Key components:

| Component | Detail |
| --- | --- |
| Base TTS | Single-speaker model (trained on ~30K sentences from 4 speakers) |
| Tone color encoder | Extracts speaker embedding (timbre only, stripped of style) |
| Style encoder | Extracts emotion/accent from reference |
| Tone color converter | Visinger (VS) network with flow layers, transplants timbre |
| Vocoder | HiFi-GAN |
| Min reference | ~1 second |
| Languages | 6 native (EN, ES, FR, ZH, JA, KO), zero-shot cross-lingual for others |
| Style control | Emotion (happy, sad, angry), accent (British, Indian, Australian), speed |

Why decoupling matters: In XTTS-v2 and similar models, the cloned voice inherits the reference speaker’s emotion and accent. If your reference is a sad recording, your clone sounds sad. OpenVoice can take timbre from reference A and emotion from reference B (or from a text description).

# Conceptual: separate timbre and style
timbre = extract_tone_color("reference.wav")
style = extract_style("reference.wav")  # or synthetic: style = "happy, british accent"
output = synthesize(text, timbre=timbre, style=style)

Zero-shot cross-lingual mechanism: Because the base TTS model speaks fluently in multiple languages, and the tone color converter only modifies timbre, the cloned voice inherits the base TTS’s pronunciation. This means:

  • A voice cloned from English can speak French fluently (using the French base TTS)
  • Works for languages the tone color converter never trained on
  • No multilingual training data required for new languages

Limitations:

  • Two-pass pipeline adds a second inference stage, which increases total inference time
  • Tone color converter can introduce artifacts on extreme style combinations
  • Base TTS quality caps overall output quality
  • V2 quality improved significantly but still trails end-to-end models for naturalness

3. ElevenLabs (Proprietary)

Snapshot: cloud service, proprietary model family

ElevenLabs operates at a scale and with a level of investment that no open-source project matches. Its architecture is not public, but enough is known from documentation and analysis to reconstruct a likely design.

Two cloning tiers:

| Feature | Instant Voice Cloning | Professional Voice Cloning |
| --- | --- | --- |
| Mechanism | Few-shot conditioning at inference | Fine-tuning model weights |
| Reference audio | 1-5 minutes | 30+ minutes |
| Turnaround | Seconds | Minutes to hours (training job, queue-dependent) |
| Quality | Good, style-dependent | Excellent, consistent |
| Use case | Quick prototyping | Production, brand voices |

Likely architecture (reconstructed):

Instant VC:
Reference audio → Learned speaker encoder → conditioning vector(s)
Text → Large AR decoder (cross-attends to conditioning) → audio codes
→ Neural codec decoder → waveform

Professional VC:
Base model → LoRA or full fine-tune on target speaker → specialized weights
→ Same AR decoder → consistently cloned output

What is publicly described or commonly reported:

  • Models: multiple hosted TTS and voice models with multilingual support
  • Latency: low-latency options are available, but exact performance depends on model, region, and network
  • Speech-to-Speech: Separate model that transforms input audio while preserving content and emotion
  • Voice Design: Text-prompted voice creation (not cloning, but generative)
  • Voice Library: community and licensed voice workflows are available

Key differentiators:

  • Data scale: proprietary datasets and production feedback loops
  • Model scale: undisclosed
  • Audio quality: strong hosted output in many workflows
  • Emotion control: model- and product-specific controls

Why ElevenLabs is hard to replicate:

  1. Training data — years of curated, high-quality multilingual speech from professional sources
  2. Model architecture — custom-designed, not a stock Llama or GPT2
  3. Scale — likely trained on thousands of GPUs with proprietary infrastructure
  4. Perceptual optimization — models optimized through human evaluation loops, not just loss curves

Limitations:

  • Proprietary — you do not control latency, availability, or pricing
  • API-dependent — no offline inference
  • Cost — scales linearly with usage
  • Voice cloning quality is inconsistent across reference styles

4. CosyVoice (Alibaba / FunAudioLLM)

Snapshot: actively developed open model family | License: verify current release terms | Scale: varies by release

CosyVoice is an actively developed open-source voice cloning model family. Its architecture uses supervised semantic tokens — a key innovation.

Architecture (CosyVoice 2):

Text → [LLM (autoregressive)] → supervised semantic tokens
Semantic tokens → [Flow Matching (non-autoregressive)] → mel-spectrogram
→ [HiFi-GAN] → waveform

Speaker conditioning:
Reference audio → Speaker encoder → embedding → conditions LLM + Flow Matching

The supervised semantic token innovation: Unlike XTTS-v2 (which uses unsupervised VQ-VAE tokens) and Orpheus (which uses SNAC codec tokens), CosyVoice uses tokens derived from a multilingual speech recognition (ASR) model. These tokens have explicit alignment to text phonemes, which means:

  • The LLM does not need to learn text-to-phoneme alignment — it is built into the token representation
  • Most speaker identity and fine acoustic detail is left out of these tokens; the flow-matching decoder restores it, conditioned on the speaker embedding
  • Cross-lingual cloning works because the semantic tokens are largely language-agnostic

Division of labor:
Semantic tokens (LLM output): ASR-derived, phone-like content
Flow matching + speaker embedding: timbre, prosody, style
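
As a rough picture of how ASR-derived tokens come about, the sketch below quantizes encoder frames to their nearest codebook entry. It is a generic vector-quantization illustration with an invented 2-D codebook and invented frames; the actual CosyVoice tokenizer, its codebook size, and its quantizer type are not reproduced here.

```python
def quantize(frames, codebook):
    """Map each feature frame to the index of its nearest codebook vector."""
    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return [
        min(range(len(codebook)), key=lambda k: sq_dist(frame, codebook[k]))
        for frame in frames
    ]

# Hypothetical codebook and hypothetical ASR-encoder output frames
codebook = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
asr_frames = [[0.1, -0.1], [0.9, 0.2], [0.8, 0.1], [0.2, 1.1]]
semantic_tokens = quantize(asr_frames, codebook)  # [0, 1, 1, 2]
```

Because the encoder was trained to predict text, frames that sound alike phonetically tend to land on the same code, which is what makes the resulting token stream content-focused.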

CosyVoice 3 upgrades:

| Feature | CosyVoice 2 | CosyVoice 3 |
| --- | --- | --- |
| Parameters | 0.5B | 1.5B |
| Training data | ~10K hours | 1M hours |
| Languages | 4 | 9 + 18 Chinese dialects |
| Tokenizer | ASR-based | Multi-task (ASR + emotion + language ID + audio events + speaker analysis) |
| Streaming | ~150ms | ~150ms |
| Post-training | None | DiffRO (Differentiable Reward Optimization) |
| Pronunciation control | No | Yes (Pinyin + CMU phoneme inpainting) |

CosyVoice 3 tokenizer training: The novel tokenizer is trained on multiple tasks simultaneously:

  • ASR (content accuracy)
  • Speech emotion recognition (prosody preservation)
  • Language identification (multilingual consistency)
  • Audio event detection (non-speech sounds)
  • Speaker analysis (timbre preservation)

This produces tokens that carry more paralinguistic information than an ASR-only tokenizer, which is intended to improve prosody and emotion transfer during cloning.

Zero-shot cloning usage:

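# Import paths and argument names follow the CosyVoice 2 README; verify them against your checkout.
from cosyvoice.cli.cosyvoice import CosyVoice2
from cosyvoice.utils.file_utils import load_wav
import torchaudio

model = CosyVoice2('pretrained_models/CosyVoice2-0.5B')
ref_audio_16k = load_wav('/path/to/reference.wav', 16000)  # reference clip at 16 kHz

# inference_zero_shot yields one dict per generated segment
for i, chunk in enumerate(model.inference_zero_shot(
    tts_text="Hello, this is a cloned voice.",
    prompt_text="Reference text spoken in the prompt audio",
    prompt_speech_16k=ref_audio_16k,
)):
    torchaudio.save(f"output_{i}.wav", chunk['tts_speech'], model.sample_rate)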

Limitations:

  • Relatively large (0.5B-1.5B) compared to lightweight models
  • ASR-dependent tokenizer ties quality to ASR model quality
  • Flow matching decoder adds ~50-100ms additional latency over pure AR

5. GPT-SoVITS (RVC-Boss)

Snapshot: 2024-era open-source system | License: verify current terms | Scale: multi-component model

GPT-SoVITS is a popular few-shot voice cloning system in the open-source community. Its defining characteristic is support for short-reference and few-shot workflows, depending on setup and quality target.

Architecture:

Stage 1: GPT (Text → Semantic Tokens)
Text → Phonemes + BERT features → AR transformer (GPT) → semantic tokens
Reference audio → CNHuBERT → SSL features → conditions GPT

Stage 2: SoVITS (Semantic Tokens → Waveform)
Semantic tokens + Reference spectrogram → VITS-based decoder → waveform

Two-stage pipeline:

Raw text
  → Text processing: phoneme conversion + BERT feature extraction (1024-dim)
  → Combined with CNHuBERT SSL features from reference (768-dim)
  → GPT autoregressive decoder → semantic token sequence
  → SoVITS acoustic model (VITS variant) → mel-spectrogram
  → Neural vocoder → waveform

Key components:

| Component | Detail |
| --- | --- |
| GPT model | Autoregressive transformer, predicts semantic tokens |
| SoVITS model | VITS-based, with improved posterior encoder and flow-based decoder |
| BERT encoder | Chinese RoBERTa (1024-dim), contextual text embeddings |
| CNHuBERT | Chinese HuBERT, SSL features from reference audio |
| Vocoder | V3/V4: HiFi-GAN or neural vocoder (varies by version) |
| Min reference (zero-shot) | 5 seconds |
| Min reference (few-shot) | 1 minute |
| Languages | ZH, EN, JP, KO, Cantonese |
| Cross-lingual | Yes: clone from one language, generate in another |
| v2 ProPlus RTF | 0.028 (4060Ti), 0.014 (4090), 0.526 (M4 CPU) |
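
For readers unfamiliar with RTF (real-time factor), it is synthesis time divided by the duration of the audio produced, so values below 1.0 are faster than real time. The short sketch below applies that definition to the figures quoted in the table; the helper names are made up for the example.

```python
def synthesis_seconds(rtf, audio_seconds):
    """Time needed to generate `audio_seconds` of speech at a given RTF."""
    return rtf * audio_seconds

def is_faster_than_realtime(rtf):
    return rtf < 1.0

# RTF values as quoted in the table above
reported_rtf = {"4060Ti": 0.028, "4090": 0.014, "M4 CPU": 0.526}
seconds_for_one_minute = {
    device: synthesis_seconds(rtf, 60.0) for device, rtf in reported_rtf.items()
}
```

At these figures, one minute of speech takes roughly 1.7 s on the 4060Ti, 0.8 s on the 4090, and 32 s on the M4 CPU. RTF says nothing about time to first audio, which matters separately for interactive use.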

Few-shot fine-tuning workflow:

1. Record/split ~1 min of clean reference audio
2. ASR alignment (automatic — built into WebUI)
3. Fine-tune GPT + SoVITS models (10-30 min on consumer GPU)
4. Inference with fine-tuned weights

Why it works with so little data: The separation of semantic and acoustic generation is key. The GPT model only needs to learn the text-to-semantic-token mapping (which generalizes from pre-training). The SoVITS model adapts the acoustic details using the reference spectrogram as a guide. The CNHuBERT SSL features provide a strong prior on speaker identity without requiring many parameters.

Limitations:

  • Two-stage inference is non-streaming (generate tokens → decode audio)
  • Quality degrades on long text without chunking
  • BERT and CNHuBERT dependencies add model load time
  • Primarily optimized for Chinese; English quality trails dedicated English models
  • Complex dependency chain (multiple models must be loaded)
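
One common mitigation for the long-text limitation is to split input at sentence boundaries and synthesize chunk by chunk. The helper below is a generic sketch, not GPT-SoVITS's built-in text splitter; `chunk_text` and the `max_chars` limit are assumptions for illustration, and the empty-match split it relies on needs Python 3.7 or later.

```python
import re

# Split after Latin sentence punctuation followed by whitespace,
# or directly after CJK sentence punctuation (no whitespace needed).
_SENTENCE_BOUNDARY = re.compile(r"(?<=[.!?])\s+|(?<=[。！？])")

def chunk_text(text, max_chars=120):
    """Group sentences into chunks of at most `max_chars` characters.

    A single sentence longer than `max_chars` is kept whole.
    """
    sentences = [s for s in _SENTENCE_BOUNDARY.split(text.strip()) if s]
    chunks, current = [], ""
    for sentence in sentences:
        candidate = f"{current} {sentence}".strip()
        if current and len(candidate) > max_chars:
            chunks.append(current)
            current = sentence
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks

chunks = chunk_text("First sentence. Second one is here! Third?", max_chars=20)
# chunks == ["First sentence.", "Second one is here!", "Third?"]
```

Each chunk can then be synthesized with the same reference audio and the resulting waveforms concatenated, ideally with a short pause between them.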

6. Fish Speech (Fish Audio)

Snapshot: Fish Speech / Fish Audio model family | License: verify release-specific terms | Scale: varies by release

Fish Speech’s distinctive architecture is the Dual-AR (Dual Autoregressive) design — two transformers running at different granularities.

Dual-AR Architecture:

Text → Slow Transformer (~4B) → Primary codebook tokens (temporal structure)
Primary tokens → Fast Transformer (~400M) → Residual codebook tokens (acoustic detail)
Combined tokens → Firefly-GAN vocoder → waveform

Why dual autoregressive: Standard single-AR models must predict all codebook levels at each step, which creates a conflict: the coarse semantic structure and the fine acoustic detail compete for modeling capacity.

Fish Speech solves this by separating the problem:

Slow AR (temporal modeling):
  "I am speaking this sentence" → primary token per frame
  Focus: content, prosody, pacing

Fast AR (acoustic modeling):
  Primary tokens → residual tokens (fine detail)
  Focus: timbre, breathiness, articulation

S2 Pro elevates this further with a ~4B slow AR and ~400M fast AR, trained on millions of hours of data.
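
The control flow of a dual-AR decoder can be sketched with two toy functions standing in for the transformers. Everything here is hypothetical (the arithmetic "models", the token ranges, three residual codebooks); it only shows the loop structure: one slow-model call per frame, then several cheap fast-model calls for that frame's residual tokens.

```python
def slow_step(context):
    """Toy slow model: the next primary token depends on the whole context."""
    return sum(context) % 7

def fast_step(primary, residuals_so_far):
    """Toy fast model: the next residual token depends only on the current frame."""
    return (primary * 3 + len(residuals_so_far) + sum(residuals_so_far)) % 5

def dual_ar_generate(prompt_tokens, num_frames, num_residual=3):
    context = list(prompt_tokens)
    frames = []
    for _ in range(num_frames):
        primary = slow_step(context)        # one slow-model call per frame
        residuals = []
        for _ in range(num_residual):       # several fast-model calls per frame
            residuals.append(fast_step(primary, residuals))
        frames.append([primary] + residuals)
        context.append(primary)             # slow context grows by one token per frame
    return frames

frames = dual_ar_generate(prompt_tokens=[3, 1, 4], num_frames=4)
```

The slow model's sequence length grows with the number of frames, while the fast model only ever sees one frame's worth of tokens, which is why it can be much smaller.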

Multilingual training scale (V1.5):

| Language | Training Data | WER/CER |
| --- | --- | --- |
| English | 300K+ hours | 3.5% WER |
| Chinese | 300K+ hours | 1.3% CER |
| Japanese | 100K+ hours | |
| German, French, Spanish, Korean, Arabic, Russian | ~20K hours each | |
| Dutch, Italian, Polish, Portuguese | <10K hours each | |

S2 Pro upgrades:

| Feature | V1.5 | S2 Pro |
| --- | --- | --- |
| Slow AR | ~500M | ~4B |
| Fast AR | ~100M | ~400M |
| Training data | 1M hours | 10M+ hours |
| Languages | 13 | 80+ |
| Voice cloning | Zero-shot (10-15s) | Zero-shot (10-30s) |
| Latency | ~200ms | ~100ms (via SGLang) |
| ELO (TTS Arena) | 1011 | 1339 |
| Multi-speaker | No | Yes (native via <\|speaker:i\|> tokens) |

Voice cloning mechanism: Fish Speech uses in-context learning. Reference audio is encoded through the same codec pipeline, producing token sequences that are included as context in the autoregressive generation. The Slow AR attends to the reference token patterns to reproduce the speaker’s timbre and style.

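# Conceptual: S2 Pro voice cloning (all names are placeholders)
ref_tokens = encode_audio(reference_wav)  # 10-30 seconds
prompt = ref_tokens + text_tokens
primary_tokens = slow_ar.generate(prompt)            # coarse structure, one token per frame
residual_tokens = fast_ar.generate(primary_tokens)   # fine acoustic detail for each frame
waveform = firefly_gan.decode(primary_tokens, residual_tokens)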

Firefly-GAN vocoder: Fish Speech’s vocoder uses depthwise/dilated convolutions with grouped scalar vector quantization. It achieves near 100% codebook utilization (unlike RVQ codecs which typically leave many codes unused), enabling more expressive and detailed audio output.
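
Codebook utilization, the metric referred to above, is simply the fraction of codebook entries that actually appear in encoded audio. A minimal sketch with made-up token streams and an assumed codebook size of 8:

```python
def codebook_utilization(tokens, codebook_size):
    """Fraction of codebook entries that appear at least once in `tokens`."""
    used = {t for t in tokens if 0 <= t < codebook_size}
    return len(used) / codebook_size

# Hypothetical token streams from two different quantizers
tokens_a = [0, 1, 2, 3, 4, 5, 6, 7, 0, 1]
tokens_b = [0, 0, 0, 1, 1, 0, 0, 1, 0, 0]
util_a = codebook_utilization(tokens_a, codebook_size=8)  # 1.0
util_b = codebook_utilization(tokens_b, codebook_size=8)  # 0.25
```

In practice this is measured over a large held-out corpus rather than a single clip, since a short utterance cannot touch every code.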

Limitations:

  • CC-BY-NC-SA license prevents commercial use (V1.5)
  • S2 Pro is effectively proprietary (weights available, inference via API)
  • Dual-AR adds architectural complexity vs single-model approaches
  • S2 Pro requires significant GPU resources at ~4.4B parameters

7. Orpheus (Canopy Labs)

Snapshot: 2025-era model family | License: verify current terms | Scale: large local model

Covered in depth in the Orpheus deep dive, but relevant here for its cloning approach.

Voice cloning mechanism: Orpheus does not use a speaker encoder. Instead, voice cloning is done through fine-tuning the Llama-3.2-3B backbone on target speaker data. The finetuned model has 8 preset voices; custom voices require additional fine-tuning.

Preset voices: tara, leah, jess, leo, dan, mia, zac, zoe
→ Built from fine-tuning on ~50-300 examples per voice

Custom voice cloning:
→ Collect 50-300 audio examples per speaker
→ Fine-tune Llama backbone (standard HuggingFace Trainer)
→ Model learns to associate the speaker's acoustic patterns with text

The pretrained model supports a limited form of in-context conditioning by including text-speech pairs in the prompt, but this is less reliable than dedicated speaker encoder approaches.

Key specs:

  • 8 preset voices (English)
  • 7 language pairs in multilingual research release
  • 3B parameters — requires GPU
  • Apache 2.0 license
  • Fine-tuning data: 50-300 examples per speaker recommended

8. Qwen3-TTS (Alibaba / Qwen Team)

Snapshot: 2026-era model family | License: verify current terms | Scale: varies by release

Covered in depth in the Qwen3-TTS deep dive. Its voice cloning approach is distinct.

Voice cloning mechanism: Qwen3-TTS uses a learnable speaker encoder trained jointly with the dual-track LM backbone. Reference audio (3 seconds) is encoded through the Qwen-TTS-Tokenizer, and a speaker embedding is extracted and used to condition every generation step.

Reference audio (3s) → Qwen-TTS-Tokenizer → speech codes
                     → Learnable speaker encoder → speaker embedding (conditions LM)

Dual-track LM: text + speaker embedding → 12Hz codec codes → causal ConvNet → waveform

Evaluation points:

  • Short-reference cloning behavior
  • Supported languages
  • Streaming latency on target hardware
  • Speaker similarity across long passages
  • Text-description-based voice design, if available in the release

9. Chatterbox (Resemble AI)

Snapshot: 2025-era model family | License: verify current terms | Scale: medium local model

Covered in the Chatterbox deep dive. Chatterbox takes a three-stage approach to cloning.

Voice cloning mechanism:

Reference audio (5s+)
  → S3 tokenizer → 150 speech tokens (conditioning prompt)
  → CAMPPlus speaker encoder → 256-dim x-vector (speaker embedding)
  → Both condition the T3 Llama backbone

T3 (AR): text + conditioning → S3 tokens
S3Gen (CFM): S3 tokens → mel-spectrogram
HiFi-GAN: mel → waveform

Evaluation points:

  • Short-reference cloning behavior
  • English and multilingual release differences
  • Expressiveness controls available in the current release
  • License terms
  • Provenance or watermarking support, if available

Comparison Table

| Model | Cloning Method | Ref Audio | Languages | Scale | License | Streaming | Vocoder | Snapshot |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| XTTS-v2 | Conditioning (Perceiver) | Short to medium | Multilingual | Large | Verify | Runtime-dependent | HiFi-GAN | 2023 |
| OpenVoice V2 | Conditioning (decoupled) | Short | Multilingual | Smaller | Verify | Runtime-dependent | HiFi-GAN | 2024 |
| ElevenLabs | Conditioning + fine-tune products | Product-dependent | Multilingual | Unknown | Proprietary | Cloud/network dependent | Proprietary | 2023+ |
| CosyVoice | Conditioning (ASR tokens) | Short to medium | Multilingual | Varies | Verify | Runtime-dependent | HiFi-GAN | 2024+ |
| GPT-SoVITS | Fine-tune / few-shot | Short to medium | Multilingual | Multi-component | Verify | Runtime-dependent | VITS/HiFi-GAN | 2024 |
| Fish Speech | In-context (Dual-AR) | Short to medium | Multilingual | Varies | Verify | Runtime-dependent | Firefly-GAN | 2024+ |
| Orpheus | Fine-tune / prompted variants | Release-dependent | English-focused / research multilingual variants | Large | Verify | Runtime-dependent | SNAC | 2025 |
| Qwen3-TTS | Conditioning (speaker encoder) | Short-reference claims to verify | Multilingual releases | Varies | Verify | Runtime-dependent | Release-dependent | 2026 |
| Chatterbox | Conditioning | Short-reference cloning | English-focused / release-dependent | Medium | Verify | Runtime-dependent | HiFi-GAN | 2025 |

Cloning method legend: Conditioning = speaker embedding extracted at inference; Fine-tune = weights updated per speaker; In-context = audio tokens in prompt


Decision Guide: Which Model to Use

| Your Priority | Candidate Models | What to Verify |
| --- | --- | --- |
| Quality | ElevenLabs, Fish Speech, Qwen-family, Chatterbox | Current output quality on your text and voice |
| Language coverage | Fish Speech, CosyVoice, Qwen-family, cloud services | Current model card and target-language pronunciation |
| Small local setup | OpenVoice, lightweight local models | Hardware needs and acceptable quality |
| Expressiveness | Chatterbox, Orpheus-style, cloud services | Supported controls in the exact release |
| Low latency | Cloud realtime products, Qwen-family, optimized local runtimes | Measured latency on target hardware/network |
| Few-shot fine-tuning | GPT-SoVITS, adaptation-based systems | Data requirement, setup complexity, license |
| Commercial use | CosyVoice, OpenVoice, Chatterbox, other permissive releases | Current license and model provenance |
| Voice design | Qwen-family or cloud voice design products | Whether it is cloning, voice design, or both |
| Privacy / offline | OpenVoice, CosyVoice, Chatterbox, XTTS-style local tools | Whether all processing stays local |
| Cross-lingual cloning | OpenVoice, XTTS-style, cloud services | Target-language quality and consent scope |

Common Failure Modes Across All Models

| Failure Mode | Cause | Affected Models |
| --- | --- | --- |
| Timbre drift | Reference too short or noisy | All models |
| Style bleeding | Cloned voice inherits undesired emotion from reference | XTTS-v2, CosyVoice, Chatterbox |
| Cross-lingual accent | Language-specific phone inventory mismatch | XTTS-v2, GPT-SoVITS |
| Repetition loops | AR model gets stuck on tokens | Orpheus, GPT-SoVITS, Fish Speech |
| Robotic prosody | Conditioning vector loses prosodic nuance | OpenVoice (base TTS bottleneck) |
| Poor consistency | Embedding varies between generations | XTTS-v2 (Perceiver instability) |
| Vocoder artifacts | Out-of-distribution acoustic features | All GAN-based vocoders |
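
Repetition loops are one of the easier failure modes to catch programmatically. The check below is a simple heuristic sketch, not taken from any of the models above: it flags a token sequence whose tail is the same n-gram repeated back to back. The function name and the default thresholds are assumptions to tune per model.

```python
def has_repetition_loop(tokens, ngram=4, max_repeats=3):
    """Return True if the sequence ends with one n-gram repeated
    back to back at least `max_repeats` times."""
    needed = ngram * max_repeats
    if len(tokens) < needed:
        return False
    tail = tokens[-needed:]
    pattern = tail[:ngram]
    return all(tail[i:i + ngram] == pattern for i in range(0, needed, ngram))

healthy = [5, 9, 2, 7, 1, 8, 3, 6, 4, 0, 2, 9]
stuck = [5, 9, 2, 7] + [1, 8, 3, 6] * 3

healthy_flag = has_repetition_loop(healthy)  # False
stuck_flag = has_repetition_loop(stuck)      # True
```

Run inside the decoding loop, a check like this lets the caller stop early and retry with a different sampling temperature or seed. Legitimately repetitive audio such as long silences can trip it, so the thresholds need care.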

Summary

Voice cloning models fall into three technical categories (speaker adaptation through fine-tuning, speaker conditioning, and in-context learning), and the best choice depends on your constraints. The open-source landscape in 2026 offers several useful options, but model cards, licenses, language support, and runtime behavior change quickly. Verify the exact release before using any model commercially.

The proprietary frontier can still be strong on hosted quality and production tooling, while local models keep improving for privacy-sensitive and offline workflows.
