Voice Cloning Models: How They Work — A Survey of Major Approaches

Voice cloning is the ability to synthesize speech that sounds like a specific person, given a reference audio sample of that person speaking. It is distinct from multi-speaker TTS (which selects from a fixed set of pre-enrolled voices) and from voice conversion (which transforms one speaker’s voice into another’s while preserving content).

This article surveys several major voice cloning models — open-source and proprietary — covering how each approaches the problem architecturally, what tradeoffs they make, and where the field stands as of 2026.

The Three Core Approaches

Most voice cloning systems can be understood through one of three categories:

Approach	How It Works	Reference Audio Needed	Quality	Use Case
Speaker Adaptation	Fine-tune model weights on target voice	Often minutes or more	High consistency when tuned well	Production voices, consistent output
Speaker Conditioning	Extract embedding, condition generation	Often seconds to minutes	Strong for quick cloning	Quick cloning, instant use
In-Context Learning	Prompt model with reference tokens	Often short samples	Model-dependent	Zero-shot or few-shot workflows

Speaker Adaptation

The model receives additional training on target-speaker audio. Gradients update the weights to specialize for that voice.

Base model weights → Fine-tune on target audio (100-1000 steps) → Specialized weights

Pros: Highest quality, most consistent across diverse text. Captures subtle prosodic patterns.

Cons: Takes minutes to hours of training. Requires GPU. Storage grows with every speaker, since each voice needs its own weights or adapter. Cannot switch speakers without reloading.

Used by: ElevenLabs Professional Voice Cloning, GPT-SoVITS (few-shot mode), Orpheus fine-tuning, Coqui XTTS fine-tuning.
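
As a toy analogy for "gradients update the weights", the sketch below fine-tunes a two-parameter linear model on three target examples. It is not a TTS model: the `fine_tune` helper, the data, the step count, and the learning rate are all made up for illustration. The point is only that the base weights stay untouched while a specialized copy is produced.

```python
# Toy analogy for speaker adaptation: start from "base" weights and take a few
# hundred gradient steps on a small target dataset. Not a real TTS model.

def predict(weights, features):
    return sum(w * x for w, x in zip(weights, features))

def mse(weights, data):
    return sum((predict(weights, x) - y) ** 2 for x, y in data) / len(data)

def fine_tune(base_weights, data, steps=500, lr=0.05):
    """Return a specialized copy of base_weights; the base list is not modified."""
    weights = list(base_weights)
    for _ in range(steps):
        grads = [0.0] * len(weights)
        for x, y in data:
            err = predict(weights, x) - y
            for i, xi in enumerate(x):
                grads[i] += 2 * err * xi / len(data)
        weights = [w - lr * g for w, g in zip(weights, grads)]
    return weights

base = [0.5, 0.5]                       # stand-in for pretrained weights
target_data = [([1.0, 0.0], 0.9),       # stand-in for target-speaker examples
               ([0.0, 1.0], 0.1),
               ([1.0, 1.0], 1.0)]
tuned = fine_tune(base, target_data)
print(f"loss before: {mse(base, target_data):.4f}, after: {mse(tuned, target_data):.6f}")
```

Because `tuned` is a separate set of weights, serving many speakers this way means storing and loading one such set per voice.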

Speaker Conditioning

The model extracts a fixed-dimensional vector (embedding) from the reference audio and injects it as a conditioning signal at generation time. No weight updates.

Reference audio → Speaker encoder → d-vector / embedding → Conditions AR decoder
Text → AR decoder (conditioned) → Audio tokens → Vocoder → Waveform

Pros: Instant — no waiting. Single model handles unlimited speakers. Switch speakers between sentences.

Cons: Quality capped by encoder capacity. Less consistent across diverse text styles. Sensitive to reference audio quality.

Used by: XTTS-v2, OpenVoice, CosyVoice, Chatterbox, Qwen3-TTS.
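
The key property of conditioning is that a reference of any length collapses to a fixed-size vector. The sketch below shows that with a deliberately crude encoder: mean pooling over per-frame features. Real speaker encoders are learned networks; `speaker_embedding`, the three-dimensional features, and the sample values are assumptions made for illustration.

```python
import math

def speaker_embedding(frames):
    """Mean-pool per-frame features into one fixed-size, unit-length vector."""
    dim = len(frames[0])
    mean = [sum(f[i] for f in frames) / len(frames) for i in range(dim)]
    norm = math.sqrt(sum(v * v for v in mean)) or 1.0
    return [v / norm for v in mean]

def cosine(a, b):
    # Inputs are already unit-length, so the dot product is the cosine similarity.
    return sum(x * y for x, y in zip(a, b))

short_ref = [[0.9, 0.1, 0.2], [1.1, 0.0, 0.1]]    # 2 frames
long_ref = short_ref * 50                          # 100 frames of the same "voice"
other_voice = [[0.1, 1.0, 0.3], [0.0, 0.9, 0.2]]

emb_short = speaker_embedding(short_ref)
emb_long = speaker_embedding(long_ref)
emb_other = speaker_embedding(other_voice)

print(len(emb_short), len(emb_long))               # same size regardless of clip length
print(cosine(emb_short, emb_long) > cosine(emb_short, emb_other))
```

Switching speakers then amounts to swapping one small vector, which is why a single model can serve many voices without reloading.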

In-Context Learning

The model is a language model trained on interleaved text and audio tokens. Cloning is done by including reference audio tokens in the prompt, similar to few-shot prompting in LLMs.

Prompt: [ref_audio_tokens] [ref_text] [gen_marker] [target_text]
→ LLM autoregressively generates target audio tokens

Pros: No explicit speaker encoder needed. Can leverage any number of reference examples. Emergent cross-lingual transfer.

Cons: Prompt length grows with reference. May overfit to reference prosody. Sensitive to prompt formatting.

Used by: Orpheus (pretrained mode), CosyVoice (zero-shot mode), Fish Speech.
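
The prompt layout above can be sketched as plain list concatenation. The `GEN_MARKER` token, the whitespace "tokenizer", and the integer audio tokens are hypothetical placeholders; real models define their own vocabularies and special tokens. The sketch mainly shows why prompt length grows with the reference.

```python
GEN_MARKER = "<gen>"  # hypothetical marker token; real models define their own

def build_prompt(ref_audio_tokens, ref_text, target_text):
    """Lay out [ref_audio_tokens] [ref_text] [gen_marker] [target_text] as one token list."""
    return list(ref_audio_tokens) + ref_text.split() + [GEN_MARKER] + target_text.split()

# Same text, two different reference lengths.
short_prompt = build_prompt([101, 102, 103], "hello there", "clone this sentence")
long_prompt = build_prompt(list(range(100, 400)), "hello there", "clone this sentence")

print(len(short_prompt), len(long_prompt))
```

Every extra second of reference audio adds tokens to the context, so longer references cost both memory and time at generation.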

Model-by-Model Architecture Survey

1. XTTS-v2 (Coqui)

Snapshot: 2023-era release | License: verify current terms before commercial use | Scale: large open model

XTTS-v2 is a GPT2-based autoregressive model with a Perceiver-based speaker conditioning mechanism.

Architecture:

Reference audio (6s+) → Mel-spectrogram → Perceiver encoder → 32 latent vectors
Text → GPT2 backbone (cross-attends to latents)
→ Discrete audio tokens (VQ-VAE codes) → HiFi-GAN → 24kHz waveform

Key components:

Component	Detail
Backbone	GPT2 (decoder-only transformer)
Speaker encoder	Perceiver — inputs mel-spectrogram, outputs 32 fixed latent vectors
Conditioning	Latent vectors prefixed to GPT2 input sequence (similar to a soft prompt)
Audio tokenizer	VQ-VAE with discrete codebook
Vocoder	HiFi-GAN
Languages	17 (en, es, fr, de, it, pt, pl, tr, ru, nl, cs, ar, zh-cn, ja, hu, ko, hi)
Min reference	~6 seconds
Streaming	~150ms latency (Pure PyTorch, consumer GPU)

Perceiver mechanism detail: The Perceiver replaces the simpler encoder used in XTTS-v1. Instead of a single vector, it produces 32 latent vectors that capture different aspects of the speaker’s voice. These vectors are prepended to the GPT2’s input sequence as a “soft prompt.” This design:

Allows multiple reference audio clips (concatenated before encoding)
Enables speaker interpolation (blending two references)
Produces more consistent speaker identity than single-vector approaches
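
A minimal sketch of why a Perceiver-style encoder yields a fixed number of latents: a fixed set of query vectors cross-attends over however many input frames there are. This is a single attention read with no learned projections, so it is far simpler than the real encoder. The dimensions, the random "mel frames", and the `blend` helper (one plausible way to interpolate speakers) are assumptions for illustration.

```python
import math
import random

def cross_attend(latents, frames):
    """One cross-attention read: each latent query attends over all input frames."""
    dim = len(frames[0])
    scale = 1.0 / math.sqrt(dim)
    out = []
    for q in latents:
        scores = [scale * sum(a * b for a, b in zip(q, f)) for f in frames]
        peak = max(scores)                                  # subtract max for numerical stability
        weights = [math.exp(s - peak) for s in scores]
        total = sum(weights)
        out.append([sum(w * f[i] for w, f in zip(weights, frames)) / total
                    for i in range(dim)])
    return out

def blend(latents_a, latents_b, alpha=0.5):
    """Linear interpolation between two speakers' latent sets."""
    return [[(1 - alpha) * x + alpha * y for x, y in zip(ra, rb)]
            for ra, rb in zip(latents_a, latents_b)]

rng = random.Random(0)
DIM, NUM_LATENTS = 8, 32
# In a real model the queries are learned parameters; here they are random.
queries = [[rng.gauss(0, 1) for _ in range(DIM)] for _ in range(NUM_LATENTS)]
clip_a = [[rng.gauss(0, 1) for _ in range(DIM)] for _ in range(120)]   # stand-in mel frames
clip_b = [[rng.gauss(0, 1) for _ in range(DIM)] for _ in range(300)]

latents_a = cross_attend(queries, clip_a)
latents_b = cross_attend(queries, clip_b)
latents_both = cross_attend(queries, clip_a + clip_b)    # multiple clips, concatenated
mixed = blend(latents_a, latents_b)

print(len(latents_a), len(latents_b), len(latents_both))  # always NUM_LATENTS
```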

XTTS-v2 pipeline:

from TTS.api import TTS

tts = TTS("tts_models/multilingual/multi-dataset/xtts_v2", gpu=True)
tts.tts_to_file(
    text="Hello, this is a cloned voice.",
    speaker_wav="/path/to/reference.wav",
    language="en",
    file_path="output.wav"
)

Limitations:

Non-commercial license (CPML)
Perceiver occasionally produces unstable speaker embeddings with very short (<3s) references
Coqui AI shut down in 2024 — no further development; community forks exist

2. OpenVoice V2 (MyShell / MIT)

Snapshot: V1/V2-era releases | License: verify current terms | Scale: smaller local model family

OpenVoice’s defining innovation is decoupling tone color from voice style. Most cloning systems couple these — the cloned voice inherits both the timbre and the prosody/accent/emotion of the reference. OpenVoice separates them.

Architecture:

Reference audio → Tone Color Encoder → tone color embedding
Target text → Base Speaker TTS → neutral mel-spectrogram
→ Tone Color Converter (VS network) → mel with target timbre
→ HiFi-GAN → waveform
Style params (emotion, accent, speed) → Style Encoder → style embedding

Decoupled design:

The OpenVoice pipeline runs in two stages:
1. A single-speaker TTS generates neutral speech from text
2. A tone color converter transplants the reference speaker's timbre
   onto the neutral speech, guided by the style embedding

Key components:

Component	Detail
Base TTS	Single-speaker model (trained on ~30K sentences from 4 speakers)
Tone color encoder	Extracts speaker embedding (timbre only, stripped of style)
Style encoder	Extracts emotion/accent from reference
Tone color converter	Visinger (VS) network with flow layers — transplants timbre
Vocoder	HiFi-GAN
Min reference	~1 second
Languages	6 native (EN, ES, FR, ZH, JA, KO), zero-shot cross-lingual for others
Style control	Emotion (happy, sad, angry), accent (British, Indian, Australian), speed

Why decoupling matters: In XTTS-v2 and similar models, the cloned voice inherits the reference speaker’s emotion and accent. If your reference is a sad recording, the clone sounds sad. OpenVoice can take timbre from reference A and emotion from reference B (or from a text description).

# Conceptual: separate timbre and style
timbre = extract_tone_color("reference.wav")
style = extract_style("reference.wav")  # or synthetic: style = "happy, british accent"
output = synthesize(text, timbre=timbre, style=style)

Zero-shot cross-lingual mechanism: Because the base TTS model speaks fluently in multiple languages, and the tone color converter only modifies timbre, the cloned voice inherits the base TTS’s pronunciation. This means:

A voice cloned from English can speak French fluently (using the French base TTS)
Works for languages the tone color converter never trained on
No multilingual training data required for new languages

Limitations:

Two-pass pipeline doubles inference time
Tone color converter can introduce artifacts on extreme style combinations
Base TTS quality caps overall output quality
V2 quality improved significantly but still trails end-to-end models for naturalness

3. ElevenLabs (Proprietary)

Snapshot: cloud service, proprietary model family

ElevenLabs operates at a scale and with a level of investment that few, if any, open-source projects match. Its architecture is not public, but enough is known from documentation and analysis to reconstruct a likely design.

Two cloning tiers:

Feature	Instant Voice Cloning	Professional Voice Cloning
Mechanism	Few-shot conditioning at inference	Fine-tuning model weights
Reference audio	1-5 minutes	30+ minutes
Turnaround	Seconds	Minutes
Quality	Good, style-dependent	Excellent, consistent
Use case	Quick prototyping	Production, brand voices

Likely architecture (reconstructed):

Instant VC:
Reference audio → Learned speaker encoder → conditioning vector(s)
Text → Large AR decoder (cross-attends to conditioning) → audio codes
→ Neural codec decoder → waveform

Professional VC:
Base model → LoRA or full fine-tune on target speaker → specialized weights
→ Same AR decoder → consistently cloned output
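
The reconstruction above mentions LoRA as one possibility. Nothing here is drawn from an ElevenLabs disclosure; the sketch only illustrates the general LoRA idea, where a frozen weight matrix W gets a trainable low-rank update B·A, so per-speaker storage scales with rank × (d_in + d_out) instead of d_in × d_out. The hidden size and rank are hypothetical.

```python
def full_params(d_in, d_out):
    """Parameters stored per speaker if the whole matrix is fine-tuned."""
    return d_in * d_out

def lora_params(d_in, d_out, rank):
    """Parameters stored per speaker with LoRA: A is rank x d_in, B is d_out x rank."""
    return rank * (d_in + d_out)

def matvec(M, v):
    return [sum(m * vi for m, vi in zip(row, v)) for row in M]

def lora_forward(W, A, B, x, scale=1.0):
    """y = W x + scale * B (A x), with W frozen and only A and B trained."""
    base = matvec(W, x)
    delta = matvec(B, matvec(A, x))
    return [b + scale * d for b, d in zip(base, delta)]

d = 4096                                   # hypothetical hidden size
ratio = lora_params(d, d, rank=16) / full_params(d, d)
print(f"LoRA stores {ratio:.2%} of the full matrix per speaker")

# Tiny worked example: identity base weights plus a rank-1 update.
W = [[1.0, 0.0], [0.0, 1.0]]
A = [[1.0, 0.0]]                           # 1 x 2
B = [[0.0], [1.0]]                         # 2 x 1
y = lora_forward(W, A, B, [3.0, 4.0])
```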

What is publicly described or commonly reported:

Models: multiple hosted TTS and voice models with multilingual support
Latency: low-latency options are available, but exact performance depends on model, region, and network
Speech-to-Speech: Separate model that transforms input audio while preserving content and emotion
Voice Design: Text-prompted voice creation (not cloning, but generative)
Voice Library: community and licensed voice workflows are available

Key differentiators:

Data scale: proprietary datasets and production feedback loops
Model scale: undisclosed
Audio quality: strong hosted output in many workflows
Emotion control: model- and product-specific controls

Why ElevenLabs is hard to replicate:

Training data — years of curated, high-quality multilingual speech from professional sources
Model architecture — custom-designed, not a stock Llama or GPT2
Scale — likely trained on thousands of GPUs with proprietary infrastructure
Perceptual optimization — models optimized through human evaluation loops, not just loss curves

Limitations:

Proprietary — you do not control latency, availability, or pricing
API-dependent — no offline inference
Cost — scales linearly with usage
Voice cloning quality is inconsistent across reference styles

4. CosyVoice (Alibaba / FunAudioLLM)

Snapshot: actively developed open model family | License: verify current release terms | Scale: varies by release

CosyVoice is an actively developed open-source voice cloning model family. Its architecture uses supervised semantic tokens — a key innovation.

Architecture (CosyVoice 2):

Text → [LLM (autoregressive)] → supervised semantic tokens
Semantic tokens → [Flow Matching (non-autoregressive)] → mel-spectrogram
→ [HiFi-GAN] → waveform

Speaker conditioning:
Reference audio → Speaker encoder → embedding → conditions LLM + Flow Matching
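
To make "non-autoregressive" concrete: a flow matching decoder starts from noise and refines all frames together over a fixed number of ODE steps, rather than emitting one frame at a time. The toy below uses Euler integration with a closed-form velocity toward a known target. That closed form stands in for the learned network and is an assumption for illustration, as are the function names and values.

```python
import random

def velocity(x, t, target):
    """Closed-form velocity for a straight-line path to a known target.
    A real decoder replaces this with a network conditioned on tokens and speaker."""
    return [(g - xi) / (1.0 - t) for xi, g in zip(x, target)]

def sample(target, steps=10, seed=0):
    rng = random.Random(seed)
    x = [rng.gauss(0, 1) for _ in target]    # start from noise, one value per "frame"
    dt = 1.0 / steps
    for k in range(steps):
        t = k * dt
        v = velocity(x, t, target)           # every frame is updated in the same step
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x

target_mel = [0.2, -0.5, 1.0, 0.3]           # stand-in for one mel bin across 4 frames
result = sample(target_mel)
print([round(v, 6) for v in result])
```

The step count is fixed regardless of sequence length, which is where the roughly constant extra latency of this stage comes from.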

The supervised semantic token innovation: Unlike XTTS-v2 (which uses unsupervised VQ-VAE tokens) and Orpheus (which uses SNAC codec tokens), CosyVoice uses tokens derived from a multilingual speech recognition (ASR) model. These tokens have explicit alignment to text phonemes, which means:

The LLM does not need to learn text-to-phoneme alignment — it is built into the token representation
Speaker identity and prosody remain in the residual codec layers
Cross-lingual cloning works because the semantics are language-agnostic

Token hierarchy:
Layer 0: Supervised semantic tokens (ASR-derived, phone-like)
Layer 1-N: Residual acoustic tokens (timbre, prosody, style)

CosyVoice 3 upgrades:

Feature	CosyVoice 2	CosyVoice 3
Parameters	0.5B	1.5B
Training data	~10K hours	1M hours
Languages	4	9 + 18 Chinese dialects
Tokenizer	ASR-based	Multi-task (ASR + emotion + language ID + audio events)
Streaming	~150ms	~150ms
Post-training	None	DiffRO (Differentiable Reward Optimization)
Pronunciation control	No	Yes (Pinyin + CMU phoneme inpainting)

CosyVoice 3 tokenizer training: The novel tokenizer is trained on multiple tasks simultaneously:

ASR (content accuracy)
Speech emotion recognition (prosody preservation)
Language identification (multilingual consistency)
Audio event detection (non-speech sounds)
Speaker analysis (timbre preservation)

This produces tokens that carry more information than any single-task tokenizer, enabling better prosody and emotion transfer during cloning.

Zero-shot cloning usage:

from cosyvoice.cli import CosyVoice2

model = CosyVoice2('pretrained_models/CosyVoice2-0.5B')
output = model.inference_zero_shot(
    tts_text="Hello, this is a cloned voice.",
    prompt_text="Reference text spoken in the prompt audio",
    prompt_speech_16k=ref_audio_16k,
)

Limitations:

Relatively large (0.5B-1.5B) compared to lightweight models
ASR-dependent tokenizer ties quality to ASR model quality
Flow matching decoder adds ~50-100ms of latency over pure AR

5. GPT-SoVITS (RVC-Boss)

Snapshot: 2024-era open-source system | License: verify current terms | Scale: multi-component model

GPT-SoVITS is a popular few-shot voice cloning system in the open-source community. Its defining characteristic is support for short-reference and few-shot workflows, depending on setup and quality target.

Architecture:

Stage 1: GPT (Text → Semantic Tokens)
Text → Phonemes + BERT features → AR transformer (GPT) → semantic tokens
Reference audio → CNHuBERT → SSL features → conditions GPT

Stage 2: SoVITS (Semantic Tokens → Waveform)
Semantic tokens + Reference spectrogram → VITS-based decoder → waveform

Two-stage pipeline:

Raw text
  → Text processing: phoneme conversion + BERT feature extraction (1024-dim)
  → Combined with CNHuBERT SSL features from reference (768-dim)
  → GPT autoregressive decoder → semantic token sequence
  → SoVITS acoustic model (VITS variant) → mel-spectrogram
  → Neural vocoder → waveform

Key components:

Component	Detail
GPT model	Autoregressive transformer, predicts semantic tokens
SoVITS model	VITS-based, with improved posterior encoder and flow-based decoder
BERT encoder	Chinese RoBERTa (1024-dim) — contextual text embeddings
CNHuBERT	Chinese HuBERT — SSL features from reference audio
Vocoder	V3/V4: HiFi-GAN or neural vocoder (varies by version)
Min reference (zero-shot)	5 seconds
Min reference (few-shot)	1 minute
Languages	ZH, EN, JP, KO, Cantonese
Cross-lingual	Yes — clone from one language, generate in another
v2 ProPlus RTF	0.028 (4060Ti), 0.014 (4090), 0.526 (M4 CPU)
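
To read the RTF row, assume the usual definition of real-time factor (compute time divided by audio duration; the table itself does not define it). Under that assumption, compute time is simply RTF times the length of the audio. The helper names below are made up for illustration.

```python
def synthesis_seconds(rtf, audio_seconds):
    """Real-time factor = compute time / audio duration, so compute time = RTF * duration."""
    return rtf * audio_seconds

def speedup_over_realtime(rtf):
    return 1.0 / rtf

# RTF values copied from the table above.
rtf_table = {"4060Ti": 0.028, "4090": 0.014, "M4 CPU": 0.526}

for device, rtf in rtf_table.items():
    print(f"{device}: 60 s of audio in about {synthesis_seconds(rtf, 60):.2f} s "
          f"({speedup_over_realtime(rtf):.1f}x real time)")
```

Any RTF below 1.0 is faster than real time, so even the CPU figure would keep up with playback under this definition.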

Few-shot fine-tuning workflow:

1. Record/split ~1 min of clean reference audio
2. ASR alignment (automatic — built into WebUI)
3. Fine-tune GPT + SoVITS models (10-30 min on consumer GPU)
4. Inference with fine-tuned weights

Why it works with so little data: The separation of semantic and acoustic generation is key. The GPT model only needs to learn the text-to-semantic-token mapping (which generalizes from pre-training). The SoVITS model adapts the acoustic details using the reference spectrogram as a guide. The CNHuBERT SSL features provide a strong prior on speaker identity without requiring many parameters.

Limitations:

Two-stage inference is non-streaming (generate tokens → decode audio)
Quality degrades on long text without chunking
BERT and CNHuBERT dependencies add model load time
Primarily optimized for Chinese; English quality trails dedicated English models
Complex dependency chain (multiple models must be loaded)
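
One common workaround for the long-text limitation is to split the input into sentence-sized chunks and synthesize each separately. The sketch below is a generic chunker, not GPT-SoVITS's own text splitter; `chunk_text` and the 200-character default are assumptions. Joining sentences with a space is a simplification that suits English better than Chinese.

```python
import re

# Split after Western punctuation followed by whitespace, or directly after CJK punctuation.
SENTENCE_BOUNDARY = re.compile(r"(?<=[.!?])\s+|(?<=[。！？])")

def chunk_text(text, max_chars=200):
    """Group whole sentences into chunks of at most max_chars.
    A single sentence longer than max_chars is kept intact as its own chunk."""
    sentences = [s for s in SENTENCE_BOUNDARY.split(text.strip()) if s]
    chunks, current = [], ""
    for sentence in sentences:
        candidate = f"{current} {sentence}".strip()
        if current and len(candidate) > max_chars:
            chunks.append(current)
            current = sentence
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks

chunks = chunk_text("First sentence. Second one! A third? Fourth.", max_chars=30)
for chunk in chunks:
    print(chunk)
```

Each chunk can then be passed to the synthesizer on its own and the resulting audio concatenated.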

6. Fish Speech (Fish Audio)

Snapshot: Fish Speech / Fish Audio model family | License: verify release-specific terms | Scale: varies by release

Fish Speech’s distinctive architecture is the Dual-AR (Dual Autoregressive) design — two transformers running at different granularities.

Dual-AR Architecture:

Text → Slow Transformer (~4B) → Primary codebook tokens (temporal structure)
Primary tokens → Fast Transformer (~400M) → Residual codebook tokens (acoustic detail)
Combined tokens → Firefly-GAN vocoder → waveform

Why dual autoregressive: Standard single-AR models must predict all codebook levels at each step, which creates a conflict: the coarse semantic structure and the fine acoustic detail compete for modeling capacity.

Fish Speech solves this by separating the problem:

Slow AR (temporal modeling):
  "I am speaking this sentence" → primary token per frame
  Focus: content, prosody, pacing

Fast AR (acoustic modeling):
  Primary tokens → residual tokens (fine detail)
  Focus: timbre, breathiness, articulation
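
The split can be pictured as two nested loops: the slow model steps along time, and the fast model steps along codebook depth within each frame. In the sketch below, random tokens stand in for model predictions, and `slow_step`, `fast_step`, the vocabulary size, and the number of residual codebooks are all hypothetical. Only the loop structure is the point.

```python
import random

NUM_RESIDUAL = 3     # hypothetical number of residual codebooks per frame
VOCAB_SIZE = 1024    # hypothetical codebook size

def slow_step(history, rng):
    """Stand-in for the slow transformer: one primary token per frame, given prior frames."""
    return rng.randrange(VOCAB_SIZE)

def fast_step(primary, residuals_so_far, rng):
    """Stand-in for the fast transformer: next residual token within the current frame."""
    return rng.randrange(VOCAB_SIZE)

def generate(num_frames, seed=0):
    rng = random.Random(seed)
    frames = []
    for _ in range(num_frames):               # outer loop: along time
        primary = slow_step(frames, rng)
        residuals = []
        for _ in range(NUM_RESIDUAL):         # inner loop: along codebook depth
            residuals.append(fast_step(primary, residuals, rng))
        frames.append([primary] + residuals)
    return frames

frames = generate(num_frames=5)
print(len(frames), "frames x", len(frames[0]), "codebooks")
```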

S2 Pro scales this design up, pairing a ~4B slow AR with a ~400M fast AR trained on millions of hours of data.

Multilingual training scale (V1.5):

Language	Training Data	WER/CER
English	300K+ hours	3.5% WER
Chinese	300K+ hours	1.3% CER
Japanese	100K+ hours	—
German, French, Spanish, Korean, Arabic, Russian	~20K hours each	—
Dutch, Italian, Polish, Portuguese	<10K hours each	—

S2 Pro upgrades:

Feature	V1.5	S2 Pro
Slow AR	~500M	~4B
Fast AR	~100M	~400M
Training data	1M hours	10M+ hours
Languages	13	80+
Voice cloning	Zero-shot (10-15s)	Zero-shot (10-30s)
Latency	~200ms	~100ms (via SGLang)
ELO (TTS Arena)	1011	1339
Multi-speaker	No	Yes (native via `<|speaker:i|>` tokens)

Voice cloning mechanism: Fish Speech uses in-context learning. Reference audio is encoded through the same codec pipeline, producing token sequences that are included as context in the autoregressive generation. The Slow AR attends to the reference token patterns to reproduce the speaker’s timbre and style.

# Conceptual: S2 Pro voice cloning
ref_tokens = encode_audio(reference_wav)  # 10-30 seconds
prompt = ref_tokens + text_tokens
primary_tokens = slow_ar.generate(prompt)            # coarse tokens, one per frame
residual_tokens = fast_ar.generate(primary_tokens)   # fine acoustic detail per frame
waveform = firefly_gan.decode(primary_tokens, residual_tokens)

Firefly-GAN vocoder: Fish Speech’s vocoder uses depthwise/dilated convolutions with grouped scalar vector quantization. It achieves near 100% codebook utilization (unlike RVQ codecs which typically leave many codes unused), enabling more expressive and detailed audio output.

Limitations:

CC-BY-NC-SA license prevents commercial use (V1.5)
S2 Pro is effectively proprietary (weights available, inference via API)
Dual-AR adds architectural complexity vs single-model approaches
S2 Pro requires significant GPU resources at ~4.4B parameters

7. Orpheus (Canopy Labs)

Snapshot: 2025-era model family | License: verify current terms | Scale: large local model

Covered in depth in the Orpheus deep dive, but relevant here for its cloning approach.

Voice cloning mechanism: Orpheus does not use a speaker encoder. Instead, voice cloning is done through fine-tuning the Llama-3.2-3B backbone on target speaker data. The fine-tuned model has 8 preset voices; custom voices require additional fine-tuning.

Preset voices: tara, leah, jess, leo, dan, mia, zac, zoe
→ Built from fine-tuning on ~50-300 examples per voice

Custom voice cloning:
→ Collect 50-300 audio examples per speaker
→ Fine-tune Llama backbone (standard HuggingFace Trainer)
→ Model learns to associate the speaker's acoustic patterns with text

The pretrained model supports a limited form of in-context conditioning by including text-speech pairs in the prompt, but this is less reliable than dedicated speaker encoder approaches.

Key specs:

8 preset voices (English)
7 language pairs in multilingual research release
3B parameters — requires GPU
Apache 2.0 license
Fine-tuning data: 50-300 examples per speaker recommended

8. Qwen3-TTS (Alibaba / Qwen Team)

Snapshot: 2026-era model family | License: verify current terms | Scale: varies by release

Covered in depth in the Qwen3-TTS deep dive. Its voice cloning approach is distinct.

Voice cloning mechanism: Qwen3-TTS uses a learnable speaker encoder trained jointly with the dual-track LM backbone. Reference audio (3 seconds) is encoded through the Qwen-TTS-Tokenizer, and a speaker embedding is extracted and used to condition every generation step.

Reference audio (3s) → Qwen-TTS-Tokenizer → speech codes
                     → Learnable speaker encoder → speaker embedding (conditions LM)

Dual-track LM: text + speaker embedding → 12Hz codec codes → causal ConvNet → waveform

Evaluation points:

Short-reference cloning behavior
Supported languages
Streaming latency on target hardware
Speaker similarity across long passages
Text-description-based voice design, if available in the release

9. Chatterbox (Resemble AI)

Snapshot: 2025-era model family | License: verify current terms | Scale: medium local model

Covered in the Chatterbox deep dive. Chatterbox takes a three-stage approach to cloning.

Voice cloning mechanism:

Reference audio (5s+)
  → S3 tokenizer → 150 speech tokens (conditioning prompt)
  → CAMPPlus speaker encoder → 256-dim x-vector (speaker embedding)
  → Both condition the T3 Llama backbone

T3 (AR): text + conditioning → S3 tokens
S3Gen (CFM): S3 tokens → mel-spectrogram
HiFi-GAN: mel → waveform

Evaluation points:

Short-reference cloning behavior
English and multilingual release differences
Expressiveness controls available in the current release
License terms
Provenance or watermarking support, if available

Comparison Table

Model	Cloning Method	Ref Audio	Languages	Scale	License	Streaming	Vocoder	Snapshot
XTTS-v2	Conditioning (Perceiver)	Short to medium	Multilingual	Large	Verify	Runtime-dependent	HiFi-GAN	2023
OpenVoice V2	Conditioning (decoupled)	Short	Multilingual	Smaller	Verify	Runtime-dependent	HiFi-GAN	2024
ElevenLabs	Conditioning + fine-tune products	Product-dependent	Multilingual	Unknown	Proprietary	Cloud/network dependent	Proprietary	2023+
CosyVoice	Conditioning (ASR tokens)	Short to medium	Multilingual	Varies	Verify	Runtime-dependent	HiFi-GAN	2024+
GPT-SoVITS	Fine-tune / few-shot	Short to medium	Multilingual	Multi-component	Verify	Runtime-dependent	VITS/HiFi-GAN	2024
Fish Speech	In-context (Dual-AR)	Short to medium	Multilingual	Varies	Verify	Runtime-dependent	Firefly-GAN	2024+
Orpheus	Fine-tune / prompted variants	Release-dependent	English-focused / research multilingual variants	Large	Verify	Runtime-dependent	SNAC	2025
Qwen3-TTS	Conditioning (speaker encoder)	Short-reference claims to verify	Multilingual releases	Varies	Verify	Runtime-dependent	Release-dependent	2026
Chatterbox	Conditioning	Short-reference cloning	English-focused / release-dependent	Medium	Verify	Runtime-dependent	HiFi-GAN	2025

Cloning method legend: Conditioning = speaker embedding extracted at inference; Fine-tune = weights updated per speaker; In-context = audio tokens in prompt

Decision Guide: Which Model to Use

Your Priority	Candidate Models	What to Verify
Quality	ElevenLabs, Fish Speech, Qwen-family, Chatterbox	Current output quality on your text and voice
Language coverage	Fish Speech, CosyVoice, Qwen-family, cloud services	Current model card and target-language pronunciation
Small local setup	OpenVoice, lightweight local models	Hardware needs and acceptable quality
Expressiveness	Chatterbox, Orpheus-style, cloud services	Supported controls in the exact release
Low latency	Cloud realtime products, Qwen-family, optimized local runtimes	Measured latency on target hardware/network
Few-shot fine-tuning	GPT-SoVITS, adaptation-based systems	Data requirement, setup complexity, license
Commercial use	CosyVoice, OpenVoice, Chatterbox, other permissive releases	Current license and model provenance
Voice design	Qwen-family or cloud voice design products	Whether it is cloning, voice design, or both
Privacy / offline	OpenVoice, CosyVoice, Chatterbox, XTTS-style local tools	Whether all processing stays local
Cross-lingual cloning	OpenVoice, XTTS-style, cloud services	Target-language quality and consent scope

Common Failure Modes Across All Models

Failure Mode	Cause	Affected Models
Timbre drift	Reference too short or noisy	All models
Style bleeding	Cloned voice inherits undesired emotion from reference	XTTS-v2, CosyVoice, Chatterbox
Cross-lingual accent	Language-specific phone inventory mismatch	XTTS-v2, GPT-SoVITS
Repetition loops	AR model gets stuck on tokens	Orpheus, GPT-SoVITS, Fish Speech
Robotic prosody	Conditioning vector loses prosodic nuance	OpenVoice (base TTS bottleneck)
Poor consistency	Embedding varies between generations	XTTS-v2 (Perceiver instability)
Vocoder artifacts	Out-of-distribution acoustic features	All GAN-based vocoders
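
Repetition loops are one failure mode that can be caught cheaply during decoding. The guard below checks whether a token sequence ends in a short pattern repeated several times. `has_repetition_loop` and its thresholds are hypothetical and would need tuning per model, since sustained sounds can legitimately repeat tokens.

```python
def has_repetition_loop(tokens, max_period=8, min_repeats=4):
    """True if the sequence ends with a pattern of length <= max_period
    repeated at least min_repeats times in a row."""
    for period in range(1, max_period + 1):
        span = period * min_repeats
        if len(tokens) < span:
            break                               # longer periods need even more tokens
        tail = tokens[-span:]
        pattern = tail[:period]
        if all(tail[i] == pattern[i % period] for i in range(span)):
            return True
    return False

healthy = [1, 2, 3, 4, 5, 6, 7, 8, 9]
stuck = [1, 2, 3, 7, 8, 7, 8, 7, 8, 7, 8]       # ends in (7, 8) repeated four times

print(has_repetition_loop(healthy), has_repetition_loop(stuck))
```

A decoder loop could call this every few steps and stop or resample when it fires.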

Summary

Voice cloning models fall into three technical categories (conditioning, fine-tuning, in-context learning), and the best choice depends on your constraints. The open-source landscape in 2026 offers several useful options, but model cards, licenses, language support, and runtime behavior change quickly. Verify the exact release before using any model commercially.

The proprietary frontier can still be strong on hosted quality and production tooling, while local models keep improving for privacy-sensitive and offline workflows.
