Voice cloning is the ability to synthesize speech that sounds like a specific person, given a reference audio sample of that person speaking. It is distinct from multi-speaker TTS (which selects from a fixed set of pre-enrolled voices) and from voice conversion (which transforms one speaker’s voice into another’s while preserving content).
This article surveys several major voice cloning models — open-source and proprietary — covering how each approaches the problem architecturally, what tradeoffs they make, and where the field stands as of 2026.
The Three Core Approaches
Most voice cloning systems can be understood through one of three categories:
| Approach | How It Works | Reference Audio Needed | Quality | Use Case |
|---|---|---|---|---|
| Speaker Adaptation | Fine-tune model weights on target voice | Often minutes or more | High consistency when tuned well | Production voices, consistent output |
| Speaker Conditioning | Extract embedding, condition generation | Often seconds to minutes | Strong for quick cloning | Quick cloning, instant use |
| In-Context Learning | Prompt model with reference tokens | Often short samples | Model-dependent | Zero-shot or few-shot workflows |
Speaker Adaptation
The model receives additional training on target-speaker audio. Gradients update the weights to specialize for that voice.
Base model weights → Fine-tune on target audio (100-1000 steps) → Specialized weights
Pros: Highest quality, most consistent across diverse text. Captures subtle prosodic patterns.
Cons: Takes minutes to hours of training. Requires a GPU. Storage grows with each speaker, because every voice needs its own weights or adapter. Cannot switch speakers without loading a different set of weights.
Used by: ElevenLabs Professional Voice Cloning, GPT-SoVITS (few-shot mode), Orpheus fine-tuning, Coqui XTTS fine-tuning.
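To make the "gradients update the weights" step concrete, here is a minimal sketch in pure Python. The two-weight dot-product "model", the example data, and the learning rate are all illustrative assumptions, not taken from any of the systems listed; the point is only that each speaker ends up with its own copy of the weights while the base model stays unchanged.

```python
def predict(weights, features):
    """Toy 'model': a dot product standing in for a full TTS network."""
    return sum(w * x for w, x in zip(weights, features))

def fine_tune(base_weights, examples, steps=200, lr=0.05):
    """Return speaker-specific weights; the base weights are left untouched."""
    weights = list(base_weights)  # every speaker gets its own copy
    for _ in range(steps):
        for features, target in examples:
            error = predict(weights, features) - target
            # Gradient of the squared error w.r.t. each weight is 2 * error * x.
            weights = [w - lr * 2 * error * x for w, x in zip(weights, features)]
    return weights

base = [0.0, 0.0]
# Hypothetical "target speaker" data, consistent with weights [1.5, -0.5].
speaker_examples = [([1.0, 0.0], 1.5), ([0.0, 1.0], -0.5), ([1.0, 1.0], 1.0)]
adapted = fine_tune(base, speaker_examples)
print([round(w, 3) for w in adapted])
```

In a real system the copied weights are the full network or a small adapter, which is why storage and load time scale with the number of enrolled voices.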
Speaker Conditioning
The model extracts a fixed-dimensional vector (embedding) from the reference audio and injects it as a conditioning signal at generation time. No weight updates.
Reference audio → Speaker encoder → d-vector / embedding → Conditions AR decoder
Text → AR decoder (conditioned) → Audio tokens → Vocoder → Waveform
Pros: Instant, no waiting. Single model handles unlimited speakers. Switch speakers between sentences.
Cons: Quality capped by encoder capacity. Less consistent across diverse text styles. Sensitive to reference audio quality.
Used by: XTTS-v2, OpenVoice, CosyVoice, Chatterbox, Qwen3-TTS.
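As a rough illustration of what "fixed-dimensional" means here, the sketch below mean-pools per-frame features into one normalised vector and compares clips with cosine similarity. The feature values are made up, and mean pooling is a deliberately simple stand-in for a trained speaker encoder, not the method any listed model uses.

```python
import math

def speaker_embedding(frames):
    """Mean-pool per-frame feature vectors into one fixed-size, L2-normalised vector."""
    dim = len(frames[0])
    mean = [sum(frame[i] for frame in frames) / len(frames) for i in range(dim)]
    norm = math.sqrt(sum(v * v for v in mean)) or 1.0
    return [v / norm for v in mean]

def cosine_similarity(a, b):
    """Both inputs are unit vectors, so the dot product is the cosine."""
    return sum(x * y for x, y in zip(a, b))

# Hypothetical per-frame features (real systems use mel frames or SSL features).
clip_a1 = [[1.0, 0.1, 0.0], [0.9, 0.2, 0.1], [1.1, 0.0, 0.0]]
clip_a2 = [[1.0, 0.2, 0.1], [0.8, 0.1, 0.0]]  # same speaker, different length
clip_b = [[0.0, 0.1, 1.0], [0.1, 0.0, 0.9], [0.0, 0.2, 1.1], [0.1, 0.1, 1.0]]

emb_a1, emb_a2, emb_b = (speaker_embedding(c) for c in (clip_a1, clip_a2, clip_b))
same = cosine_similarity(emb_a1, emb_a2)
different = cosine_similarity(emb_a1, emb_b)
print(f"same speaker: {same:.3f}, different speaker: {different:.3f}")
```

The embedding size does not depend on clip length, which is what lets one model serve any number of speakers. The same cosine check is a common way to spot timbre drift between a reference and generated audio.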
In-Context Learning
The model is a language model trained on interleaved text and audio tokens. Cloning is done by including reference audio tokens in the prompt, similar to few-shot prompting in LLMs.
Prompt: [ref_audio_tokens] [ref_text] [gen_marker] [target_text]
→ LLM autoregressively generates target audio tokens
Pros: No explicit speaker encoder needed. Can leverage any number of reference examples. Emergent cross-lingual transfer.
Cons: Prompt length grows with reference. May overfit to reference prosody. Sensitive to prompt formatting.
Used by: Orpheus (pretrained mode), CosyVoice (zero-shot mode), Fish Speech.
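The prompt layout above can be sketched as plain list assembly. The special token strings and the whitespace "tokenizer" below are hypothetical placeholders; each real model defines its own vocabulary and formatting, which is part of why this approach is sensitive to prompt details.

```python
REF_AUDIO, GEN_MARKER = "<ref_audio>", "<gen>"  # hypothetical special tokens

def build_prompt(references, target_text):
    """Concatenate (audio_tokens, transcript) pairs, then the marker and target text."""
    prompt = []
    for audio_tokens, transcript in references:
        prompt.append(REF_AUDIO)
        prompt.extend(f"a{t}" for t in audio_tokens)  # audio codes as string tokens
        prompt.extend(transcript.split())             # whitespace split as a toy tokenizer
    prompt.append(GEN_MARKER)
    prompt.extend(target_text.split())
    return prompt

one_shot = build_prompt([([12, 7, 93, 41], "hello there")], "good morning")
two_shot = build_prompt(
    [([12, 7, 93, 41], "hello there"), ([5, 88, 3], "nice day")], "good morning"
)
print(one_shot)
print(len(one_shot), len(two_shot))
```

Adding a second reference lengthens the prompt by exactly that reference's token count, which is the "prompt length grows with reference" cost in practice.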
Model-by-Model Architecture Survey
1. XTTS-v2 (Coqui)
Snapshot: 2023-era release | License: verify current terms before commercial use | Scale: large open model
XTTS-v2 is a GPT2-based autoregressive model with a Perceiver-based speaker conditioning mechanism.
Architecture:
Reference audio (6s+) → Mel-spectrogram → Perceiver encoder → 32 latent vectors
Text → GPT2 backbone (cross-attends to latents)
→ Discrete audio tokens (VQ-VAE codes) → HiFi-GAN → 24kHz waveform
Key components:
| Component | Detail |
|---|---|
| Backbone | GPT2 (decoder-only transformer) |
| Speaker encoder | Perceiver — inputs mel-spectrogram, outputs 32 fixed latent vectors |
| Conditioning | Latent vectors prefixed to GPT2 input sequence (similar to a soft prompt) |
| Audio tokenizer | VQ-VAE with discrete codebook |
| Vocoder | HiFi-GAN |
| Languages | 17 (en, es, fr, de, it, pt, pl, tr, ru, nl, cs, ar, zh-cn, ja, hu, ko, hi) |
| Min reference | ~6 seconds |
| Streaming | ~150ms latency (Pure PyTorch, consumer GPU) |
Perceiver mechanism detail: The Perceiver replaces the simpler encoder used in XTTS-v1. Instead of a single vector, it produces 32 latent vectors that capture different aspects of the speaker’s voice. These vectors are prepended to the GPT2’s input sequence as a “soft prompt.” This design:
- Allows multiple reference audio clips (concatenated before encoding)
- Enables speaker interpolation (blending two references)
- Produces more consistent speaker identity than single-vector approaches
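The following pure-Python sketch shows the general Perceiver-style idea of a fixed set of latent queries cross-attending over a variable number of frames, plus linear blending of two latent sets. The queries and frame features are hand-built stand-ins for learned parameters and mel frames, and nothing here reproduces the actual XTTS-v2 code.

```python
import math

def cross_attend(latents, frames):
    """Each latent query attends over all frames; output count equals len(latents)."""
    dim = len(frames[0])
    scale = 1.0 / math.sqrt(dim)
    outputs = []
    for query in latents:
        scores = [scale * sum(q * k for q, k in zip(query, frame)) for frame in frames]
        peak = max(scores)
        weights = [math.exp(s - peak) for s in scores]  # numerically stable softmax
        total = sum(weights)
        outputs.append([
            sum(w * frame[i] for w, frame in zip(weights, frames)) / total
            for i in range(dim)
        ])
    return outputs

def interpolate(latents_a, latents_b, alpha):
    """Blend two speakers' latent sets; alpha=0 gives A, alpha=1 gives B."""
    return [[(1 - alpha) * a + alpha * b for a, b in zip(va, vb)]
            for va, vb in zip(latents_a, latents_b)]

NUM_LATENTS, DIM = 32, 4
# Fixed, hand-built queries stand in for learned latent parameters.
queries = [[math.sin(i + j) for j in range(DIM)] for i in range(NUM_LATENTS)]
short_clip = [[math.cos(t * (j + 1)) for j in range(DIM)] for t in range(50)]
long_clip = [[math.sin(t * (j + 1)) for j in range(DIM)] for t in range(300)]

lat_short = cross_attend(queries, short_clip)
lat_long = cross_attend(queries, long_clip)
blend = interpolate(lat_short, lat_long, 0.5)
print(len(lat_short), len(lat_long), len(blend))
```

Both the 50-frame and the 300-frame clip come out as 32 vectors, which is why concatenating several reference clips before encoding requires no change downstream.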
XTTS-v2 pipeline:
from TTS.api import TTS

tts = TTS("tts_models/multilingual/multi-dataset/xtts_v2", gpu=True)
tts.tts_to_file(
    text="Hello, this is a cloned voice.",
    speaker_wav="/path/to/reference.wav",
    language="en",
    file_path="output.wav",
)
Limitations:
- Non-commercial license (CPML)
- Perceiver occasionally produces unstable speaker embeddings with very short (<3s) references
- Coqui AI shut down in 2024 — no further development; community forks exist
2. OpenVoice V2 (MyShell / MIT)
Snapshot: V1/V2-era releases | License: verify current terms | Scale: smaller local model family
OpenVoice’s defining innovation is decoupling tone color from voice style. Most cloning systems couple these — the cloned voice inherits both the timbre and the prosody/accent/emotion of the reference. OpenVoice separates them.
Architecture:
Reference audio → Tone Color Encoder → tone color embedding
Target text → Base Speaker TTS → neutral mel-spectrogram
→ Tone Color Converter (VS network) → mel with target timbre
→ HiFi-GAN → waveform
Style params (emotion, accent, speed) → Style Encoder → style embedding
Decoupled design:
The OpenVoice pipeline runs in two stages:
1. A single-speaker TTS generates neutral speech from text
2. A tone color converter transplants the reference speaker's timbre onto the neutral speech, guided by the style embedding
Key components:
| Component | Detail |
|---|---|
| Base TTS | Single-speaker model (trained on ~30K sentences from 4 speakers) |
| Tone color encoder | Extracts speaker embedding (timbre only, stripped of style) |
| Style encoder | Extracts emotion/accent from reference |
| Tone color converter | Visinger (VS) network with flow layers — transplants timbre |
| Vocoder | HiFi-GAN |
| Min reference | ~1 second |
| Languages | 6 native (EN, ES, FR, ZH, JA, KO), zero-shot cross-lingual for others |
| Style control | Emotion (happy, sad, angry), accent (British, Indian, Australian), speed |
Why decoupling matters: In XTTS-v2 and similar models, the cloned voice inherits the reference speaker’s emotion and accent. If your reference is a sad recording, the clone sounds sad. OpenVoice can take timbre from reference A and emotion from reference B (or from a text description).
# Conceptual: separate timbre and style
timbre = extract_tone_color("reference.wav")
style = extract_style("reference.wav") # or synthetic: style = "happy, british accent"
output = synthesize(text, timbre=timbre, style=style)
Zero-shot cross-lingual mechanism: Because the base TTS model speaks fluently in multiple languages, and the tone color converter only modifies timbre, the cloned voice inherits the base TTS’s pronunciation. This means:
- A voice cloned from English can speak French fluently (using the French base TTS)
- Works for languages the tone color converter never trained on
- No multilingual training data required for new languages
Limitations:
- Two-pass pipeline adds a tone color conversion pass on top of base TTS inference, which increases total inference time
- Tone color converter can introduce artifacts on extreme style combinations
- Base TTS quality caps overall output quality
- V2 quality improved significantly but still trails end-to-end models for naturalness
3. ElevenLabs (Proprietary)
Snapshot: cloud service, proprietary model family
ElevenLabs operates at a scale and with a level of investment that no open-source project matches. Its architecture is not public, but enough is known from documentation and analysis to reconstruct a likely design.
Two cloning tiers:
| Feature | Instant Voice Cloning | Professional Voice Cloning |
|---|---|---|
| Mechanism | Few-shot conditioning at inference | Fine-tuning model weights |
| Reference audio | 1-5 minutes | 30+ minutes |
| Turnaround | Seconds | Minutes |
| Quality | Good, style-dependent | Excellent, consistent |
| Use case | Quick prototyping | Production, brand voices |
Likely architecture (reconstructed):
Instant VC:
Reference audio → Learned speaker encoder → conditioning vector(s)
Text → Large AR decoder (cross-attends to conditioning) → audio codes
→ Neural codec decoder → waveform
Professional VC:
Base model → LoRA or full fine-tune on target speaker → specialized weights
→ Same AR decoder → consistently cloned output
What is publicly described or commonly reported:
- Models: multiple hosted TTS and voice models with multilingual support
- Latency: low-latency options are available, but exact performance depends on model, region, and network
- Speech-to-Speech: Separate model that transforms input audio while preserving content and emotion
- Voice Design: Text-prompted voice creation (not cloning, but generative)
- Voice Library: community and licensed voice workflows are available
Key differentiators:
- Data scale: proprietary datasets and production feedback loops
- Model scale: undisclosed
- Audio quality: strong hosted output in many workflows
- Emotion control: model- and product-specific controls
Why ElevenLabs is hard to replicate:
- Training data — years of curated, high-quality multilingual speech from professional sources
- Model architecture — custom-designed, not a stock Llama or GPT2
- Scale — likely trained on thousands of GPUs with proprietary infrastructure
- Perceptual optimization — models optimized through human evaluation loops, not just loss curves
Limitations:
- Proprietary — you do not control latency, availability, or pricing
- API-dependent — no offline inference
- Cost — scales linearly with usage
- Voice cloning quality is inconsistent across reference styles
4. CosyVoice (Alibaba / FunAudioLLM)
Snapshot: actively developed open model family | License: verify current release terms | Scale: varies by release
CosyVoice is an actively developed open-source voice cloning model family. Its architecture uses supervised semantic tokens — a key innovation.
Architecture (CosyVoice 2):
Text → [LLM (autoregressive)] → supervised semantic tokens
Semantic tokens → [Flow Matching (non-autoregressive)] → mel-spectrogram
→ [HiFi-GAN] → waveform
Speaker conditioning:
Reference audio → Speaker encoder → embedding → conditions LLM + Flow Matching
The supervised semantic token innovation: Unlike XTTS-v2 (which uses unsupervised VQ-VAE tokens) and Orpheus (which uses SNAC codec tokens), CosyVoice uses tokens derived from a multilingual speech recognition (ASR) model. These tokens have explicit alignment to text phonemes, which means:
- The LLM does not need to learn text-to-phoneme alignment — it is built into the token representation
- Speaker identity and prosody remain in the residual codec layers
- Cross-lingual cloning works because the semantics are language-agnostic
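The idea of a coarse first layer followed by residual layers can be illustrated with a generic residual quantization toy. The scalar codebooks below are invented for the example, and this is not CosyVoice's tokenizer; it only shows how each later layer encodes what the earlier layers left out.

```python
def nearest(codebook, value):
    """Index of the codebook entry closest to a scalar value."""
    return min(range(len(codebook)), key=lambda i: abs(codebook[i] - value))

def residual_quantize(value, codebooks):
    """Quantize in layers: each layer encodes what the previous layers missed."""
    codes, residual = [], value
    for codebook in codebooks:
        index = nearest(codebook, residual)
        codes.append(index)
        residual -= codebook[index]
    return codes, residual

def reconstruct(codes, codebooks):
    """Sum the selected entry from each layer."""
    return sum(cb[i] for cb, i in zip(codebooks, codes))

# Hypothetical scalar codebooks: layer 0 is coarse, later layers refine.
codebooks = [
    [-1.0, 0.0, 1.0],   # layer 0: coarse structure
    [-0.3, 0.0, 0.3],   # layer 1: first residual
    [-0.1, 0.0, 0.1],   # layer 2: finer residual
]
codes, leftover = residual_quantize(0.72, codebooks)
coarse_only = reconstruct(codes[:1], codebooks[:1])
full = reconstruct(codes, codebooks)
print(codes, coarse_only, round(full, 2), round(leftover, 2))
```

Decoding from layer 0 alone gives a rough value, and adding the residual layers moves it closer to the input, mirroring how content can sit in one layer while finer acoustic detail sits in the others.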
Token hierarchy:
Layer 0: Supervised semantic tokens (ASR-derived, phone-like)
Layer 1-N: Residual acoustic tokens (timbre, prosody, style)
CosyVoice 3 upgrades:
| Feature | CosyVoice 2 | CosyVoice 3 |
|---|---|---|
| Parameters | 0.5B | 1.5B |
| Training data | ~10K hours | 1M hours |
| Languages | 4 | 9 + 18 Chinese dialects |
| Tokenizer | ASR-based | Multi-task (ASR + emotion + language ID + audio events) |
| Streaming | ~150ms | ~150ms |
| Post-training | None | DiffRO (Differentiable Reward Optimization) |
| Pronunciation control | No | Yes (Pinyin + CMU phoneme inpainting) |
CosyVoice 3 tokenizer training: The novel tokenizer is trained on multiple tasks simultaneously:
- ASR (content accuracy)
- Speech emotion recognition (prosody preservation)
- Language identification (multilingual consistency)
- Audio event detection (non-speech sounds)
- Speaker analysis (timbre preservation)
This produces tokens that carry more information than any single-task tokenizer, enabling better prosody and emotion transfer during cloning.
Zero-shot cloning usage:
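# Sketch based on the CosyVoice 2 README; argument names can differ between releases.
from cosyvoice.cli.cosyvoice import CosyVoice2
from cosyvoice.utils.file_utils import load_wav

model = CosyVoice2('pretrained_models/CosyVoice2-0.5B')
ref_audio_16k = load_wav('/path/to/reference.wav', 16000)  # prompt audio at 16 kHz
# inference_zero_shot is a generator; each item carries a 'tts_speech' tensor.
for chunk in model.inference_zero_shot(
    tts_text="Hello, this is a cloned voice.",
    prompt_text="Reference text spoken in the prompt audio",
    prompt_speech_16k=ref_audio_16k,
):
    speech = chunk['tts_speech']
Limitations: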
- Relatively large (0.5B-1.5B) compared to lightweight models
- ASR-dependent tokenizer ties quality to ASR model quality
- Flow matching decoder adds ~50-100ms additional latency over pure AR
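To show where the extra decoder latency comes from, here is a toy Euler sampler for a flow-matching-style ODE. The velocity function uses the known straight-path field toward a fixed target, which is an assumption made so the example runs without a trained network; in a real decoder a neural network predicts the velocity from semantic tokens and the speaker embedding.

```python
def velocity(x, t, target):
    """Straight-path velocity field that carries x to target at t=1.
    A trained model would predict this instead of being handed the target."""
    return [(g - xi) / (1.0 - t) for xi, g in zip(x, target)]

def euler_sample(noise, target, num_steps):
    """Integrate dx/dt = v(x, t) from t=0 to t=1 with fixed-size Euler steps."""
    x, dt = list(noise), 1.0 / num_steps
    for step in range(num_steps):
        t = step * dt
        v = velocity(x, t, target)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x

noise = [0.5, -1.2, 0.3]        # stand-in for Gaussian noise
target_mel = [1.0, 2.0, -0.5]   # stand-in for one mel frame
result = euler_sample(noise, target_mel, num_steps=10)
print([round(v, 6) for v in result])
```

Each step is one network evaluation over the whole chunk, so decoder cost scales with `num_steps` rather than with the number of output frames generated one at a time.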
5. GPT-SoVITS (RVC-Boss)
Snapshot: 2024-era open-source system | License: verify current terms | Scale: multi-component model
GPT-SoVITS is a popular few-shot voice cloning system in the open-source community. Its defining characteristic is support for short-reference and few-shot workflows, depending on setup and quality target.
Architecture:
Stage 1: GPT (Text → Semantic Tokens)
Text → Phonemes + BERT features → AR transformer (GPT) → semantic tokens
Reference audio → CNHuBERT → SSL features → conditions GPT
Stage 2: SoVITS (Semantic Tokens → Waveform)
Semantic tokens + Reference spectrogram → VITS-based decoder → waveform
Two-stage pipeline:
Raw text
→ Text processing: phoneme conversion + BERT feature extraction (1024-dim)
→ Combined with CNHuBERT SSL features from reference (768-dim)
→ GPT autoregressive decoder → semantic token sequence
→ SoVITS acoustic model (VITS variant) → mel-spectrogram
→ Neural vocoder → waveform
Key components:
| Component | Detail |
|---|---|
| GPT model | Autoregressive transformer, predicts semantic tokens |
| SoVITS model | VITS-based, with improved posterior encoder and flow-based decoder |
| BERT encoder | Chinese RoBERTa (1024-dim) — contextual text embeddings |
| CNHuBERT | Chinese HuBERT — SSL features from reference audio |
| Vocoder | V3/V4: HiFi-GAN or neural vocoder (varies by version) |
| Min reference (zero-shot) | 5 seconds |
| Min reference (few-shot) | 1 minute |
| Languages | ZH, EN, JP, KO, Cantonese |
| Cross-lingual | Yes — clone from one language, generate in another |
| v2 ProPlus RTF | 0.028 (4060Ti), 0.014 (4090), 0.526 (M4 CPU) |
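The RTF (real-time factor) figures in the table are easier to read as wall-clock time. Assuming the usual definition, RTF = processing time / audio duration, a short calculation converts them; the 60-second clip length is an arbitrary choice for illustration.

```python
def synthesis_seconds(audio_seconds, rtf):
    """RTF = processing time / audio duration, so processing time = RTF * duration."""
    return audio_seconds * rtf

# RTF values quoted in the table above.
rtf_by_device = {"4060Ti": 0.028, "4090": 0.014, "M4 CPU": 0.526}
for device, rtf in rtf_by_device.items():
    seconds = synthesis_seconds(60.0, rtf)
    print(f"{device}: {seconds:.2f} s for 60 s of audio ({1 / rtf:.0f}x real time)")
```

Any RTF below 1.0 means synthesis runs faster than playback, so even the CPU figure is usable for offline batch generation.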
Few-shot fine-tuning workflow:
1. Record/split ~1 min of clean reference audio
2. ASR alignment (automatic — built into WebUI)
3. Fine-tune GPT + SoVITS models (10-30 min on consumer GPU)
4. Inference with fine-tuned weights
Why it works with so little data: The separation of semantic and acoustic generation is key. The GPT model only needs to learn the text-to-semantic-token mapping (which generalizes from pre-training). The SoVITS model adapts the acoustic details using the reference spectrogram as a guide. The CNHuBERT SSL features provide a strong prior on speaker identity without requiring many parameters.
Limitations:
- Two-stage inference is non-streaming (generate tokens → decode audio)
- Quality degrades on long text without chunking
- BERT and CNHuBERT dependencies add model load time
- Primarily optimized for Chinese; English quality trails dedicated English models
- Complex dependency chain (multiple models must be loaded)
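For the long-text limitation, a common workaround is to split input into sentence-sized chunks and synthesize each one separately. The helper below is a generic sketch, not GPT-SoVITS's own splitter; the punctuation set and the `max_chars` budget are assumptions, and the naive split will also break on decimals such as "3.5".

```python
import re

def chunk_text(text, max_chars=80):
    """Split on sentence punctuation, then pack sentences into chunks under max_chars."""
    sentences = [s.strip() for s in re.split(r"(?<=[.!?。！？])\s*", text) if s.strip()]
    chunks, current = [], ""
    for sentence in sentences:
        candidate = f"{current} {sentence}".strip()
        if current and len(candidate) > max_chars:
            chunks.append(current)   # budget exceeded: close the current chunk
            current = sentence       # a single long sentence is kept whole
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks

text = ("First sentence here. Second one is a bit longer than the first! "
        "Third? Fourth sentence closes it.")
chunks = chunk_text(text, max_chars=40)
for chunk in chunks:
    print(repr(chunk))
```

Each chunk would then go through synthesis on its own and the audio segments would be concatenated, ideally reusing the same reference so the timbre stays consistent.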
6. Fish Speech (Fish Audio)
Snapshot: Fish Speech / Fish Audio model family | License: verify release-specific terms | Scale: varies by release
Fish Speech’s distinctive architecture is the Dual-AR (Dual Autoregressive) design — two transformers running at different granularities.
Dual-AR Architecture:
Text → Slow Transformer (~4B) → Primary codebook tokens (temporal structure)
Primary tokens → Fast Transformer (~400M) → Residual codebook tokens (acoustic detail)
Combined tokens → Firefly-GAN vocoder → waveform
Why dual autoregressive: Standard single-AR models must predict all codebook levels at each step, which creates a conflict: the coarse semantic structure and the fine acoustic detail compete for modeling capacity.
Fish Speech solves this by separating the problem:
Slow AR (temporal modeling):
"I am speaking this sentence" → primary token per frame
Focus: content, prosody, pacing
Fast AR (acoustic modeling):
Primary tokens → residual tokens (fine detail)
Focus: timbre, breathiness, articulation
S2 Pro elevates this further with a ~4B slow AR and ~400M fast AR, trained on millions of hours of data.
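The division of labour between the two transformers can be sketched as two nested loops. The `slow_step` and `fast_step` functions are deterministic toy stand-ins for the real networks, and the choice to feed only the primary token back into the slow model is a simplification made for this sketch; actual implementations may feed back richer per-frame information.

```python
def slow_step(history):
    """Toy stand-in for the slow transformer: one primary token per frame,
    conditioned on everything in the history so far."""
    return (sum(history) * 31 + len(history) * 7) % 1024

def fast_step(primary, residuals_so_far):
    """Toy stand-in for the fast transformer: one residual token, conditioned
    on this frame's primary token and the residuals already produced."""
    return (primary * 17 + sum(residuals_so_far) * 13 + len(residuals_so_far)) % 1024

def dual_ar_generate(prompt_tokens, num_frames, num_residual_codebooks):
    history, frames = list(prompt_tokens), []
    for _ in range(num_frames):                  # outer loop: along time
        primary = slow_step(history)
        residuals = []
        for _ in range(num_residual_codebooks):  # inner loop: along codebook depth
            residuals.append(fast_step(primary, residuals))
        frames.append([primary] + residuals)
        history.append(primary)                  # simplification: only primary feeds back
    return frames

frames = dual_ar_generate(prompt_tokens=[3, 14, 15], num_frames=5, num_residual_codebooks=7)
print(len(frames), len(frames[0]))
```

The expensive model runs once per frame while the cheap model runs once per codebook within a frame, which is how the design keeps long-range modelling and fine acoustic detail from competing for the same capacity.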
Multilingual training scale (V1.5):
| Language | Training Data | WER/CER |
|---|---|---|
| English | 300K+ hours | 3.5% WER |
| Chinese | 300K+ hours | 1.3% CER |
| Japanese | 100K+ hours | — |
| German, French, Spanish, Korean, Arabic, Russian | ~20K hours each | — |
| Dutch, Italian, Polish, Portuguese | <10K hours each | — |
S2 Pro upgrades:
| Feature | V1.5 | S2 Pro |
|---|---|---|
| Slow AR | ~500M | ~4B |
| Fast AR | ~100M | ~400M |
| Training data | 1M hours | 10M+ hours |
| Languages | 13 | 80+ |
| Voice cloning | Zero-shot (10-15s) | Zero-shot (10-30s) |
| Latency | ~200ms | ~100ms (via SGLang) |
| ELO (TTS Arena) | 1011 | 1339 |
| Multi-speaker | No | Yes (native via <|speaker:i|> tokens) |
Voice cloning mechanism: Fish Speech uses in-context learning. Reference audio is encoded through the same codec pipeline, producing token sequences that are included as context in the autoregressive generation. The Slow AR attends to the reference token patterns to reproduce the speaker’s timbre and style.
# Conceptual: S2 Pro voice cloning
ref_tokens = encode_audio(reference_wav)  # 10-30 seconds
prompt = ref_tokens + text_tokens
primary_tokens = slow_ar.generate(prompt)            # coarse tokens, one per frame
residual_tokens = fast_ar.generate(primary_tokens)   # fine acoustic detail per frame
output_tokens = combine(primary_tokens, residual_tokens)  # hypothetical helper that stacks codebooks
waveform = firefly_gan.decode(output_tokens)
Firefly-GAN vocoder: Fish Speech’s vocoder uses depthwise/dilated convolutions with grouped scalar vector quantization. It achieves near 100% codebook utilization (unlike RVQ codecs which typically leave many codes unused), enabling more expressive and detailed audio output.
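Codebook utilization is simply the share of possible codes that actually occur. The sketch below measures it for a plain per-dimension scalar quantizer; the level count, dimensionality, and sample latents are assumptions for illustration, and this is not the grouped quantizer used in Firefly-GAN.

```python
from itertools import product

def scalar_quantize(vector, levels):
    """Round each dimension (expected in [-1, 1]) to one of `levels` evenly spaced bins."""
    return tuple(
        min(levels - 1, max(0, round((x + 1) / 2 * (levels - 1)))) for x in vector
    )

def utilization(codes, levels, dims):
    """Fraction of the implicit codebook (levels ** dims entries) that appears in `codes`."""
    return len(set(codes)) / (levels ** dims)

LEVELS, DIMS = 4, 2
# Hypothetical latents spread over a regular grid in [-1, 1]^2.
grid = [-1.0 + 2.0 * i / 9 for i in range(10)]
spread_codes = [scalar_quantize(v, LEVELS) for v in product(grid, repeat=DIMS)]
# Hypothetical collapsed latents: every vector lands in the same bin.
collapsed_codes = [scalar_quantize((0.1, 0.1), LEVELS)] * 100

print(f"spread latents:    {utilization(spread_codes, LEVELS, DIMS):.0%}")
print(f"collapsed latents: {utilization(collapsed_codes, LEVELS, DIMS):.0%}")
```

With scalar quantization the codebook is implicit (every combination of per-dimension levels), so well-spread latents reach every code, whereas a learned codebook can leave entries that are never selected.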
Limitations:
- CC-BY-NC-SA license prevents commercial use (V1.5)
- S2 Pro is effectively proprietary (weights available, inference via API)
- Dual-AR adds architectural complexity vs single-model approaches
- S2 Pro requires significant GPU resources at ~4.4B parameters
7. Orpheus (Canopy Labs)
Snapshot: 2025-era model family | License: verify current terms | Scale: large local model
Covered in depth in the Orpheus deep dive, but relevant here for its cloning approach.
Voice cloning mechanism: Orpheus does not use a speaker encoder. Instead, voice cloning is done through fine-tuning the Llama-3.2-3B backbone on target speaker data. The finetuned model has 8 preset voices; custom voices require additional fine-tuning.
Preset voices: tara, leah, jess, leo, dan, mia, zac, zoe
→ Built from fine-tuning on ~50-300 examples per voice
Custom voice cloning:
→ Collect 50-300 audio examples per speaker
→ Fine-tune Llama backbone (standard HuggingFace Trainer)
→ Model learns to associate the speaker's acoustic patterns with text
The pretrained model supports a limited form of in-context conditioning by including text-speech pairs in the prompt, but this is less reliable than dedicated speaker encoder approaches.
Key specs:
- 8 preset voices (English)
- 7 language pairs in multilingual research release
- 3B parameters — requires GPU
- Apache 2.0 license
- Fine-tuning data: 50-300 examples per speaker recommended
8. Qwen3-TTS (Alibaba / Qwen Team)
Snapshot: 2026-era model family | License: verify current terms | Scale: varies by release
Covered in depth in the Qwen3-TTS deep dive. Its voice cloning approach is distinct.
Voice cloning mechanism: Qwen3-TTS uses a learnable speaker encoder trained jointly with the dual-track LM backbone. Reference audio (3 seconds) is encoded through the Qwen-TTS-Tokenizer, and a speaker embedding is extracted and used to condition every generation step.
Reference audio (3s) → Qwen-TTS-Tokenizer → speech codes
→ Learnable speaker encoder → speaker embedding (conditions LM)
Dual-track LM: text + speaker embedding → 12Hz codec codes → causal ConvNet → waveform
Evaluation points:
- Short-reference cloning behavior
- Supported languages
- Streaming latency on target hardware
- Speaker similarity across long passages
- Text-description-based voice design, if available in the release
9. Chatterbox (Resemble AI)
Snapshot: 2025-era model family | License: verify current terms | Scale: medium local model
Covered in the Chatterbox deep dive. Chatterbox takes a three-stage approach to cloning.
Voice cloning mechanism:
Reference audio (5s+)
→ S3 tokenizer → 150 speech tokens (conditioning prompt)
→ CAMPPlus speaker encoder → 256-dim x-vector (speaker embedding)
→ Both condition the T3 Llama backbone
T3 (AR): text + conditioning → S3 tokens
S3Gen (CFM): S3 tokens → mel-spectrogram
HiFi-GAN: mel → waveform
Evaluation points:
- Short-reference cloning behavior
- English and multilingual release differences
- Expressiveness controls available in the current release
- License terms
- Provenance or watermarking support, if available
Comparison Table
| Model | Cloning Method | Ref Audio | Languages | Scale | License | Streaming | Vocoder | Snapshot |
|---|---|---|---|---|---|---|---|---|
| XTTS-v2 | Conditioning (Perceiver) | Short to medium | Multilingual | Large | Verify | Runtime-dependent | HiFi-GAN | 2023 |
| OpenVoice V2 | Conditioning (decoupled) | Short | Multilingual | Smaller | Verify | Runtime-dependent | HiFi-GAN | 2024 |
| ElevenLabs | Conditioning + fine-tune products | Product-dependent | Multilingual | Unknown | Proprietary | Cloud/network dependent | Proprietary | 2023+ |
| CosyVoice | Conditioning (ASR tokens) | Short to medium | Multilingual | Varies | Verify | Runtime-dependent | HiFi-GAN | 2024+ |
| GPT-SoVITS | Fine-tune / few-shot | Short to medium | Multilingual | Multi-component | Verify | Runtime-dependent | VITS/HiFi-GAN | 2024 |
| Fish Speech | In-context (Dual-AR) | Short to medium | Multilingual | Varies | Verify | Runtime-dependent | Firefly-GAN | 2024+ |
| Orpheus | Fine-tune / prompted variants | Release-dependent | English-focused / research multilingual variants | Large | Verify | Runtime-dependent | SNAC | 2025 |
| Qwen3-TTS | Conditioning (speaker encoder) | Short-reference claims to verify | Multilingual releases | Varies | Verify | Runtime-dependent | Release-dependent | 2026 |
| Chatterbox | Conditioning | Short-reference cloning | English-focused / release-dependent | Medium | Verify | Runtime-dependent | HiFi-GAN | 2025 |
Cloning method legend: Conditioning = speaker embedding extracted at inference; Fine-tune = weights updated per speaker; In-context = audio tokens in prompt
Decision Guide: Which Model to Use
| Your Priority | Candidate Models | What to Verify |
|---|---|---|
| Quality | ElevenLabs, Fish Speech, Qwen-family, Chatterbox | Current output quality on your text and voice |
| Language coverage | Fish Speech, CosyVoice, Qwen-family, cloud services | Current model card and target-language pronunciation |
| Small local setup | OpenVoice, lightweight local models | Hardware needs and acceptable quality |
| Expressiveness | Chatterbox, Orpheus-style, cloud services | Supported controls in the exact release |
| Low latency | Cloud realtime products, Qwen-family, optimized local runtimes | Measured latency on target hardware/network |
| Few-shot fine-tuning | GPT-SoVITS, adaptation-based systems | Data requirement, setup complexity, license |
| Commercial use | CosyVoice, OpenVoice, Chatterbox, other permissive releases | Current license and model provenance |
| Voice design | Qwen-family or cloud voice design products | Whether it is cloning, voice design, or both |
| Privacy / offline | OpenVoice, CosyVoice, Chatterbox, XTTS-style local tools | Whether all processing stays local |
| Cross-lingual cloning | OpenVoice, XTTS-style, cloud services | Target-language quality and consent scope |
Common Failure Modes Across All Models
| Failure Mode | Cause | Affected Models |
|---|---|---|
| Timbre drift | Reference too short or noisy | All models |
| Style bleeding | Cloned voice inherits undesired emotion from reference | XTTS-v2, CosyVoice, Chatterbox |
| Cross-lingual accent | Language-specific phone inventory mismatch | XTTS-v2, GPT-SoVITS |
| Repetition loops | AR model gets stuck on tokens | Orpheus, GPT-SoVITS, Fish Speech |
| Robotic prosody | Conditioning vector loses prosodic nuance | OpenVoice (base TTS bottleneck) |
| Poor consistency | Embedding varies between generations | XTTS-v2 (Perceiver instability) |
| Vocoder artifacts | Out-of-distribution acoustic features | All GAN-based vocoders |
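Of the failure modes in the table, repetition loops are the easiest to catch automatically, since they show up as the same short token pattern repeated back-to-back. The detector below is a generic sketch; the n-gram size and repeat threshold are assumed values that would need tuning per codec and frame rate.

```python
def has_repetition_loop(tokens, ngram=4, max_repeats=3):
    """True if any n-gram occurs more than max_repeats times back-to-back."""
    for start in range(len(tokens) - ngram + 1):
        pattern = tokens[start:start + ngram]
        repeats, pos = 1, start + ngram
        # Count consecutive copies of the pattern immediately following it.
        while tokens[pos:pos + ngram] == pattern:
            repeats += 1
            pos += ngram
        if repeats > max_repeats:
            return True
    return False

healthy = [5, 9, 2, 7, 1, 8, 3, 6, 4, 0, 5, 2]
stuck = [5, 9, 2] + [7, 1, 8, 3] * 5
print(has_repetition_loop(healthy), has_repetition_loop(stuck))  # prints: False True
```

A check like this can run on the generated audio tokens before vocoding, so a stuck generation can be retried with a different seed or sampling temperature instead of being decoded.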
Summary
Voice cloning models fall into three technical categories (conditioning, fine-tuning, in-context learning), and the best choice depends on your constraints. The open-source landscape in 2026 offers several useful options, but model cards, licenses, language support, and runtime behavior change quickly. Verify the exact release before using any model commercially.
The proprietary frontier can still be strong on hosted quality and production tooling, while local models keep improving for privacy-sensitive and offline workflows.
