You have a voice in your head — a specific speaker, a character, a client’s narration style — and you want your text-to-speech output to sound like that voice, not a generic announcer.
This guide shows one practical way to think about that workflow, step by step.
Not a theory discussion. Not a realism debate. A concrete workflow: what you need, which tool to pick, and how to run each one from start to finish.
We cover three common approaches — open-source local tools, cloud services, and local Mac-oriented experiments — so you can choose based on your hardware, privacy requirements, and quality needs.
What You Need to Clone a Voice
Before touching any tool, understand what the model needs from you. The quality of your clone depends almost entirely on the reference audio you provide.
Reference Audio Quality
The model learns the speaker’s voice from your recording. Garbage in, garbage out applies more strictly here than almost any other AI task.
Minimum specs for reference audio:
- Sample rate: 16 kHz or higher (44.1 kHz ideal)
- Format: WAV or high-bitrate MP3 (320 kbps). Avoid Opus, low-bitrate AAC, or any lossy compression under 192 kbps.
- Signal-to-noise ratio: The recording should be clean. Background hum, fan noise, reverb, and echo all get baked into the clone. If the model learns a voice with a subtle room echo, every generation will have that echo.
- Consistent volume: No clipping, no sections where the speaker moves away from the mic. Normalize peaks to between -3 dBFS and -1 dBFS before using the file as reference.
- Single speaker only: If two people are on the recording, the model averages them together. You get a hybrid voice that sounds like neither.
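The sample-rate and peak-level items above can be checked automatically before you hand a file to any tool. The following is a minimal sketch using only the Python standard library; the function names `inspect_reference` and `check_reference` are hypothetical, and it assumes a 16-bit PCM WAV file. It does not measure noise, reverb, or the number of speakers.

```python
import math
import struct
import wave


def inspect_reference(path):
    """Return basic facts about a 16-bit PCM WAV file."""
    with wave.open(path, "rb") as wav:
        channels = wav.getnchannels()
        sample_rate = wav.getframerate()
        sample_width = wav.getsampwidth()
        frame_count = wav.getnframes()
        raw = wav.readframes(frame_count)

    if sample_width != 2:
        raise ValueError("this sketch only handles 16-bit PCM WAV files")

    # Interpret the raw bytes as little-endian signed 16-bit samples.
    samples = struct.unpack(f"<{len(raw) // 2}h", raw)
    peak = max((abs(s) for s in samples), default=0)
    # Peak level relative to full scale (32768 for 16-bit audio).
    peak_dbfs = 20 * math.log10(peak / 32768) if peak else float("-inf")

    return {
        "sample_rate": sample_rate,
        "channels": channels,
        "duration_s": frame_count / sample_rate,
        "peak_dbfs": peak_dbfs,
    }


def check_reference(info, min_rate=16000, max_peak_dbfs=-1.0):
    """Return a list of problems; an empty list means the checks passed."""
    problems = []
    if info["sample_rate"] < min_rate:
        problems.append(
            f"sample rate {info['sample_rate']} Hz is below {min_rate} Hz"
        )
    if info["peak_dbfs"] > max_peak_dbfs:
        problems.append("peak is above -1 dBFS; the file may be clipping")
    return problems
```

Calling `check_reference(inspect_reference("path/to/reference.wav"))` returns an empty list when both checks pass. The thresholds mirror the specs listed above and can be tightened for a particular tool.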
How Much Audio Do You Need?
This is the most common question and the answer depends on the tool:
| Tool | Minimum | Recommended | Sweet Spot |
|---|---|---|---|
| XTTS-v2 | 3 seconds | 10-30 seconds | 30 seconds |
| ElevenLabs Instant | ~1 minute | 3-5 minutes | 5 minutes |
| ElevenLabs Professional | 30 minutes | 1-2 hours | 2 hours |
| Qwen3-TTS (MLX) | 3 seconds | 10-20 seconds | 20 seconds |
The “sweet spot” is where quality starts to plateau for a specific tool. Adding more audio beyond that point may give diminishing returns, but requirements change by model and product.
What Should the Reference Audio Contain?
The content of your recording matters as much as the length.
Good reference audio includes:
- Varied phonemes: The recording should cover the full phonetic range of the target language. A recording about “the weather today” misses many consonant clusters and vowel combinations. Read a passage designed for phonetic coverage — a “rainbow passage” or a script that explicitly covers all phonemes.
- Varied intonation: Statements, questions, exclamations. A monotone reading gives the model no prosody data to work with.
- Natural delivery: The speaker should talk in their natural voice, not a “recording voice.” Most people tighten their throat when they know they are being recorded. The clone inherits that tension.
- Target language: If you plan to generate speech in French, the reference should ideally include French audio. Cross-lingual cloning can work in some tools, but a same-language reference often produces better results.
Bad reference audio includes:
- News broadcast recordings (compressed, processed, often not the speaker’s natural voice)
- Phone call recordings (narrowband, compressed)
- YouTube videos (lossy audio, often mixed with music/effects)
- Recordings with the speaker reading slowly and deliberately (robot-like rhythm gets cloned)
Tools for Preparing Reference Audio
You do not need expensive software. These free tools handle the essentials:
- Audacity — noise reduction, normalization, trimming, format conversion
- FFmpeg — command-line format and sample rate conversion
- iZotope RX (commercial) — advanced noise removal if you work with imperfect source audio
Approach 1: XTTS-v2 Locally with Python (Open-Source, 17 Languages)
XTTS-v2 is an established open-source voice cloning model. It is not the newest, but it has a large community, useful documentation, and multilingual support.
This approach is best for developers who want full control, no recurring costs, and the ability to run everything offline.
What You Need
- A computer with a GPU (Apple Silicon Mac, or any NVIDIA GPU with 4GB+ VRAM)
- Python 3.10 or newer
- 5 GB free disk space for the model
- Your reference audio file (WAV, 16-44.1 kHz, cleaned and normalized)
Step 1: Install Dependencies
pip install TTS torch torchaudio soundfile
On macOS with Apple Silicon, PyTorch can use the MPS backend once you select it as the device. On Linux/Windows, ensure CUDA is available:
python -c "import torch; print(torch.cuda.is_available())"
If it returns False, install the CUDA-enabled PyTorch from pytorch.org.
Step 2: Load the Model and Clone
import torch
from TTS.api import TTS
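# Prefer CUDA, then Apple's MPS backend, then fall back to CPU
if torch.cuda.is_available():
    device = "cuda"
elif torch.backends.mps.is_available():
    device = "mps"
else:
    device = "cpu"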
# Load XTTS-v2
tts = TTS("tts_models/multilingual/multi-dataset/xtts_v2").to(device)
# Clone voice and generate speech
tts.tts_to_file(
    text="This is my cloned voice speaking. It captures the timbre and cadence of the original speaker.",
    file_path="output.wav",
    speaker_wav="path/to/reference.wav",
    language="en",
)
That is the complete workflow. The model loads the reference audio, extracts a speaker embedding, and generates speech conditioned on that embedding.
Step 3: Cross-Lingual Cloning
XTTS-v2 supports generating speech in one language using a reference from another. The voice identity transfers across languages:
# English reference → French output
tts.tts_to_file(
    text="Bonjour, ceci est ma voix clonée parlant français.",
    file_path="output_fr.wav",
    speaker_wav="path/to/english_reference.wav",
    language="fr",
)
Supported languages: English, French, German, Spanish, Italian, Portuguese, Polish, Turkish, Russian, Dutch, Czech, Arabic, Chinese, Japanese, Korean, Hungarian, and Hindi.
Step 4: Batch Generation
For generating many clips from the same voice:
texts = [
"First paragraph of content.",
"Second paragraph with different intonation.",
"Third paragraph to complete the sequence.",
]
for i, text in enumerate(texts):
    tts.tts_to_file(
        text=text,
        file_path=f"output_{i}.wav",
        speaker_wav="path/to/reference.wav",
        language="en",
    )
    print(f"Generated clip {i+1}/{len(texts)}")
Performance Notes
- Apple Silicon / NVIDIA GPU: Performance varies by model version, backend, memory, and thermal limits.
- CPU only: Usable for short experiments, but often slow for long-form generation.
- Memory: Leave headroom for model weights, reference audio, and the rest of the system.
Common XTTS-v2 Pitfalls
- Voice drift: On generations longer than ~30 seconds, the voice can lose consistency. Split long text into shorter segments and concatenate the WAV files.
- Reference too short: At exactly 3 seconds, XTTS-v2 captures the timbre but often produces a “flattened” version of the voice. Use 10-30 seconds for reliable results.
- Language mismatch: If the reference audio language does not match the language parameter, the model may produce accented or unstable output. Always set language to the reference language if staying in that language.
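The voice-drift advice above (split long text into shorter segments) can be sketched as a small helper that groups whole sentences into chunks. This is an illustrative sketch, not part of XTTS-v2: the name `split_into_chunks` and the 250-character default are assumptions, and you would tune the limit until each chunk stays comfortably under the roughly 30 seconds mentioned above.

```python
import re


def split_into_chunks(text, max_chars=250):
    """Group whole sentences into chunks that aim to stay under max_chars.

    A single sentence longer than max_chars is kept intact as its own chunk.
    """
    # Split after sentence-ending punctuation followed by whitespace.
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks, current = [], ""
    for sentence in sentences:
        if not sentence:
            continue
        candidate = f"{current} {sentence}".strip()
        if current and len(candidate) > max_chars:
            # Adding this sentence would overflow, so start a new chunk.
            chunks.append(current)
            current = sentence
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks
```

Each returned chunk can then be passed as the text argument in the batch loop shown earlier, producing one WAV file per chunk.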
Approach 2: ElevenLabs Voice Lab (Cloud, Easiest, Highest Quality)
ElevenLabs remains one of the best-known cloud voice cloning platforms. Its Voice Lab platform handles much of the setup behind the scenes.
This approach is best for non-developers, anyone who wants high quality with less setup, or teams that need cloud-hosted multilingual voice workflows.
What You Need
- An ElevenLabs account
- Reference audio file (MP3 or WAV, clean, 1+ minutes)
- A web browser or their API key for programmatic access
Step 1: Prepare Your Reference Audio
ElevenLabs recommends:
- Instant Voice Cloning: 1-5 minutes of clean, varied speech. The free tier allows one instant clone.
- Professional Voice Cloning: 30+ minutes of studio-quality audio. This costs extra but produces significantly better results, especially for long-form content.
Clean your audio before uploading — ElevenLabs preprocessing handles moderate noise but cannot fix heavy compression or background music.
Step 2: Create the Clone (Dashboard)
- Go to ElevenLabs Voice Lab
- Click “Add Voice” → “Instant Voice Cloning”
- Upload your reference file(s). You can upload multiple files for the same voice.
- Name your voice and confirm the upload
- Wait for processing
- Test the clone with a sample sentence
The clone is immediately available in the Voice Library for generation.
Step 3: Generate Speech (Dashboard)
- Go to the Speech Synthesis tab
- Select your cloned voice from the Voice dropdown
- Enter your text or paste a script
- Adjust stability and similarity sliders:
- Stability (0-100): Lower values produce more expressive, varied speech. Higher values are more consistent but can sound flat or robotic. Start at 50 and adjust based on your content.
- Similarity (0-100): How closely the output matches the reference voice. Higher values are more faithful but may introduce artifacts. Start at 80.
- Click Generate and preview
- Export as MP3 or WAV
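The dashboard sliders above run from 0 to 100, while the API request in the next step passes the same two settings as floats between 0 and 1. The helper below is a small sketch of that conversion; the function name `voice_settings` is hypothetical, and the linear mapping between slider position and API value is an assumption to confirm against the current ElevenLabs documentation.

```python
def voice_settings(stability_pct=50, similarity_pct=80):
    """Convert 0-100 slider values into 0.0-1.0 floats for the API payload."""
    for name, value in (("stability", stability_pct), ("similarity", similarity_pct)):
        if not 0 <= value <= 100:
            raise ValueError(f"{name} must be between 0 and 100, got {value}")
    return {
        "stability": stability_pct / 100,
        "similarity_boost": similarity_pct / 100,
    }
```

The defaults correspond to the suggested starting points of 50 and 80, and the returned dictionary has the same keys as the voice_settings object in the API request.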
Step 4: Generate Speech (API)
For programmatic access, use the ElevenLabs API:
import requests
API_KEY = "your_api_key"
VOICE_ID = "voice_id_from_dashboard"
url = f"https://api.elevenlabs.io/v1/text-to-speech/{VOICE_ID}"
headers = {
    "Accept": "audio/mpeg",
    "Content-Type": "application/json",
    "xi-api-key": API_KEY,
}
data = {
    "text": "This is my cloned voice generated via the ElevenLabs API.",
    "model_id": "eleven_multilingual_v2",
    "voice_settings": {
        "stability": 0.5,
        "similarity_boost": 0.8,
    },
}
response = requests.post(url, json=data, headers=headers)
# Stop on HTTP errors so an error body is never saved as audio
response.raise_for_status()
with open("output.mp3", "wb") as f:
    f.write(response.content)
Pricing
ElevenLabs pricing and included usage can change. Check the current pricing page before committing to a plan, especially if you need high-volume generation or professional voice cloning.
When ElevenLabs Makes Sense
- You need high quality with less technical setup
- You need cloud-hosted multilingual support
- Your generation volume fits the current plan limits and pricing
- You want a polished UI and do not need offline access
When ElevenLabs Does Not Make Sense
- You need privacy — every upload and generation goes through their servers
- You generate large volumes and need predictable costs
- You need offline or on-device processing
- You want to iterate quickly on small text changes (each revision costs API credits and requires network round-trips)
Approach 3: Qwen3-TTS via MLX (Local, 3-Second Cloning, Apple Silicon)
Qwen3-TTS is an open-weight TTS family worth evaluating for local and developer workflows. Exact cloning behavior, quality, licensing, and runtime support should be verified against the current release.
On Apple Silicon Macs, the MLX framework provides native Metal acceleration, making Qwen3-TTS run efficiently on M-series hardware.
This approach is best treated as a developer experiment unless you have verified the current model, runtime, and license for your use case.
What You Need
- Apple Silicon Mac (M1, M2, M3, or M4) with 8GB+ unified memory
- Python 3.11 or newer
- 5 GB free disk space for model weights
- Your reference audio file (WAV, 16 kHz preferred)
Step 1: Install MLX and Qwen3-TTS
pip install mlx mlx-lm soundfile
Clone the Qwen3-TTS repository:
git clone https://github.com/QwenLM/Qwen3-TTS
cd Qwen3-TTS
Step 2: Voice Cloning with MLX
Qwen3-TTS supports voice cloning through a speaker prefix — you provide a short audio sample and the model generates speech matching that voice:
import mlx.core as mx
import soundfile as sf
from qwen3_tts import Qwen3TTS
# Load the 1.7B model (requires ~4GB memory)
model = Qwen3TTS.from_pretrained(
    "Qwen/Qwen3-TTS-12Hz-1.7B-Base",
    dtype=mx.float16,
)
# Clone voice from reference and generate speech
audio = model.generate(
    text="This voice was cloned from a three-second reference using Qwen3-TTS running locally on Mac.",
    speaker_ref="path/to/reference.wav",
    language="en",
)
# Save the output
sf.write("qwen3_clone_output.wav", audio, 24000)
That is the basic shape of the workflow. The exact API and architecture details may change by release, so follow the current project documentation.
Step 3: Streaming Generation
For real-time or low-latency applications, verify whether the current release supports streaming well enough for your latency target:
stream = model.generate_stream(
    text="This is streaming voice cloning output. Each chunk arrives with minimal latency.",
    speaker_ref="path/to/reference.wav",
    language="en",
)
for i, audio_chunk in enumerate(stream):
    sf.write(f"qwen3_stream_chunk_{i}.wav", audio_chunk, 24000)
Step 4: Cross-Lingual Cloning
If the current model release supports cross-lingual cloning, test each target language directly:
audio = model.generate(
    text="Cette voix a été clonée à partir d'un échantillon de trois secondes.",
    speaker_ref="path/to/english_reference.wav",
    language="fr",
)
Supported languages and quality vary by release. Verify the current model card before building around a specific language list.
Performance Notes
- Apple Silicon: Performance depends on model size, memory, backend, and current MLX support
- Memory: Leave enough unified memory for model weights and audio buffers
- Streaming: Measure time-to-first-audio on your target Mac before promising realtime behavior
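Measuring time-to-first-audio, as suggested above, does not depend on any particular model API. The sketch below times how long any iterable takes to yield its first item; `time_to_first_chunk` is a hypothetical helper name. It takes a zero-argument callable so that work done while creating the stream is included in the measurement.

```python
import time


def time_to_first_chunk(make_stream):
    """Return (seconds until the first chunk arrives, that first chunk)."""
    start = time.perf_counter()
    stream = make_stream()        # setup cost counts toward the latency
    first = next(iter(stream))    # blocks until the first chunk is produced
    return time.perf_counter() - start, first
```

With the streaming call shown earlier, the callable would wrap model.generate_stream in a lambda. Run the measurement several times on the target Mac, since the first call often includes model warm-up.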
When Qwen3-TTS (via MLX) Makes Sense
- You have an Apple Silicon Mac and want local voice cloning
- You need fast iteration without API costs or data leaving your machine
- You want to test short-reference cloning claims yourself
- You need streaming or low-latency voice generation
- You have verified the current license terms for your commercial use case
Reference Audio Best Practices
These practices apply regardless of which tool you choose.
Recording Your Own Reference
If you are recording the speaker yourself:
- Use a decent microphone: A USB dynamic mic (Shure MV7, Rode PodMic, Audio-Technica ATR2100x) costs $100-150 and makes a massive difference over built-in laptop mics.
- Record in a treated space: No echo. A closet full of clothes is better than an empty room. Use a portable vocal booth or even a heavy blanket over your head and mic.
- Position the mic correctly: 4-6 inches from the mouth, slightly off-axis to avoid plosives. Use a pop filter.
- Record at 44.1 kHz / 24-bit: WAV format. This gives the model maximum data to work with.
- Record 5-10 minutes of natural speech: Have the speaker talk naturally about a topic they care about. Avoid reading — reading changes vocal delivery.
- Leave room tone: Record 10 seconds of silence in the room where you recorded. Some preprocessing tools use this for noise profiling.
Cleaning Existing Recordings
If you cannot record fresh audio and must use existing recordings:
- Noise reduction: Use Audacity (Effect → Noise Reduction). Sample a section of pure background noise, get the noise profile, then apply reduction at 12-18 dB.
- Remove silence: Trim leading/trailing silence. XTTS-v2 and Qwen3-TTS handle silence fine, but ElevenLabs preprocessing works better when the file starts with speech.
- Normalize volume: Loudness normalize to -16 LUFS (integrated). This ensures consistent volume across multiple reference files.
- Convert to WAV: Use FFmpeg:
ffmpeg -i input.mp3 -ar 44100 -ac 1 output.wav
- Band-pass filter: Remove frequencies below 80 Hz and above 8 kHz if the recording is noisy. This cleans up rumble and hiss without affecting voice quality.
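Several of the cleanup steps above can be combined into a single FFmpeg invocation. The sketch below only builds the argument list; `build_cleanup_command` is a hypothetical helper, and it relies on FFmpeg's `highpass`, `lowpass`, and `loudnorm` audio filters. Single-pass `loudnorm` is approximate, so treat the -16 LUFS target as a goal rather than a guarantee, and note that noise reduction is not covered here.

```python
def build_cleanup_command(src, dst, sample_rate=44100, band_pass=False, lufs=-16):
    """Build an ffmpeg argument list that filters, normalizes, and converts to mono WAV."""
    filters = []
    if band_pass:
        # Keep roughly 80 Hz to 8 kHz, as suggested for noisy recordings.
        filters += ["highpass=f=80", "lowpass=f=8000"]
    # Loudness-normalize toward the integrated LUFS target.
    filters.append(f"loudnorm=I={lufs}")
    return [
        "ffmpeg", "-i", src,
        "-af", ",".join(filters),
        "-ar", str(sample_rate),
        "-ac", "1",
        dst,
    ]
```

Passing the result to `subprocess.run(cmd, check=True)` would execute it, assuming FFmpeg is installed and on the PATH.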
Using Multiple Reference Files
All three tools can work with more than one reference recording for the same voice, though the mechanics differ. This helps when:
- The speaker’s voice varies across recordings (different days, different energy levels)
- You want to capture a wider phonetic range
- You need to average out background noise that varies between files
With XTTS-v2, pass multiple files:
tts.tts_to_file(
    text="Voice from multiple references.",
    file_path="output.wav",
    speaker_wav=["ref1.wav", "ref2.wav", "ref3.wav"],
    language="en",
)
With ElevenLabs, upload multiple files when creating the voice. The platform averages them internally.
With Qwen3-TTS, concatenate multiple references into a single WAV file before passing as speaker_ref.
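Concatenating references into one WAV file can be done with the Python standard library. The sketch below uses a hypothetical helper, `concatenate_wavs`, and assumes every input already shares the same channel count, sample width, and sample rate; convert the files first if they do not.

```python
import wave


def concatenate_wavs(paths, out_path):
    """Join WAV files end to end; all inputs must share the same format."""
    if not paths:
        raise ValueError("no input files given")
    fmt = None
    frames = []
    for path in paths:
        with wave.open(path, "rb") as wav:
            this_fmt = (wav.getnchannels(), wav.getsampwidth(), wav.getframerate())
            if fmt is None:
                fmt = this_fmt
            elif this_fmt != fmt:
                raise ValueError(f"{path} has a different format than {paths[0]}")
            frames.append(wav.readframes(wav.getnframes()))
    # Write all collected frames into a single output file.
    with wave.open(out_path, "wb") as out:
        out.setnchannels(fmt[0])
        out.setsampwidth(fmt[1])
        out.setframerate(fmt[2])
        out.writeframes(b"".join(frames))
    return out_path
```

The same helper also works for joining generated clips when no crossfade is needed.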
Improving Clone Quality
Beyond reference audio quality, these techniques directly improve output quality:
Text Preprocessing
The model generates speech from text. If the text has issues, the speech will too:
- Expand abbreviations: “Dr. Smith went to St. Louis” → “Doctor Smith went to Saint Louis”
- Write numbers as words: “1234” → “one thousand two hundred thirty-four” (unless you want it read as a digit sequence)
- Add punctuation: Commas create brief pauses. Periods create full stops. Question marks raise terminal pitch. Without punctuation, speech runs together.
- Use phonetics for unusual words: Some tools accept SSML or phonetic notation for proper names or brand names that the model mispronounces.
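The first two items above (expanding abbreviations and spelling out numbers) can be sketched in a few lines of Python. This is a simplified illustration, not a full text normalizer: the names `number_to_words` and `normalize_text` are hypothetical, the abbreviation table holds only the two examples from the list, and decimals, years, currency, and ambiguous abbreviations (for example "St." meaning "Street") are not handled.

```python
import re

ONES = [
    "zero", "one", "two", "three", "four", "five", "six", "seven", "eight",
    "nine", "ten", "eleven", "twelve", "thirteen", "fourteen", "fifteen",
    "sixteen", "seventeen", "eighteen", "nineteen",
]
TENS = ["", "", "twenty", "thirty", "forty", "fifty",
        "sixty", "seventy", "eighty", "ninety"]
# Illustrative table only; extend it for your own scripts.
ABBREVIATIONS = {"Dr.": "Doctor", "St.": "Saint"}


def number_to_words(n):
    """Spell out an integer from 0 to 999,999 in English."""
    if not 0 <= n <= 999_999:
        raise ValueError("only 0 to 999,999 is supported in this sketch")
    if n < 20:
        return ONES[n]
    if n < 100:
        tens, ones = divmod(n, 10)
        return TENS[tens] + (f"-{ONES[ones]}" if ones else "")
    if n < 1000:
        hundreds, rest = divmod(n, 100)
        return f"{ONES[hundreds]} hundred" + (f" {number_to_words(rest)}" if rest else "")
    thousands, rest = divmod(n, 1000)
    return f"{number_to_words(thousands)} thousand" + (f" {number_to_words(rest)}" if rest else "")


def normalize_text(text):
    """Expand known abbreviations, then spell out standalone integers."""
    for short, full in ABBREVIATIONS.items():
        text = re.sub(rf"\b{re.escape(short)}", full, text)
    # Only whole numbers of up to six digits are replaced.
    return re.sub(r"\b\d{1,6}\b", lambda m: number_to_words(int(m.group())), text)
```

Running the script text through `normalize_text` before generation keeps the spoken form under your control instead of leaving it to the model's own guess.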
Prompt Engineering for Expressiveness
You can influence delivery by writing the text with natural speech patterns:
- Add filler words (“um,” “well,” “you know”) for casual, conversational delivery
- Use contractions (“I’ll,” “don’t,” “we’re”) instead of full forms (“I will,” “do not,” “we are”)
- Vary sentence length: A series of short, punchy sentences followed by a long, flowing sentence creates natural rhythm
- Write like a person talks: “The data indicates a significant upward trend” → “The numbers are going up, and that is a big deal”
Post-Processing Generated Audio
Even the best clone benefits from light post-processing:
- Trim silence: Remove leading/trailing silence from each clip
- Crossfade segments: If you split long text into chunks, apply 10-20ms crossfades at the join points
- Match volume: Normalize all clips to the same loudness level (-16 LUFS for speech)
- EQ gently: A light high-pass filter at 60 Hz removes subsonic rumble. A tiny 2-3 dB presence boost around 3-4 kHz adds clarity
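The crossfade step above can be illustrated with plain Python lists of float samples. This is a minimal sketch of a linear crossfade under the assumption that both clips share the same sample rate; the function name `crossfade` is hypothetical, and an audio editor or an array library would be faster on long clips.

```python
def crossfade(a, b, sample_rate, fade_ms=15):
    """Join two lists of float samples with a linear crossfade."""
    # Number of overlapping samples, limited by the shorter clip.
    n = min(int(sample_rate * fade_ms / 1000), len(a), len(b))
    if n == 0:
        return list(a) + list(b)
    head, tail = a[:-n], a[-n:]
    mixed = []
    for i in range(n):
        # Weight of the incoming clip rises steadily across the overlap.
        w = (i + 1) / (n + 1)
        mixed.append(tail[i] * (1 - w) + b[i] * w)
    return list(head) + mixed + list(b[n:])
```

Because the two clips overlap during the fade, the joined result is shorter than the two inputs combined by the length of the fade.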
Cross-Lingual Voice Cloning
Cross-lingual cloning means taking a voice reference in language A and generating speech in language B while preserving the speaker’s identity.
How It Works
All three tools extract a speaker representation that is (mostly) language-independent. The representation captures timbre, pitch range, and vocal quality but not language-specific features like accent or phoneme inventory.
When generating in a new language, the model applies the speaker’s voice characteristics to the phonemes of the target language.
Quality Expectations
- XTTS-v2: The strongest cross-lingual performer among open-source options. A voice cloned from English audio speaking French retains the timbre well. Expect a slight accent.
- ElevenLabs: Excellent cross-lingual quality across all 29 languages. The accent bleed is minimal.
- Qwen3-TTS: Good quality within the 10 supported languages. Accent transfer is more noticeable than XTTS-v2 but overall quality is higher for the languages that are supported.
Tips for Better Cross-Lingual Results
- Use a reference in the target language if possible: The model has to guess less about how the speaker would sound in that language.
- Multi-language references: If the speaker speaks multiple languages, provide a reference in each language you plan to generate.
- Accept a slight accent: A voice that speaks all languages with a native accent would sound like different people. Some accent bleed makes the clone sound like the same person speaking a second language — which is often more realistic.
Ethical Checklist for Responsible Voice Cloning
Voice cloning technology can cause real harm when misused. Before you clone any voice, run through this checklist.
Consent
- Do you have explicit permission from the speaker? Not implied, not “they would probably be fine with it.” Written or recorded consent.
- Is the speaker aware of the specific use case? “Clone my voice for a podcast” is not the same as “clone my voice for automated customer service calls.”
- Can consent be revoked? Have a process for deleting the clone and any generated audio if the speaker changes their mind.
- Is the speaker compensated? For commercial use of a voice clone, compensation discussion should happen before cloning, not after.
Disclosure
- Label AI-generated audio: Every piece of generated content should be labeled as AI-generated, either in the content itself (audio watermark, intro disclosure) or in the metadata.
- Never impersonate without full disclosure: Using a cloned voice to pretend to be someone without the audience knowing is fraud, regardless of intent.
- Platform terms matter: Most platforms (YouTube, TikTok, podcast directories) now require disclosure of AI-generated voices. Check the terms before publishing.
Misuse Prevention
- Do not clone public figures without permission: Even parody protections are thin. Celebrity voice clones without consent have led to lawsuits and platform bans.
- Do not clone voices for authentication bypass: Using cloning to bypass banking, phone, or account authentication can be illegal and harmful.
- Do not create harmful content: Defamation, harassment, fraud, and misinformation generated with a cloned voice can create serious legal and safety risk.
- Use watermarks when available: If your tool provides provenance, watermarking, or disclosure features, enable them where appropriate.
Practical Steps to Stay Ethical
- Create a written agreement with the speaker covering use cases, duration, and revocation
- Store clone files and reference audio securely — encrypt them if they leave your machine
- Add disclosure to every generated file’s metadata (e.g., “AI-generated voice, used with permission”)
- Delete clones and reference audio when the project ends
- Never use voice cloning for real-time voice conversion in phone calls or live streams without explicit two-way consent
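The metadata disclosure step above can be scripted. One possible approach, sketched below, uses FFmpeg's `-metadata` option to attach a comment while copying the audio stream unchanged. The helper name `build_disclosure_command` is hypothetical, and whether a comment tag survives depends on the container format and on the platform you publish to, so verify the result on a real file.

```python
def build_disclosure_command(src, dst, note="AI-generated voice, used with permission"):
    """Build an ffmpeg argument list that copies audio and adds a comment tag."""
    return [
        "ffmpeg", "-i", src,
        "-c", "copy",                    # copy the audio stream without re-encoding
        "-metadata", f"comment={note}",  # attach the disclosure text
        dst,
    ]
```

As with other FFmpeg calls, the list can be passed to `subprocess.run(cmd, check=True)` once FFmpeg is installed.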
Comparison Table
| Feature | XTTS-v2 (Local) | ElevenLabs (Cloud) | Qwen3-TTS via MLX (Local) |
|---|---|---|---|
| Quality | Good | Excellent | Very Good |
| Reference time | Short samples can work; more may help | Depends on product mode | Verify current release |
| Languages | Multilingual | Multilingual | Verify current release |
| Cross-lingual | Supported in some workflows | Supported in some workflows | Verify current release |
| Cost | Free software plus hardware/time | Subscription or usage-based | Free software plus hardware/time |
| Privacy | Fully local | Cloud upload required | Fully local |
| Setup | Python, pip | Browser or API | Python, MLX |
| Hardware | GPU (4GB VRAM) | Any device with internet | Apple Silicon Mac |
| License | MPL 2.0 | Proprietary | Apache 2.0 |
| Streaming | Not the usual focus | Yes | Verify current release |
| Best for | Developers, offline, multi-language | Highest quality, non-technical users | Mac users, local privacy, speed |
Which Approach Should You Choose?
Choose XTTS-v2 if: You want fully offline voice cloning, you need 17 languages, you are comfortable with Python, and you are willing to spend time tuning the setup for quality.
Choose ElevenLabs if: Quality and ease are your top priorities, you do not mind cloud processing, and the current pricing fits your budget.
Choose Qwen3-TTS via MLX if: You have an Apple Silicon Mac, you are comfortable with developer tooling, and you have verified the current model behavior and license for your use case.
The Role of a Dedicated Tool: Spokio for Mac
Each approach above has a gap. Open-source tools can require Python expertise. Cloud services process uploads and generations on provider infrastructure. Developer-oriented local models can involve command-line setup.
Spokio fills that gap for Mac users who want local voice cloning without the friction.
Spokio brings local English voice cloning to Mac without requiring Python or a cloud TTS API. It is powered by Chatterbox Turbo, runs locally on Apple Silicon and Intel Macs, supports voice cloning from short samples, background processing, batch export, a queue manager with job history, and MP3/WAV/AIFF/M4A export. It does not upload text, audio, or voice samples to cloud services.
For creators who need private voiceover generation, client drafts, content production, or internal prototyping, Spokio provides a local Mac workflow for generating English speech from text.
Summary
Voice cloning in 2026 is practical, accessible, and increasingly high quality. The choice between tools comes down to your priorities:
- Privacy and control? Go local with XTTS-v2 or Qwen3-TTS via MLX
- Ease and quality? ElevenLabs is the most polished cloud workflow covered here
- Mac local workflow? Spokio packages offline English TTS and local voice cloning into a native Mac app
Start with good reference audio — clean, varied, and representative of how the speaker actually talks. That single investment improves every tool’s output more than any configuration change.
And always clone responsibly. Permission, disclosure, and a clear use case are the baseline for using this technology ethically.
