Voice cloning is the process of generating speech that sounds like a specific person using a short audio sample of their voice. The sample is converted into a speaker embedding — a numerical representation that captures the unique characteristics of that voice.
A speaker encoder network extracts the speaker embedding from the reference audio. The text-to-speech (TTS) model is then conditioned on this embedding alongside the input text, so the generated speech matches the reference voice’s timbre, pitch range, and speaking style.
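The flow from reference audio to a conditioned model input can be illustrated with a toy sketch. Everything here is an assumption made for illustration: a real speaker encoder is a trained neural network, not the mean-pooling stand-in below, and concatenating the embedding onto each text vector is only one of several ways a TTS model can be conditioned. The function names and feature values are hypothetical.

```python
import math

def speaker_embedding(frames):
    """Toy stand-in for a speaker encoder: mean-pool the per-frame
    feature vectors, then scale the result to unit length."""
    dim = len(frames[0])
    mean = [sum(f[i] for f in frames) / len(frames) for i in range(dim)]
    norm = math.sqrt(sum(x * x for x in mean)) or 1.0
    return [x / norm for x in mean]

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors (1.0 = same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

def condition_on_speaker(text_vectors, embedding):
    """Append the same speaker embedding to every text-token vector,
    so each step of the model sees both the text and the voice."""
    return [list(vec) + list(embedding) for vec in text_vectors]

# Hypothetical per-frame acoustic features (made-up numbers).
reference_frames = [[0.9, 0.1, 0.3], [1.1, 0.0, 0.2], [1.0, 0.2, 0.4]]
other_frames = [[0.1, 0.9, 0.8], [0.0, 1.1, 0.7]]

ref = speaker_embedding(reference_frames)
same = speaker_embedding(reference_frames[:2])  # same voice, shorter clip
other = speaker_embedding(other_frames)         # a different voice

# Hypothetical text-token vectors for a two-token input.
text_vectors = [[0.5, 0.5], [0.2, 0.8]]
conditioned = condition_on_speaker(text_vectors, ref)
```

In this sketch, embeddings from the same voice point in nearly the same direction, while a different voice yields a lower cosine similarity. That property is what lets a single fixed-size vector stand in for "who is speaking".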
Few-shot cloning uses 3-10 seconds of audio. Captures the general voice character but may miss finer details.
Multi-sample cloning uses 30-60 seconds of varied audio. Produces a more accurate and stable clone.
Fine-tuned cloning trains or adapts the model on a larger dataset of the target voice. Highest quality but requires more compute and data.
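These tiers can be read as a rough decision rule based on how much audio is available. The sketch below is only one possible reading: the function name and the `has_training_dataset` flag are hypothetical, and the handling of durations the text does not cover (under 3 seconds, between 10 and 30 seconds, over 60 seconds) is an assumption.

```python
def cloning_tier(duration_seconds, has_training_dataset=False):
    """Map the available audio to one of the cloning tiers.

    Assumptions not stated in the text: clips under 3 s are rejected,
    10-30 s clips are treated as few-shot, and anything of 30 s or more
    without a training dataset is treated as multi-sample.
    """
    if has_training_dataset:
        return "fine-tuned"
    if duration_seconds < 3:
        return "too short"
    if duration_seconds < 30:
        return "few-shot"
    return "multi-sample"

tiers = [cloning_tier(d) for d in (2, 5, 45)]
```

Here `tiers` comes out as `["too short", "few-shot", "multi-sample"]`, and passing `has_training_dataset=True` selects the fine-tuned tier regardless of clip length.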
Cloud cloning services upload the voice sample to external servers for processing. The sample, any text sent for generation, and the generated audio exist on infrastructure outside the user’s control.
Local cloning processes everything on the user’s own machine. The sample never leaves the device, and generation runs entirely offline. This matters when the voice sample is sensitive (client work, private recordings) or when the cloned voice is used for confidential content.
A clone is only as good as its sample. Noisy recordings, limited vocal range, or emotional monotony in the source produce corresponding limitations in the clone. Cross-lingual cloning (using a voice from language A to speak language B) is harder than same-language cloning and often produces accented output.
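Because the sample sets the ceiling on quality, it can help to run a basic check before cloning. The following is a minimal sketch that assumes the audio is already decoded into floats in the range [-1, 1]; the function name and the threshold values are hypothetical. It catches only crude signal-level problems (clipping, long silences, very low level) and says nothing about vocal range or emotional variety.

```python
import math

def check_sample(samples, sample_rate, clip_level=0.99, silence_level=0.01):
    """Return simple signal statistics for a mono clip of floats in [-1, 1]."""
    n = len(samples)
    return {
        "duration_seconds": n / sample_rate,
        # Fraction of samples at or near full scale (likely clipped).
        "clipped_fraction": sum(1 for s in samples if abs(s) >= clip_level) / n,
        # Fraction of samples close to zero (silence or very quiet audio).
        "silent_fraction": sum(1 for s in samples if abs(s) < silence_level) / n,
        # Root-mean-square level, a rough measure of loudness.
        "rms": math.sqrt(sum(s * s for s in samples) / n),
    }

# Synthetic test signals: one second of a 220 Hz tone at 16 kHz.
rate = 16000
clean = [0.5 * math.sin(2 * math.pi * 220 * k / rate) for k in range(rate)]
# The same tone overdriven and hard-limited to [-1, 1] to simulate clipping.
overdriven = [max(-1.0, min(1.0, 3 * s)) for s in clean]

clean_report = check_sample(clean, rate)
overdriven_report = check_sample(overdriven, rate)
```

The clean tone reports no clipped samples, while the overdriven version reports that a large share of its samples sit at full scale, which is the kind of recording worth re-capturing before it is used as a reference.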