Tags: cosyvoice 3, mac, apple silicon, mlx, metal, voice cloning, local tts

How to Run CosyVoice 3 Locally on Mac

Learn how to run CosyVoice 3 locally on a Mac with speech-swift, MLX, mlx-audio-plus, Rust Candle bindings, and the official FunAudioLLM Python repository.

Updated on Jun 01, 2026 | 14 min read

CosyVoice 3 is an open-source text-to-speech model from the FunAudioLLM team. It supports multilingual speech generation, zero-shot voice cloning, cross-lingual synthesis, instruction control, and streaming-oriented workflows.

There are several ways to run CosyVoice 3 locally on a Mac. On an Apple Silicon Mac, start with a native MLX runtime. Use the official Python repository when you need the complete upstream implementation and can accept a more involved setup.

For architecture details, speech tokenization, flow matching, and training notes, read the CosyVoice 3 technical guide.

Ways to Run CosyVoice 3 Locally on Mac

| Method | Best for | Intel Mac | Apple Silicon Mac |
| --- | --- | --- | --- |
| speech-swift CLI | Recommended Mac setup, voice cloning, emotion tags, and multilingual generation | No | Yes |
| mlx-audio-plus | Python-friendly MLX command-line inference | No | Yes |
| cosyvoice3.rs | Python bindings with Rust, Candle, and Metal acceleration | CPU only | Yes, with Metal |
| Official FunAudioLLM repository | Complete upstream Python implementation, Web UI, API server, and optional vLLM | Possible but difficult | Possible but difficult |

If you have a MacBook Air, MacBook Pro, or desktop Mac with an M-series chip, start with speech-swift. Choose mlx-audio-plus if you prefer a Python-based MLX workflow.

System Requirements

The recommended paths require an Apple Silicon Mac with an M1, M2, M3, M4, or M5-series chip. That includes MacBook Air and MacBook Pro laptops, plus Mac mini, iMac, Mac Studio, and Mac Pro desktops with Apple Silicon.

For the native Swift path, you need:

  • macOS 15 or newer
  • Xcode 16 or newer
  • Swift 6 or newer
  • 16 GB or more of unified memory (recommended)
  • Several GB of free disk space for model files and caches

Intel Macs cannot use MLX. Community CPU runtimes may work, but generation is slower than with MLX on Apple Silicon. If you have an Intel Mac or an 8 GB Mac, consider a smaller model such as Kokoro TTS.
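The guidance in this section can be sketched as a small decision helper. This is only an illustration: the function name `recommend_runtime` and the returned strings are hypothetical, and the 8 GB cutoff simply mirrors the paragraph above rather than any measured limit.

```python
import platform


def recommend_runtime(system: str, machine: str, memory_gb: float) -> str:
    """Map a machine description to this guide's suggested starting point."""
    if system != "Darwin":
        return "Not a Mac: outside the scope of this guide"
    if machine != "arm64":
        # Intel Macs cannot use MLX.
        return "Intel Mac: consider a smaller model such as Kokoro TTS"
    if memory_gb <= 8:
        return "8 GB Mac: consider a smaller model such as Kokoro TTS"
    return "Apple Silicon: start with speech-swift"


# Describe the current machine; the memory size is passed in by hand here.
print(recommend_runtime(platform.system(), platform.machine(), memory_gb=16))
```

Note that `platform.machine()` reports `x86_64` when Python itself runs under Rosetta, so use a native arm64 Python build for a reliable answer.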

Option 1: speech-swift CLI With MLX

MLX is Apple’s machine learning framework for Apple Silicon. The speech-swift toolkit runs CosyVoice 3 locally with MLX GPU acceleration. It also uses a Core ML speaker encoder when you provide a reference audio file for voice cloning.

Clone and build the package:

git clone https://github.com/soniqo/speech-swift.git
cd speech-swift
make build

Generate an English WAV file:

.build/release/speech speak "Hello from CosyVoice 3 running locally on this Mac." \
  --engine cosyvoice \
  -o output.wav

The first run downloads the model weights and caches them locally. Later runs can reuse the cached files.

Select a Language

CosyVoice 3 supports nine languages: Chinese, English, Japanese, Korean, German, Spanish, French, Italian, and Russian.

.build/release/speech speak "Guten Tag. CosyVoice läuft lokal auf diesem Mac." \
  --engine cosyvoice \
  --language german \
  -o german.wav
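If you need the same kind of output in several languages, a short Python wrapper around the CLI can help. The sketch below uses only the flags shown in this guide (`--engine`, `--language`, `-o`). The helper names are hypothetical, and language values other than `german` and `french` are assumptions to verify against the speech-swift documentation.

```python
import subprocess

# Binary path used in this guide, relative to the speech-swift checkout.
SPEECH_BIN = ".build/release/speech"


def build_speak_command(text, output, language=None):
    """Build the argument list for one `speech speak` invocation."""
    cmd = [SPEECH_BIN, "speak", text, "--engine", "cosyvoice"]
    if language:
        cmd += ["--language", language]
    cmd += ["-o", output]
    return cmd


def generate_all(samples):
    """Run the CLI once per entry; `samples` maps a language name to text."""
    for language, text in samples.items():
        command = build_speak_command(text, f"{language}.wav", language)
        subprocess.run(command, check=True)  # raises if the CLI exits non-zero
```

Calling `generate_all({"german": "Guten Tag.", "french": "Bonjour."})` from the speech-swift directory would write `german.wav` and `french.wav`. Passing the arguments as a list avoids shell quoting problems with punctuation in the text.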

Clone a Voice

Pass a clean reference recording with --voice-sample:

.build/release/speech speak "This line uses a locally cloned voice." \
  --engine cosyvoice \
  --voice-sample reference.wav \
  -o cloned.wav

The runtime uses a CAM++ speaker encoder through Core ML to extract a speaker embedding from the reference audio. Only clone a voice when you have permission to use it.
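Before cloning, it can be worth confirming that the reference file is a readable WAV and seeing how long it is. The sketch below uses Python's standard `wave` module, which reads uncompressed PCM WAV files. The function name is hypothetical, and the guide does not state a required duration, channel count, or sample rate, so the values are reported rather than enforced.

```python
import wave


def describe_reference(path):
    """Return basic properties of a PCM WAV file used as a voice reference."""
    with wave.open(path, "rb") as wav:
        frames = wav.getnframes()
        rate = wav.getframerate()
        return {
            "channels": wav.getnchannels(),
            "sample_rate": rate,
            "duration_s": frames / rate,
        }
```

For example, `describe_reference("reference.wav")` returns a dictionary you can print or log before passing the file to `--voice-sample`.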

Clone a Voice Across Languages

Use the reference voice for another supported language:

.build/release/speech speak "Bonjour. Cette voix fonctionne aussi en français." \
  --engine cosyvoice \
  --voice-sample reference.wav \
  --language french \
  -o cloned-french.wav

Use Emotion Tags and Instructions

CosyVoice 3 supports inline style tags:

.build/release/speech speak "(excited) This is excellent news. (calm) Let us review the details." \
  --engine cosyvoice \
  -o expressive.wav

Use a freeform instruction for the whole utterance:

.build/release/speech speak "The presentation begins in five minutes." \
  --engine cosyvoice \
  --cosy-instruct "Speak clearly with a calm, professional tone." \
  -o instructed.wav
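When the text comes from a script with per-line styles, the tagged string can be assembled programmatically. This is a minimal sketch that follows the `(style) sentence` pattern from the example above. The function name is hypothetical, and only `excited` and `calm` appear in this guide, so treat any other tag name as an assumption to test.

```python
def tagged_text(segments):
    """Join (style, sentence) pairs into one string with inline style tags."""
    parts = []
    for style, sentence in segments:
        sentence = sentence.strip()
        # An empty or None style leaves the sentence untagged.
        parts.append(f"({style}) {sentence}" if style else sentence)
    return " ".join(parts)


script = [
    ("excited", "This is excellent news."),
    ("calm", "Let us review the details."),
]
print(tagged_text(script))
```

The resulting string can be passed as the text argument of `speech speak`.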

Option 2: mlx-audio-plus

mlx-audio-plus is a community MLX package with a documented CosyVoice 3 conversion. It is separate from the main mlx-audio package, so install it under the exact package name shown below.

Create a Python environment:

mkdir cosyvoice3-mlx
cd cosyvoice3-mlx
python3 -m venv .venv
source .venv/bin/activate
pip install mlx-audio-plus

Generate speech:

mlx_audio.tts.generate \
  --model mlx-community/Fun-CosyVoice3-0.5B-2512-4bit \
  --text "Hello from the CosyVoice 3 MLX conversion."

The 4-bit model reduces memory pressure on Apple Silicon. Check the current mlx-audio-plus package documentation for additional options and supported CosyVoice 3 workflows before integrating it into an application.

Option 3: cosyvoice3.rs With Metal

cosyvoice3.rs is a community implementation using Candle, Rust, and Python bindings. It supports CPU, CUDA, and Metal backends without requiring PyTorch.

For an Apple Silicon Mac, download the current Metal wheel from the cosyvoice3.rs releases page and install it:

pip install cosyvoice3-<version>+metal-cp310-abi3-macosx_11_0_arm64.whl

Download the converted weights:

pip install -U huggingface_hub
hf download spensercai/CosyVoice3-0.5B-Candle \
  --local-dir ./CosyVoice3-0.5B-Candle

Create run_cosyvoice3.py:

import struct
import wave

from cosyvoice3 import CosyVoice3, PyDevice

model = CosyVoice3(
    "./CosyVoice3-0.5B-Candle",
    device=PyDevice("metal"),
    use_f16=True,
)

audio = model.inference_zero_shot(
    text="Hello from CosyVoice 3 running locally on this Mac.",
    prompt_text=(
        "You are a helpful assistant.<|endofprompt|>"
        "This is the exact transcript of the reference audio."
    ),
    prompt_wav="reference.wav",
)

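# Convert the samples (assumed to be floats in [-1.0, 1.0]) to clipped 16-bit PCM.
audio_int16 = [
    int(max(-32768, min(32767, sample * 32767)))
    for sample in audio
]

# Write a mono, 16-bit WAV file at the model's sample rate.
with wave.open("cosyvoice3-metal.wav", "wb") as wav:
    wav.setnchannels(1)
    wav.setsampwidth(2)
    wav.setframerate(model.sample_rate)
    # "<" forces little-endian byte order, which WAV PCM data requires.
    wav.writeframes(struct.pack(f"<{len(audio_int16)}h", *audio_int16))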

Run it:

python run_cosyvoice3.py

The transcript after <|endofprompt|> must match the reference recording word for word. A mismatched transcript reduces voice-cloning quality.
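The prompt string is easy to get subtly wrong by hand, so it may help to build it in one place. The sketch below reproduces the layout used in `run_cosyvoice3.py`: a system prompt, the `<|endofprompt|>` marker, then the transcript. The function name is hypothetical, and collapsing stray whitespace in the transcript is an assumption made here, not documented behavior of cosyvoice3.rs.

```python
END_OF_PROMPT = "<|endofprompt|>"


def build_prompt_text(transcript, system_prompt="You are a helpful assistant."):
    """Return the system prompt, the end-of-prompt marker, then the transcript."""
    # Collapse runs of whitespace and trim the ends (an assumption, see above).
    transcript = " ".join(transcript.split())
    if not transcript:
        raise ValueError("reference transcript is empty")
    return f"{system_prompt}{END_OF_PROMPT}{transcript}"


print(build_prompt_text("This is the exact transcript of the reference audio."))
```

The returned string can be passed as `prompt_text` to `inference_zero_shot`.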

Cross-Lingual Synthesis

The Candle wrapper also provides a cross-lingual API:

audio = model.inference_cross_lingual(
    text="<|en|>This is cross-lingual speech synthesis.",
    prompt_wav="reference.wav",
)

Instruction-Guided Synthesis

Use the instruction API when you want style or dialect control:

audio = model.inference_instruct(
    text="你好世界",
    instruct_text="You are a helpful assistant. 请用广东话表达。<|endofprompt|>",
    prompt_wav="reference.wav",
)

Option 4: Official FunAudioLLM Python Repository

The official repository provides the complete CosyVoice implementation, including model downloads, example.py, a local Web UI, API servers, and optional vLLM integration.

This route is useful when you need to evaluate upstream behavior directly. It is more involved on macOS than the Apple Silicon-focused community runtimes.

Clone the repository:

git clone --recursive https://github.com/FunAudioLLM/CosyVoice.git
cd CosyVoice

Create the documented Python 3.10 environment with Conda:

conda create -n cosyvoice -y python=3.10
conda activate cosyvoice
pip install -r requirements.txt

Download the CosyVoice 3 model from Hugging Face:

from huggingface_hub import snapshot_download

snapshot_download(
    "FunAudioLLM/Fun-CosyVoice3-0.5B-2512",
    local_dir="pretrained_models/Fun-CosyVoice3-0.5B",
)
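After the download, a quick check that the folder exists and contains files can save a confusing failure later. This is a minimal sketch using only the standard library. The function name is hypothetical, and it does not know which files the model needs, so it reports counts and total size instead of validating contents.

```python
from pathlib import Path


def model_dir_summary(model_dir):
    """Count the files under a model directory and total their size."""
    root = Path(model_dir)
    if not root.is_dir():
        return {"exists": False, "files": 0, "size_gb": 0.0}
    files = [p for p in root.rglob("*") if p.is_file()]
    total_bytes = sum(p.stat().st_size for p in files)
    return {"exists": True, "files": len(files), "size_gb": total_bytes / 1e9}


print(model_dir_summary("pretrained_models/Fun-CosyVoice3-0.5B"))
```

An `exists` value of `False` or a file count of zero suggests the download did not complete or was saved to a different path.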

Run the upstream examples:

python example.py

Start the Official Web UI

The repository includes webui.py. Run it with the CosyVoice 3 model folder:

python webui.py \
  --port 50000 \
  --model_dir pretrained_models/Fun-CosyVoice3-0.5B

Open http://127.0.0.1:50000 in your browser.
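If the page does not load, one way to tell whether the server is listening at all is a plain TCP connection test. The sketch below uses the standard `socket` module. The function name is hypothetical, and it confirms only that something accepts connections on the port, not that it is the CosyVoice Web UI.

```python
import socket


def port_is_open(host="127.0.0.1", port=50000, timeout=1.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers connection refused, timeouts, and unreachable hosts.
        return False


print("Web UI port reachable:", port_is_open())
```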

API and vLLM Deployment

The official repository also includes FastAPI and gRPC server code. Its deployment examples use Docker and NVIDIA GPU settings. Optional vLLM integration has specific supported-version requirements, so follow the current upstream README before using those paths.

These deployment instructions are designed primarily for GPU server environments rather than a typical MacBook setup.

Which CosyVoice 3 Setup Should You Choose?

Use speech-swift if you have an Apple Silicon Mac and want the most practical local setup with voice cloning, language selection, and emotion tags.

Use mlx-audio-plus if you want a Python-friendly MLX command-line workflow.

Use cosyvoice3.rs if you want Python bindings with a Rust/Candle Metal backend and direct access to zero-shot, cross-lingual, and instructed synthesis APIs.

Use the official FunAudioLLM repository when you need the complete upstream implementation, Web UI, or server deployment examples.

For a smaller model that is easier to run on Intel Macs, read how to run Kokoro TTS locally on Mac.

Troubleshooting

The First Run Takes a Long Time

The first run downloads model files and caches them locally. Allow several GB of free disk space and wait for the initial download to complete.
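To check free disk space before the first run, the standard library is enough. In this sketch the function names are hypothetical and the 5 GB threshold is only a stand-in for "several GB". Point `path` at the volume where your runtime stores its cache, which this guide does not specify.

```python
import shutil


def free_gb(path="."):
    """Free space, in GB, on the volume that contains `path`."""
    return shutil.disk_usage(path).free / 1e9


def has_room(required_gb, path="."):
    """True if the volume has at least `required_gb` GB free."""
    return free_gb(path) >= required_gb


if not has_room(5):
    print(f"Only {free_gb():.1f} GB free; the first download may fail.")
```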

macOS Runs Out of Memory

Use a 4-bit MLX model, close memory-intensive applications, and test shorter inputs first. A Mac with 16 GB of unified memory or more is a sensible starting point.
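One way to test shorter inputs is to split long text at sentence boundaries and generate each piece separately. This is a rough sketch: the function name and the 200-character default are arbitrary assumptions, the sentence splitting is a simple regular expression rather than a real tokenizer, and chunks are joined with spaces, which is less natural for Chinese or Japanese text.

```python
import re

# Split after Latin sentence punctuation followed by whitespace,
# or directly after full-width CJK sentence punctuation.
_SENTENCE_END = re.compile(r"(?<=[.!?])\s+|(?<=[。！？])")


def split_into_chunks(text, max_chars=200):
    """Group whole sentences into chunks of at most max_chars where possible."""
    chunks, current = [], ""
    for sentence in _SENTENCE_END.split(text.strip()):
        if not sentence:
            continue
        candidate = f"{current} {sentence}".strip()
        if current and len(candidate) > max_chars:
            chunks.append(current)
            current = sentence
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks
```

A single sentence longer than `max_chars` is kept whole as its own chunk rather than being cut mid-sentence.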

MLX Does Not Work on an Intel Mac

MLX requires Apple Silicon. Intel Macs can experiment with CPU-oriented community runtimes or the official repository, but generation will generally be slower.

Voice Cloning Sounds Wrong

Use a clean reference recording without music, overlapping speakers, or heavy echo. For APIs that require a reference transcript, make sure it matches the recording exactly. Only clone a voice when you have permission.

The Official Python Install Fails on macOS

The upstream dependency stack is designed primarily around Linux and GPU server environments. Use speech-swift, mlx-audio-plus, or the Metal wheel from cosyvoice3.rs for a more Mac-focused route.

Run CosyVoice 3 Privately on Your Mac

CosyVoice 3 combines multilingual speech generation, voice cloning, and style control in a model that can run locally on modern Apple Silicon Macs. Start with speech-swift, then move to a Python MLX or Candle wrapper when your workflow needs a different integration layer.

If you want a native Mac TTS workflow without maintaining model environments, try Spokio.
