CosyVoice 3 is an open-source text-to-speech model from the FunAudioLLM team. It supports multilingual speech generation, zero-shot voice cloning, cross-lingual synthesis, instruction control, and streaming-oriented workflows.
There are several ways to run CosyVoice 3 locally on a Mac. On an Apple Silicon Mac, start with a native MLX runtime. Use the official Python repository when you need the complete upstream implementation and can accept a more involved setup.
For architecture details, speech tokenization, flow matching, and training notes, read the CosyVoice 3 technical guide.
Ways to Run CosyVoice 3 Locally on Mac
| Method | Best for | Intel Mac | Apple Silicon Mac |
|---|---|---|---|
| speech-swift CLI | Recommended Mac setup, voice cloning, emotion tags, and multilingual generation | No | Yes |
| mlx-audio-plus | Python-friendly MLX command-line inference | No | Yes |
| cosyvoice3.rs | Python bindings with Rust, Candle, and Metal acceleration | CPU only | Yes, with Metal |
| Official FunAudioLLM repository | Complete upstream Python implementation, Web UI, API server, and optional vLLM | Possible but difficult | Possible but difficult |
If you have a MacBook Air, MacBook Pro, or desktop Mac with an M-series chip,
start with speech-swift. Choose mlx-audio-plus if you prefer a Python-based
MLX workflow.
System Requirements
The recommended paths require an Apple Silicon Mac with an M1, M2, M3, M4, or M5-series chip. That includes MacBook Air and MacBook Pro laptops, plus Mac mini, iMac, Mac Studio, and Mac Pro desktops with Apple Silicon.
The native Swift path calls for:
- macOS 15 or newer
- Xcode 16 or newer
- Swift 6 or newer
- 16 GB or more of unified memory (recommended)
- Several GB of free disk space for model files and caches
Intel Macs cannot use MLX. Community CPU runtimes may work, but generation is slower. If you have an Intel Mac or an 8 GB Mac, consider a smaller model such as Kokoro TTS.
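As a rough illustration of the requirements above, the following sketch maps a machine's operating system, CPU architecture, and unified memory to one of the routes named in this guide. The function name and return strings are made up for the example and are not part of any tool mentioned here. `platform.machine()` reports `arm64` on Apple Silicon and `x86_64` on Intel Macs; a Python build running under Rosetta also reports `x86_64`, so treat the result as a hint.

```python
import platform


def suggest_runtime(system: str, machine: str, memory_gb: float) -> str:
    """Map hardware facts to a route from this guide (illustrative only)."""
    if system != "Darwin":
        return "not a Mac"
    if machine != "arm64":
        # Intel Macs cannot use MLX, so the MLX-based routes are out.
        return "Intel Mac: try a CPU runtime or a smaller model such as Kokoro TTS"
    if memory_gb < 16:
        # 16 GB is the recommendation quoted above, not a hard limit.
        return "Apple Silicon below 16 GB: consider a 4-bit model or Kokoro TTS"
    return "Apple Silicon: start with speech-swift"


# memory_gb is supplied by hand; detecting it is left out of this sketch.
print(suggest_runtime(platform.system(), platform.machine(), memory_gb=16))
```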
Option 1: speech-swift CLI With MLX
MLX is Apple’s machine learning framework for Apple Silicon. The speech-swift toolkit runs CosyVoice 3 locally with MLX GPU acceleration. It also uses a Core ML speaker encoder when you provide a reference audio file for voice cloning.
Clone and build the package:
git clone https://github.com/soniqo/speech-swift.git
cd speech-swift
make build
Generate an English WAV file:
.build/release/speech speak "Hello from CosyVoice 3 running locally on this Mac." \
  --engine cosyvoice \
  -o output.wav
The first run downloads the model weights and caches them locally. Later runs reuse the cached files.
Select a Language
CosyVoice 3 supports nine languages: Chinese, English, Japanese, Korean, German, Spanish, French, Italian, and Russian. Select one with --language:
.build/release/speech speak "Guten Tag. CosyVoice läuft lokal auf diesem Mac." \
  --engine cosyvoice \
  --language german \
  -o german.wav
Clone a Voice
Pass a clean reference recording with --voice-sample:
.build/release/speech speak "This line uses a locally cloned voice." \
  --engine cosyvoice \
  --voice-sample reference.wav \
  -o cloned.wav
The runtime uses a CAM++ speaker encoder through Core ML to extract a speaker embedding from the reference audio. Only clone a voice when you have permission to use it.
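If you call the CLI from a script, the flags shown so far can be assembled programmatically. The sketch below is a hypothetical Python helper, not part of speech-swift. It only builds an argument list from the `--engine`, `--language`, `--voice-sample`, and `-o` flags used in this guide, and it assumes the binary sits at `.build/release/speech` as produced by the build step.

```python
import subprocess
from typing import List, Optional

SPEECH_BIN = ".build/release/speech"  # assumed path, taken from the build step


def build_speak_command(
    text: str,
    output: str,
    language: Optional[str] = None,
    voice_sample: Optional[str] = None,
) -> List[str]:
    """Return the argv list for one `speech speak` invocation."""
    cmd = [SPEECH_BIN, "speak", text, "--engine", "cosyvoice"]
    if voice_sample is not None:
        cmd += ["--voice-sample", voice_sample]
    if language is not None:
        cmd += ["--language", language]
    cmd += ["-o", output]
    return cmd


def speak(text: str, output: str, **options: Optional[str]) -> None:
    """Run the CLI; raises CalledProcessError if it exits with an error."""
    subprocess.run(build_speak_command(text, output, **options), check=True)
```

Passing the arguments as a list rather than a single shell string means the text needs no extra quoting, which matters once the input contains quotes or punctuation.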
Clone a Voice Across Languages
Use the reference voice for another supported language:
.build/release/speech speak "Bonjour. Cette voix fonctionne aussi en français." \
  --engine cosyvoice \
  --voice-sample reference.wav \
  --language french \
  -o cloned-french.wav
Use Emotion Tags and Instructions
CosyVoice 3 supports inline style tags:
.build/release/speech speak "(excited) This is excellent news. (calm) Let us review the details." \
  --engine cosyvoice \
  -o expressive.wav
Use a freeform instruction for the whole utterance:
.build/release/speech speak "The presentation begins in five minutes." \
  --engine cosyvoice \
  --cosy-instruct "Speak clearly with a calm, professional tone." \
  -o instructed.wav
Option 2: mlx-audio-plus
mlx-audio-plus is a community MLX package with a documented CosyVoice 3
conversion. It is separate from the main mlx-audio package, so install the
package name shown below.
Create a Python environment:
mkdir cosyvoice3-mlx
cd cosyvoice3-mlx
python3 -m venv .venv
source .venv/bin/activate
pip install mlx-audio-plus
Generate speech:
mlx_audio.tts.generate \
  --model mlx-community/Fun-CosyVoice3-0.5B-2512-4bit \
  --text "Hello from the CosyVoice 3 MLX conversion."
The 4-bit model reduces memory pressure on Apple Silicon. Check the current mlx-audio-plus package documentation for additional options and supported CosyVoice 3 workflows before integrating it into an application.
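To see why a 4-bit conversion helps, a back-of-the-envelope estimate of weight storage is enough. The sketch assumes roughly 0.5 billion parameters, read off the `0.5B` in the model name, and that every weight is stored at the stated bit width. Real checkpoints may keep some components at higher precision and carry quantization metadata, and inference needs further memory for activations and caches, so treat the figures as lower bounds.

```python
def weight_gib(num_params: float, bits_per_weight: int) -> float:
    """Approximate storage for the weights alone, in GiB."""
    return num_params * bits_per_weight / 8 / 2**30


PARAMS = 0.5e9  # assumption: taken from the "0.5B" in the model name

for bits in (16, 8, 4):
    print(f"{bits:>2}-bit weights: ~{weight_gib(PARAMS, bits):.2f} GiB")
```

Under these assumptions the weights shrink from about 0.93 GiB at 16-bit to about 0.23 GiB at 4-bit.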
Option 3: cosyvoice3.rs With Metal
cosyvoice3.rs is a community implementation using Candle, Rust, and Python bindings. It supports CPU, CUDA, and Metal backends without requiring PyTorch.
For an Apple Silicon Mac, download the current Metal wheel from the cosyvoice3.rs releases page and install it:
pip install cosyvoice3-<version>+metal-cp310-abi3-macosx_11_0_arm64.whl
Download the converted weights:
pip install huggingface_hub
hf download spensercai/CosyVoice3-0.5B-Candle \
  --local-dir ./CosyVoice3-0.5B-Candle
Create run_cosyvoice3.py:
import struct
import wave
from cosyvoice3 import CosyVoice3, PyDevice
# Load the converted weights on the Metal backend in half precision.
model = CosyVoice3(
    "./CosyVoice3-0.5B-Candle",
    device=PyDevice("metal"),
    use_f16=True,
)
# prompt_text: instruction prefix, <|endofprompt|>, then the reference transcript.
audio = model.inference_zero_shot(
    text="Hello from CosyVoice 3 running locally on this Mac.",
    prompt_text=(
        "You are a helpful assistant.<|endofprompt|>"
        "This is the exact transcript of the reference audio."
    ),
    prompt_wav="reference.wav",
)
# Scale the samples to 16-bit PCM and clamp them to the valid range.
audio_int16 = [
    int(max(-32768, min(32767, sample * 32767)))
    for sample in audio
]
# Write a mono 16-bit WAV; "<" packs little-endian, the byte order WAV uses.
with wave.open("cosyvoice3-metal.wav", "wb") as wav:
    wav.setnchannels(1)
    wav.setsampwidth(2)
    wav.setframerate(model.sample_rate)
    wav.writeframes(struct.pack(f"<{len(audio_int16)}h", *audio_int16))
Run it:
python run_cosyvoice3.py
The transcript after <|endofprompt|> must accurately match the reference recording. This matters for voice-cloning quality.
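Because the prompt string has a fixed shape, it can help to build it in one place. The helper below is hypothetical and not part of cosyvoice3.rs. It joins an instruction prefix and the reference transcript with the `<|endofprompt|>` marker used in the script, and it rejects an empty transcript or one that already contains the marker.

```python
END_OF_PROMPT = "<|endofprompt|>"  # marker used in run_cosyvoice3.py


def build_prompt_text(
    transcript: str,
    system_prompt: str = "You are a helpful assistant.",
) -> str:
    """Join the instruction prefix and the reference transcript."""
    transcript = transcript.strip()
    if not transcript:
        raise ValueError("the reference transcript must not be empty")
    if END_OF_PROMPT in transcript:
        raise ValueError("the transcript must not contain the marker itself")
    return f"{system_prompt}{END_OF_PROMPT}{transcript}"


prompt_text = build_prompt_text("This is the exact transcript of the reference audio.")
print(prompt_text)
```

The returned string can be passed as `prompt_text` to `inference_zero_shot`. The helper cannot check that the transcript matches the audio; that still has to be verified by ear.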
Cross-Lingual Synthesis
The Candle wrapper also provides a cross-lingual API:
audio = model.inference_cross_lingual(
    text="<|en|>This is cross-lingual speech synthesis.",
    prompt_wav="reference.wav",
)
Instruction-Guided Synthesis
Use the instruction API when you want style or dialect control:
audio = model.inference_instruct(
    text="你好世界",  # "Hello, world"
    # The Chinese part of the instruction means "Please say this in Cantonese."
    instruct_text="You are a helpful assistant. 请用广东话表达。<|endofprompt|>",
    prompt_wav="reference.wav",
)
Option 4: Official FunAudioLLM Python Repository
The official repository provides the complete CosyVoice implementation,
including model downloads, example.py, a local Web UI, API servers, and
optional vLLM integration.
This route is useful when you need to evaluate upstream behavior directly. It is more involved on macOS than the Apple Silicon-focused community runtimes.
Clone the repository:
git clone --recursive https://github.com/FunAudioLLM/CosyVoice.git
cd CosyVoice
Create the documented Python 3.10 environment with Conda:
conda create -n cosyvoice -y python=3.10
conda activate cosyvoice
pip install -r requirements.txt
Download the CosyVoice 3 model from Hugging Face:
from huggingface_hub import snapshot_download
snapshot_download(
    "FunAudioLLM/Fun-CosyVoice3-0.5B-2512",
    local_dir="pretrained_models/Fun-CosyVoice3-0.5B",
)
Run the upstream examples:
python example.py
Start the Official Web UI
The repository includes webui.py. Run it with the CosyVoice 3 model folder:
python webui.py \
  --port 50000 \
  --model_dir pretrained_models/Fun-CosyVoice3-0.5B
Open http://127.0.0.1:50000 in your browser.
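The page may not respond immediately after the command starts. As an optional convenience, this sketch polls the port until something accepts TCP connections. It uses only the Python standard library; the function name and the timeout values are illustrative, and it checks only that the port is open, not that the UI has finished loading.

```python
import socket
import time


def wait_for_port(host: str, port: int, timeout: float = 60.0, interval: float = 0.5) -> bool:
    """Return True once host:port accepts a TCP connection, False on timeout."""
    deadline = time.monotonic() + timeout
    while True:
        try:
            # A successful connect means a server is listening on the port.
            with socket.create_connection((host, port), timeout=interval):
                return True
        except OSError:
            if time.monotonic() >= deadline:
                return False
            time.sleep(interval)
```

Calling `wait_for_port("127.0.0.1", 50000)` from a second terminal or a launcher script returns `True` as soon as the port opens.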
API and vLLM Deployment
The official repository also includes FastAPI and gRPC server code. Its
deployment examples use Docker and NVIDIA GPU settings. Optional vLLM
integration has specific supported-version requirements, so follow the current
upstream README before using those paths.
These deployment instructions are designed primarily for GPU server environments rather than a typical MacBook setup.
Which CosyVoice 3 Setup Should You Choose?
Use speech-swift if you have an Apple Silicon Mac and want the most practical
local setup with voice cloning, language selection, and emotion tags.
Use mlx-audio-plus if you want a Python-friendly MLX command-line workflow.
Use cosyvoice3.rs if you want Python bindings with a Rust/Candle Metal backend
and direct access to zero-shot, cross-lingual, and instructed synthesis APIs.
Use the official FunAudioLLM repository when you need the complete upstream implementation, Web UI, or server deployment examples.
For a smaller model that is easier to run on Intel Macs, read how to run Kokoro TTS locally on Mac.
Troubleshooting
The First Run Takes a Long Time
The first run downloads model files and caches them locally. Allow several GB of free disk space and wait for the initial download to complete.
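A quick way to confirm there is room before the first download is to check free space with the standard library. This is a minimal sketch: the 5 GiB threshold is an assumption standing in for "several GB", and the path should point at the volume where your runtime keeps its cache, which this guide does not specify.

```python
import shutil


def free_gib(path: str = ".") -> float:
    """Free space on the volume holding `path`, in GiB."""
    return shutil.disk_usage(path).free / 2**30


def has_room(path: str = ".", required_gib: float = 5.0) -> bool:
    """True if at least `required_gib` GiB are free on that volume."""
    return free_gib(path) >= required_gib


print(f"Free space: {free_gib():.1f} GiB; enough for a first run: {has_room()}")
```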
macOS Runs Out of Memory
Use a 4-bit MLX model, close memory-intensive applications, and test shorter inputs first. A Mac with 16 GB of unified memory or more is a sensible starting point.
MLX Does Not Work on an Intel Mac
MLX requires Apple Silicon. Intel Macs can experiment with CPU-oriented community runtimes or the official repository, but generation will generally be slower.
Voice Cloning Sounds Wrong
Use a clean reference recording without music, overlapping speakers, or heavy echo. For APIs that require a reference transcript, make sure it matches the recording exactly. Only clone a voice when you have permission.
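Before blaming the model, it can be worth confirming what the reference file actually contains. The sketch below reads basic properties of a WAV file with Python's standard `wave` module, which handles uncompressed PCM only. The function name is made up for the example, and this guide does not state which sample rate, channel count, or duration each runtime expects, so use the output for inspection only.

```python
import wave


def describe_wav(path: str) -> dict:
    """Read basic properties of a PCM WAV file with the standard library."""
    with wave.open(path, "rb") as wav:
        frames = wav.getnframes()
        rate = wav.getframerate()
        return {
            "channels": wav.getnchannels(),
            "sample_rate": rate,
            "bits_per_sample": wav.getsampwidth() * 8,
            "duration_s": frames / rate if rate else 0.0,
        }
```

For example, `describe_wav("reference.wav")` shows at a glance whether the recording is stereo or unexpectedly short. It cannot detect music, echo, or overlapping speakers.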
The Official Python Install Fails on macOS
The upstream dependency stack is designed primarily around Linux and GPU
server environments. Use speech-swift, mlx-audio-plus, or the Metal wheel
from cosyvoice3.rs for a more Mac-focused route.
Run CosyVoice 3 Privately on Your Mac
CosyVoice 3 combines multilingual speech generation, voice cloning, and style
control in a model that can run locally on modern Apple Silicon Macs. Start
with speech-swift, then move to a Python MLX or Candle wrapper when your
workflow needs a different integration layer.
If you want a native Mac TTS workflow without maintaining model environments, try Spokio.
