Orpheus TTS is an open-source text-to-speech model from Canopy Labs. It generates speech tokens with a Llama-style language model and decodes them into audio with the SNAC neural audio codec.
Orpheus is larger than lightweight local models such as
Kokoro.
It is also more expressive: the finetuned model supports preset voices and
inline tags such as <laugh>, <chuckle>, and <sigh>.
There are several ways to run Orpheus locally on a Mac. The best option depends on whether you want a short Python script, a graphical model manager, a local API server, or the official upstream implementation.
For architecture details, speech-token decoding, and training notes, read the Orpheus technical guide.
Ways to Run Orpheus TTS Locally on Mac
| Method | Best for | Intel Mac | Apple Silicon Mac |
|---|---|---|---|
| orpheus-cpp | Recommended code-first setup with a small Python API | CPU only | Yes, with Metal |
| LM Studio with a GGUF model | Graphical model management and local server setup | Yes | Yes |
| llama.cpp server with GGUF | Lower-level local server control | Yes | Yes, with Metal |
| Community FastAPI and Web UI wrappers | Browser interfaces and OpenAI-compatible APIs | Check the wrapper | Check the wrapper |
| Official orpheus-speech package | Upstream streaming workflow on CUDA servers | Not recommended | Not recommended |
If you have a MacBook Air, MacBook Pro, or desktop Mac with an M-series chip,
start with orpheus-cpp. Use LM Studio if you prefer a graphical interface for
downloading and serving a quantized GGUF model.
System Requirements
Orpheus can run locally on Apple Silicon Macs with M1, M2, M3, M4, or M5-series chips. That includes MacBook Air and MacBook Pro laptops, plus Mac mini, iMac, Mac Studio, and Mac Pro desktops with Apple Silicon.
Intel Macs can run GGUF models on the CPU, but generation will generally be slower. For a comfortable Apple Silicon setup, start with:
- macOS
- Python 3.10 or newer
- 16 GB of unified memory or more
- Several GB of free disk space for dependencies, caches, and model files
Quantized models reduce memory use. A larger quantization can preserve more quality but may need more memory and take longer to generate speech.
Option 1: orpheus-cpp With Metal Acceleration
orpheus-cpp is a community package that runs Orpheus with
llama-cpp-python.
On Apple Silicon, install a Metal-enabled llama-cpp-python wheel so the
language-model stage can use the Mac GPU.
Create a project and virtual environment:
mkdir orpheus-mac
cd orpheus-mac
python3 -m venv .venv
source .venv/bin/activateInstall the Metal wheel and orpheus-cpp:
pip install llama-cpp-python \
--extra-index-url https://abetlen.github.io/llama-cpp-python/whl/metal
pip install orpheus-cppCreate run_orpheus.py:
from orpheus_cpp import OrpheusCpp
from scipy.io.wavfile import write
orpheus = OrpheusCpp()
sample_rate, samples = orpheus.tts(
"Hello from Orpheus running locally on this Mac.",
options={"voice_id": "tara"},
)
write("orpheus.wav", sample_rate, samples.squeeze())Run it:
python run_orpheus.pyThe first run downloads the required model files. Later runs can reuse the cache. The package also exposes synchronous and streaming generation APIs for applications that need more control.
Use Emotion Tags
Orpheus supports expressive inline tags in the text:
from orpheus_cpp import OrpheusCpp
from scipy.io.wavfile import write
orpheus = OrpheusCpp()
sample_rate, samples = orpheus.tts(
"<chuckle> I did not expect that. <sigh> Let us try again.",
options={"voice_id": "tara"},
)
write("orpheus-expressive.wav", sample_rate, samples.squeeze())Useful tags include <giggle>, <laugh>, <chuckle>, <sigh>, <cough>,
<sniffle>, <groan>, <yawn>, and <gasp>.
Start the orpheus-cpp Browser Demo
Install the optional FastRTC demo dependency:
pip install fastrtcStart the local interface:
python -m orpheus_cppOpen http://localhost:7860 in your browser.
Option 2: LM Studio With a GGUF Model
LM Studio is a graphical application for downloading and serving local language models. Because Orpheus uses a Llama-style backbone, community projects can run a quantized GGUF model through LM Studio and decode its speech tokens with SNAC.
One practical community client is orpheus-tts-local.
- Install LM Studio.
- Download an Orpheus GGUF model from the client README’s current recommendations.
- Load the model in LM Studio.
- Start LM Studio’s local server on
http://127.0.0.1:1234. - Clone the community client and install its dependencies.
- Generate a WAV file through the local server.
git clone https://github.com/isaiahbjork/orpheus-tts-local.git
cd orpheus-tts-local
python3 -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt
python gguf_orpheus.py \
--text "Hello from Orpheus running locally on this Mac." \
--voice tara \
--output orpheus-lm-studio.wavFollow the client’s current README for its interactive prompts and server settings. The client sends the text prompt to LM Studio, extracts the generated speech tokens, decodes them with SNAC, and writes the audio output locally.
The finetuned English model supports these preset voices:
taraleahjessleodanmiazaczoe
LM Studio is useful when you want a graphical model manager and a visible local-server status panel.
Option 3: llama.cpp Server With GGUF
Use llama.cpp when you want the same GGUF model workflow without LM Studio.
This is a lower-level setup for developers who want direct control over the
local server.
Install llama.cpp with
Homebrew:
brew install llama.cppDownload a compatible Orpheus GGUF model by following the current model link
from orpheus-tts-local. Then start the server:
llama-server \
--model ./orpheus-3b-0.1-ft-q4_k_m.gguf \
--host 127.0.0.1 \
--port 1234 \
--ctx-size 8192 \
--n-gpu-layers 99On Apple Silicon, --n-gpu-layers 99 offloads model layers through Metal when
the installed llama.cpp build supports it. Intel Mac users should omit that
flag if GPU offload is unavailable.
The server produces model tokens rather than a WAV file by itself. Use a
client such as orpheus-tts-local to send prompts and decode the returned SNAC
tokens into audio.
Choose a GGUF Quantization
GGUF models are commonly distributed with quantization labels such as Q4,
Q5, and Q8.
| Quantization | Tradeoff |
|---|---|
Q4 |
Lower memory use and a sensible starting point for many Macs |
Q5 |
Middle ground between size and fidelity |
Q8 |
Larger download and memory use with less quantization |
Availability depends on the current model repository. Start with a Q4
conversion if you are unsure.
Option 4: Community FastAPI and Web UI Wrappers
For a browser UI and an OpenAI-compatible local endpoint, review Orpheus-FastAPI.
This project can connect to local backends such as LM Studio or llama.cpp.
It is useful when another local app expects an HTTP speech API or when you want
to test voices in a browser.
Orpheus-FastAPI is a community project rather than the official Canopy Labs runtime. Check its current README, supported backends, installation steps, and API schema before integrating it into an application.
Option 5: Official Python Package
Canopy Labs publishes the official orpheus-speech Python package:
pip install orpheus-speechThe package exposes a streaming API:
from orpheus_tts import OrpheusModel
model = OrpheusModel(
model_name="canopylabs/orpheus-tts-0.1-finetune-prod",
)
syn_tokens = model.generate_speech(
prompt="Hello from the official Orpheus package.",
voice="tara",
)
for audio_chunk in syn_tokens:
# Stream or save each decoded audio chunk.
passThe official workflow uses vLLM and is designed for supported CUDA GPU environments. It is the right reference implementation for GPU servers, but it is not the recommended way to run Orpheus on a Mac.
Why This Guide Does Not Recommend MLX
MLX
is a strong Apple Silicon framework, but the current mlx-audio
documentation does not list Orpheus as a supported TTS model. Use the Metal
acceleration available through llama.cpp-based tools instead.
Verify current upstream support before adopting any new community MLX port.
Which Orpheus Setup Should You Choose?
Use orpheus-cpp if you have an Apple Silicon Mac and want the shortest
code-first path.
Use LM Studio with orpheus-tts-local if you prefer a graphical model manager
and local server controls.
Use llama.cpp with orpheus-tts-local if you want direct control over a GGUF
server.
Use Orpheus-FastAPI if you want a community browser UI or an OpenAI-compatible local endpoint.
Use the official orpheus-speech package for supported CUDA GPU servers, not
as the default Mac setup.
For a lighter model that can run on more Macs, read how to run Kokoro TTS locally on Mac.
Troubleshooting
Generation Is Slow
Orpheus is a 3B-class autoregressive model. Use a quantized GGUF model, enable Metal offloading on Apple Silicon, and keep prompts short while testing.
macOS Runs Out of Memory
Start with a Q4 GGUF model. Close memory-intensive applications and load only
one model at a time. Intel Macs and 8 GB Macs may be a poor fit for Orpheus.
llama-cpp-python Uses Only the CPU
Reinstall the Metal wheel from the llama-cpp-python Metal index. Confirm that
your Mac has an Apple Silicon chip and review the terminal output for Metal
initialization messages.
The llama.cpp Server Does Not Produce WAV Files
This is expected. The language model generates speech tokens. A client such as
orpheus-tts-local must pass them through the SNAC decoder to create audio.
Output Repeats or Becomes Unstable
Try a shorter prompt and review the client’s sampling settings. Orpheus is an autoregressive model, so temperature and repetition-penalty settings can affect stability.
Emotion Tags Sound Inconsistent
Use tags sparingly and test them with the target voice. The model learned these tags as prompt tokens, so their effect can vary with the surrounding text and sampling settings.
Run Orpheus Privately on Your Mac
Orpheus offers expressive local speech generation, but it needs more compute
than compact TTS models. On an Apple Silicon Mac, start with orpheus-cpp or a
quantized GGUF model. Move to LM Studio, llama.cpp, or a local API wrapper
when your workflow needs more control.
If you want a native Mac TTS workflow without maintaining model servers, try Spokio.
