How to Run Orpheus TTS Locally on Mac

Orpheus TTS is an open-source text-to-speech model from Canopy Labs. It generates speech tokens with a Llama-style language model and decodes them into audio with the SNAC neural audio codec.

Orpheus is larger than lightweight local models such as Kokoro. It is also more expressive: the finetuned model supports preset voices and inline tags such as <laugh>, <chuckle>, and <sigh>.

There are several ways to run Orpheus locally on a Mac. The best option depends on whether you want a short Python script, a graphical model manager, a local API server, or the official upstream implementation.

For architecture details, speech-token decoding, and training notes, read the Orpheus technical guide.

Ways to Run Orpheus TTS Locally on Mac

Method	Best for	Intel Mac	Apple Silicon Mac
orpheus-cpp	Recommended code-first setup with a small Python API	CPU only	Yes, with Metal
LM Studio with a GGUF model	Graphical model management and local server setup	Yes	Yes
llama.cpp server with GGUF	Lower-level local server control	Yes	Yes, with Metal
Community FastAPI and Web UI wrappers	Browser interfaces and OpenAI-compatible APIs	Check the wrapper	Check the wrapper
Official orpheus-speech package	Upstream streaming workflow on CUDA servers	Not recommended	Not recommended

If you have a MacBook Air, MacBook Pro, or desktop Mac with an M-series chip, start with orpheus-cpp. Use LM Studio if you prefer a graphical interface for downloading and serving a quantized GGUF model.

System Requirements

Orpheus can run locally on Apple Silicon Macs with M1, M2, M3, M4, or M5-series chips. That includes MacBook Air and MacBook Pro laptops, plus Mac mini, iMac, Mac Studio, and Mac Pro desktops with Apple Silicon.

Intel Macs can run GGUF models on the CPU, but generation will generally be slower. For a comfortable Apple Silicon setup, start with:

macOS
Python 3.10 or newer
16 GB of unified memory or more
Several GB of free disk space for dependencies, caches, and model files

Quantized models reduce memory use. A higher-bit quantization such as Q8 preserves more of the original model's quality, but it needs more memory and usually generates speech more slowly than a lower-bit one such as Q4.

Option 1: orpheus-cpp With Metal Acceleration

orpheus-cpp is a community package that runs Orpheus with llama-cpp-python. On Apple Silicon, install a Metal-enabled llama-cpp-python wheel so the language-model stage can use the Mac GPU.

Create a project and virtual environment:

mkdir orpheus-mac
cd orpheus-mac
python3 -m venv .venv
source .venv/bin/activate

Install the Metal wheel, orpheus-cpp, and SciPy. The example script uses SciPy to write the WAV file:

pip install llama-cpp-python \
  --extra-index-url https://abetlen.github.io/llama-cpp-python/whl/metal

pip install orpheus-cpp scipy

Create run_orpheus.py:

from orpheus_cpp import OrpheusCpp
from scipy.io.wavfile import write

orpheus = OrpheusCpp()

sample_rate, samples = orpheus.tts(
    "Hello from Orpheus running locally on this Mac.",
    options={"voice_id": "tara"},
)

write("orpheus.wav", sample_rate, samples.squeeze())

Run it:

python run_orpheus.py

The first run downloads the required model files. Later runs can reuse the cache. The package also exposes synchronous and streaming generation APIs for applications that need more control.

Use Emotion Tags

Orpheus supports expressive inline tags in the text:

from orpheus_cpp import OrpheusCpp
from scipy.io.wavfile import write

orpheus = OrpheusCpp()

sample_rate, samples = orpheus.tts(
    "<chuckle> I did not expect that. <sigh> Let us try again.",
    options={"voice_id": "tara"},
)

write("orpheus-expressive.wav", sample_rate, samples.squeeze())

Useful tags include <giggle>, <laugh>, <chuckle>, <sigh>, <cough>, <sniffle>, <groan>, <yawn>, and <gasp>.
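A misspelled or unsupported tag is easy to miss because it may simply be read aloud or ignored. The sketch below checks a prompt against the tags listed above before you send it to the model. The helper name find_unknown_tags and the allowlist are illustrative, not part of orpheus-cpp, and the model may accept tags that this guide does not list.

```python
import re

# Tags listed in this guide. Treat this as a conservative allowlist,
# not an official specification of what the model accepts.
KNOWN_TAGS = {
    "giggle", "laugh", "chuckle", "sigh", "cough",
    "sniffle", "groan", "yawn", "gasp",
}

# Matches simple inline tags such as <sigh>; the captured group is the name.
TAG_PATTERN = re.compile(r"<([A-Za-z_]+)>")


def find_unknown_tags(text: str) -> list[str]:
    """Return tag names found in `text` that are not in KNOWN_TAGS, in order."""
    return [name for name in TAG_PATTERN.findall(text) if name not in KNOWN_TAGS]


if __name__ == "__main__":
    prompt = "<chuckle> I did not expect that. <sob> Let us try again."
    unknown = find_unknown_tags(prompt)
    if unknown:
        print("Tags not in the allowlist:", unknown)
```

The comparison is case-sensitive on purpose, since this guide only shows lowercase tags.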

Start the orpheus-cpp Browser Demo

Install the optional FastRTC demo dependency:

pip install fastrtc

Start the local interface:

python -m orpheus_cpp

Open the URL printed in the terminal in your browser. By default this is usually http://localhost:7860.

Option 2: LM Studio With a GGUF Model

LM Studio is a graphical application for downloading and serving local language models. Because Orpheus uses a Llama-style backbone, community projects can run a quantized GGUF model through LM Studio and decode its speech tokens with SNAC.

One practical community client is orpheus-tts-local.

1. Install LM Studio.
2. Download an Orpheus GGUF model from the client README’s current recommendations.
3. Load the model in LM Studio.
4. Start LM Studio’s local server on http://127.0.0.1:1234.
5. Clone the community client and install its dependencies.
6. Generate a WAV file through the local server.

git clone https://github.com/isaiahbjork/orpheus-tts-local.git
cd orpheus-tts-local
python3 -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt
python gguf_orpheus.py \
  --text "Hello from Orpheus running locally on this Mac." \
  --voice tara \
  --output orpheus-lm-studio.wav

Follow the client’s current README for its interactive prompts and server settings. The client sends the text prompt to LM Studio, extracts the generated speech tokens, decodes them with SNAC, and writes the audio output locally.

The finetuned English model supports these preset voices:

tara
leah
jess
leo
dan
mia
zac
zoe

LM Studio is useful when you want a graphical model manager and a visible local-server status panel.
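To see roughly what a client sends to the local server, here is a minimal sketch that builds a request for the OpenAI-style /v1/completions route without sending it. Several details are assumptions for illustration: the "voice: text" prompt shape, the max_tokens and temperature defaults, and the helper name build_completion_request. Real clients may wrap the prompt in model-specific special tokens, so check the client source before relying on this.

```python
import json
import urllib.request

# Preset voices listed in this guide for the finetuned English model.
VOICES = ("tara", "leah", "jess", "leo", "dan", "mia", "zac", "zoe")


def build_completion_request(
    text: str,
    voice: str = "tara",
    base_url: str = "http://127.0.0.1:1234",
    max_tokens: int = 1200,      # hypothetical default, tune for your client
    temperature: float = 0.6,    # hypothetical default, tune for your client
) -> urllib.request.Request:
    """Build (but do not send) a POST request for the completions route."""
    if voice not in VOICES:
        raise ValueError(f"unknown voice {voice!r}; expected one of {VOICES}")
    payload = {
        # Assumption: a plain "voice: text" prompt. Some clients add
        # special tokens around it; verify against the client you use.
        "prompt": f"{voice}: {text}",
        "max_tokens": max_tokens,
        "temperature": temperature,
        "stream": False,
    }
    return urllib.request.Request(
        url=f"{base_url}/v1/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


if __name__ == "__main__":
    request = build_completion_request("Hello from this Mac.", voice="leo")
    print(request.full_url)
    print(request.data.decode("utf-8"))
```

Passing the request to urllib.request.urlopen would send it once the server is running. The response contains generated speech tokens as text, not audio, so it still needs a SNAC decoding step.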

Option 3: llama.cpp Server With GGUF

Use llama.cpp when you want the same GGUF model workflow without LM Studio. This is a lower-level setup for developers who want direct control over the local server.

Install llama.cpp with Homebrew:

brew install llama.cpp

Download a compatible Orpheus GGUF model by following the current model link from orpheus-tts-local. Then start the server:

llama-server \
  --model ./orpheus-3b-0.1-ft-q4_k_m.gguf \
  --host 127.0.0.1 \
  --port 1234 \
  --ctx-size 8192 \
  --n-gpu-layers 99

On Apple Silicon, --n-gpu-layers 99 asks llama.cpp to offload up to 99 layers to the GPU through Metal, which in practice covers the whole model, when the installed llama.cpp build supports it. Intel Mac users should omit that flag if GPU offload is unavailable.

The server produces model tokens rather than a WAV file by itself. Use a client such as orpheus-tts-local to send prompts and decode the returned SNAC tokens into audio.
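The step between the server's text output and the SNAC decoder can be sketched as follows. This is based on how community decoders appear to work and should be checked against the client source. The `<custom_token_N>` text format, the offset of 10, the codebook size of 4096, and the interleaving order are all assumptions here. The only structural fact relied on is that the 24 kHz SNAC codec uses three levels with 1, 2, and 4 codes per frame, for 7 codes in total.

```python
import re

CODES_PER_FRAME = 7    # 1 coarse + 2 mid + 4 fine codes per SNAC frame
CODEBOOK_SIZE = 4096   # assumed size of each per-position code range
TOKEN_OFFSET = 10      # assumed offset applied by community decoders

TOKEN_PATTERN = re.compile(r"<custom_token_(\d+)>")


def extract_codes(generated_text: str) -> list[int]:
    """Map `<custom_token_N>` strings to codec codes, skipping invalid ones."""
    codes: list[int] = []
    for raw in TOKEN_PATTERN.findall(generated_text):
        # Position within the current frame, counted over accepted codes only.
        position = len(codes) % CODES_PER_FRAME
        code = int(raw) - TOKEN_OFFSET - position * CODEBOOK_SIZE
        if 0 <= code < CODEBOOK_SIZE:
            codes.append(code)
    return codes


def split_layers(codes: list[int]) -> tuple[list[int], list[int], list[int]]:
    """Split whole frames into coarse, mid, and fine code lists."""
    usable = len(codes) - len(codes) % CODES_PER_FRAME  # drop a partial frame
    coarse: list[int] = []
    mid: list[int] = []
    fine: list[int] = []
    for start in range(0, usable, CODES_PER_FRAME):
        frame = codes[start:start + CODES_PER_FRAME]
        coarse.append(frame[0])
        mid.extend([frame[1], frame[4]])
        fine.extend([frame[2], frame[3], frame[5], frame[6]])
    return coarse, mid, fine
```

The three lists are what a SNAC decoder would take as input. Running the decoder itself needs the SNAC model weights and is left to the client.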

Choose a GGUF Quantization

GGUF models are commonly distributed with quantization labels such as Q4, Q5, and Q8.

Quantization	Tradeoff
Q4	Lower memory use and a sensible starting point for many Macs
Q5	Middle ground between size and fidelity
Q8	Larger download and memory use with less quantization loss

Availability depends on the current model repository. Start with a Q4 conversion if you are unsure.

Option 4: Community FastAPI and Web UI Wrappers

For a browser UI and an OpenAI-compatible local endpoint, review Orpheus-FastAPI.

This project can connect to local backends such as LM Studio or llama.cpp. It is useful when another local app expects an HTTP speech API or when you want to test voices in a browser.

Orpheus-FastAPI is a community project rather than the official Canopy Labs runtime. Check its current README, supported backends, installation steps, and API schema before integrating it into an application.

Option 5: Official Python Package

Canopy Labs publishes the official orpheus-speech Python package:

pip install orpheus-speech

The package exposes a streaming API:

from orpheus_tts import OrpheusModel

model = OrpheusModel(
    model_name="canopylabs/orpheus-tts-0.1-finetune-prod",
)

syn_tokens = model.generate_speech(
    prompt="Hello from the official Orpheus package.",
    voice="tara",
)

for audio_chunk in syn_tokens:
    # Stream or save each decoded audio chunk.
    pass

The official workflow uses vLLM and is designed for supported CUDA GPU environments. It is the right reference implementation for GPU servers, but it is not the recommended way to run Orpheus on a Mac.
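The loop in the example leaves chunk handling open. One way to save the stream is with Python's standard wave module, as sketched below. This assumes each chunk is raw 16-bit mono PCM bytes at 24 kHz, so confirm the format against the current upstream README. The helper name write_pcm_chunks is illustrative.

```python
import wave
from typing import Iterable


def write_pcm_chunks(
    path: str,
    chunks: Iterable[bytes],
    sample_rate: int = 24000,  # assumed output rate
) -> int:
    """Write raw 16-bit mono PCM chunks to a WAV file; return frames written."""
    frames = 0
    with wave.open(path, "wb") as wav_file:
        wav_file.setnchannels(1)   # mono
        wav_file.setsampwidth(2)   # 2 bytes per sample = 16-bit
        wav_file.setframerate(sample_rate)
        for chunk in chunks:
            wav_file.writeframes(chunk)
            frames += len(chunk) // 2
    return frames


if __name__ == "__main__":
    # Stand-in for a real stream: 0.5 seconds of silence in two chunks.
    silence = [b"\x00\x00" * 6000, b"\x00\x00" * 6000]
    total = write_pcm_chunks("silence.wav", silence)
    print(f"Wrote {total / 24000:.2f} seconds of audio")
```

Because the function accepts any iterable of bytes, the generator returned by a streaming API can be passed in directly, and the file grows as chunks arrive.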

Why This Guide Does Not Recommend MLX

MLX is a strong Apple Silicon framework, but the current mlx-audio documentation does not list Orpheus as a supported TTS model. Use the Metal acceleration available through llama.cpp-based tools instead.

Verify current upstream support before adopting any new community MLX port.

Which Orpheus Setup Should You Choose?

Use orpheus-cpp if you have an Apple Silicon Mac and want the shortest code-first path.

Use LM Studio with orpheus-tts-local if you prefer a graphical model manager and local server controls.

Use llama.cpp with orpheus-tts-local if you want direct control over a GGUF server.

Use Orpheus-FastAPI if you want a community browser UI or an OpenAI-compatible local endpoint.

Use the official orpheus-speech package for supported CUDA GPU servers, not as the default Mac setup.

For a lighter model that can run on more Macs, read how to run Kokoro TTS locally on Mac.

Troubleshooting

Generation Is Slow

Orpheus is a 3B-class autoregressive model. Use a quantized GGUF model, enable Metal offloading on Apple Silicon, and keep prompts short while testing.

macOS Runs Out of Memory

Start with a Q4 GGUF model. Close memory-intensive applications and load only one model at a time. Intel Macs and 8 GB Macs may be a poor fit for Orpheus.

llama-cpp-python Uses Only the CPU

Reinstall the Metal wheel from the llama-cpp-python Metal index. Confirm that your Mac has an Apple Silicon chip and review the terminal output for Metal initialization messages.

The llama.cpp Server Does Not Produce WAV Files

This is expected. The language model generates speech tokens. A client such as orpheus-tts-local must pass them through the SNAC decoder to create audio.

Output Repeats or Becomes Unstable

Try a shorter prompt and review the client’s sampling settings. Orpheus is an autoregressive model, so temperature and repetition-penalty settings can affect stability.

Emotion Tags Sound Inconsistent

Use tags sparingly and test them with the target voice. The model learned these tags as prompt tokens, so their effect can vary with the surrounding text and sampling settings.

Run Orpheus Privately on Your Mac

Orpheus offers expressive local speech generation, but it needs more compute than compact TTS models. On an Apple Silicon Mac, start with orpheus-cpp or a quantized GGUF model. Move to LM Studio, llama.cpp, or a local API wrapper when your workflow needs more control.

If you want a native Mac TTS workflow without maintaining model servers, try Spokio.