
Local TTS on Apple Silicon: A Swift Developer's Guide to MLX Speech Models

A practical guide to running local TTS models on Apple Silicon Macs using Swift and MLX — kokoro-ios, speech-swift, mlx-audio, with code examples and deployment tradeoffs.

Updated on May 22, 2026. 12 min read.

If you are an Apple developer and you have been on X recently, you have seen the clips: someone generates speech from a local model on their MacBook, no cloud API key, no Python environment. The audio sounds good. The code looks clean. And it is running on Apple Silicon using Swift.

The ecosystem has matured fast. One year ago, local TTS in Swift often meant either wrapping a Python subprocess or using Apple’s built-in AVSpeechSynthesizer. Today, there are several options: kokoro-ios for a simple single-model SPM package, speech-swift for a Swift-native speech toolkit, and mlx-audio for broader model experimentation.

This guide shows you how each one works, with real code and honest tradeoffs.

Why Swift and MLX for TTS?

Apple Silicon’s unified memory architecture gives MLX-based TTS a practical advantage. The CPU and GPU address the same memory pool, so model weights and audio buffers do not have to be copied between host and device memory as they would on a discrete-GPU system.

MLX also targets Apple Silicon directly. You call the model from Swift, the runtime handles local acceleration paths, and the app receives audio data you can play with AVAudioPlayer.

Option 1: kokoro-ios — Focused Swift Integration

kokoro-ios is a single-purpose Swift Package Manager package for Kokoro-82M. It is useful when you want a focused Kokoro integration without managing a larger speech toolkit.

Adding to Your Project

In Xcode: File → Add Package Dependencies → enter https://github.com/mlalma/kokoro-ios

Or add to your Package.swift:

dependencies: [
    .package(url: "https://github.com/mlalma/kokoro-ios", from: "1.0.0")
]
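The dependencies entry alone does not make the module importable; a target also has to list the package's product. A minimal manifest sketch follows. The package name TTSDemo, the platform minimums, and the product name "Kokoro" (inferred from the import statement used in this guide) are all assumptions, so confirm them against the package's own Package.swift.

```swift
// swift-tools-version:5.9
import PackageDescription

let package = Package(
    name: "TTSDemo",                        // hypothetical package name
    platforms: [.macOS(.v14), .iOS(.v17)],  // assumed minimums; check the package README
    dependencies: [
        .package(url: "https://github.com/mlalma/kokoro-ios", from: "1.0.0")
    ],
    targets: [
        .executableTarget(
            name: "TTSDemo",
            dependencies: [
                // Product name is an assumption based on `import Kokoro`;
                // verify it in the kokoro-ios manifest before relying on it.
                .product(name: "Kokoro", package: "kokoro-ios")
            ]
        )
    ]
)
```

If you add the package through Xcode instead, the product picker in the Add Package dialog does the same job as the .product(...) line.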

Generating Speech

import AVFoundation
import Kokoro

let kokoro = Kokoro()

// Load model (downloads on first run, cached afterward)
try await kokoro.loadModel()

// Generate audio
let audioData = try await kokoro.generate(
    "Hello from Swift. This is Kokoro running locally on my Mac.",
    voice: "af_heart"
)

// Play or save
let audioPlayer = try AVAudioPlayer(data: audioData)
audioPlayer.play()

Voice Selection

let voices = ["af_heart", "af_bella", "af_nicole", "am_adam", "am_michael"]

for voice in voices {
    let audio = try await kokoro.generate(
        "Testing voice \(voice).",
        voice: voice
    )
    // Save each voice sample
    try audio.write(to: URL(filePath: "\(voice).wav"))
}
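The loop above writes the returned value straight to .wav files, which presumes the library hands back WAV-encoded Data. If a library instead returns raw samples, you need to add the header yourself before AVAudioPlayer or other tools will accept the file. A minimal sketch, assuming mono Float samples in the range -1 to 1 at 24 kHz (both are assumptions to check against the library you use); wavData is a hypothetical helper, not part of any package:

```swift
import Foundation

/// Wraps mono Float samples in a 16-bit PCM WAV container, in memory.
/// Hypothetical helper; the 24 kHz default is an assumption.
func wavData(from samples: [Float], sampleRate: Int = 24_000) -> Data {
    let dataSize = samples.count * 2          // 2 bytes per 16-bit sample
    var data = Data(capacity: 44 + dataSize)  // canonical header is 44 bytes

    // Appends an integer in little-endian byte order, as WAV requires.
    func append<T: FixedWidthInteger>(_ value: T) {
        withUnsafeBytes(of: value.littleEndian) { data.append(contentsOf: $0) }
    }

    data.append(contentsOf: "RIFF".utf8)
    append(UInt32(36 + dataSize))             // remaining file size after this field
    data.append(contentsOf: "WAVE".utf8)
    data.append(contentsOf: "fmt ".utf8)
    append(UInt32(16))                        // fmt chunk size for PCM
    append(UInt16(1))                         // audio format 1 = linear PCM
    append(UInt16(1))                         // channel count (mono)
    append(UInt32(sampleRate))
    append(UInt32(sampleRate * 2))            // byte rate = rate * channels * 2
    append(UInt16(2))                         // block align = channels * 2
    append(UInt16(16))                        // bits per sample
    data.append(contentsOf: "data".utf8)
    append(UInt32(dataSize))

    for sample in samples {
        // Clamp to the valid range and map NaN to silence before converting.
        let clamped = sample.isNaN ? 0 : max(-1, min(1, sample))
        append(Int16((clamped * 32767).rounded()))
    }
    return data
}
```

Usage would look like `try wavData(from: samples).write(to: outputURL)`, where samples comes from whatever raw-sample API your library exposes.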

Per-Token Timestamps (v1.0.8+)

let result = try await kokoro.generateWithTimestamps(
    "This sentence has word-level timing.",
    voice: "af_heart"
)

for segment in result.segments {
    print("\(segment.startTime): \(segment.text)")
}
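One common use for timestamps is caption output. The sketch below turns timed segments into SubRip (SRT) text. TimedSegment is a hypothetical stand-in for the library's segment type, and the endTime field is an assumption, since the snippet above only shows startTime and text.

```swift
import Foundation

/// Hypothetical stand-in for the library's segment type; field names are assumptions.
struct TimedSegment {
    let text: String
    let startTime: Double  // seconds from the start of the audio
    let endTime: Double    // seconds from the start of the audio
}

/// Formats seconds as the SRT timestamp HH:MM:SS,mmm.
func srtTimestamp(_ seconds: Double) -> String {
    let totalMs = Int((max(0, seconds) * 1000).rounded())
    let ms = totalMs % 1000
    let s = (totalMs / 1000) % 60
    let m = (totalMs / 60_000) % 60
    let h = totalMs / 3_600_000
    return String(format: "%02d:%02d:%02d,%03d", h, m, s, ms)
}

/// Builds SRT text: a 1-based index, a time range, the caption, then a blank line.
func srt(from segments: [TimedSegment]) -> String {
    segments.enumerated().map { index, segment in
        let range = "\(srtTimestamp(segment.startTime)) --> \(srtTimestamp(segment.endTime))"
        return "\(index + 1)\n\(range)\n\(segment.text)\n"
    }.joined(separator: "\n")
}
```

If the library only reports start times, one option is to use the next segment's start as the current segment's end.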

Performance

Performance depends on device, model version, quantization, and whether the model is already cached. Measure first-load time, warm generation speed, memory use, and audio quality on your target devices.

Model weights may download on first launch and cache to the app’s support directory.
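Because cached weights count against your app's storage footprint, it can help to report how much space they take. A rough sketch using only Foundation follows; the "Models" folder name is a hypothetical placeholder, since the actual cache location depends on the library.

```swift
import Foundation

/// Sums the sizes of all regular files under `directory`, including subfolders.
func directorySize(at directory: URL) -> Int64 {
    let fileManager = FileManager.default
    guard let enumerator = fileManager.enumerator(atPath: directory.path) else { return 0 }

    var total: Int64 = 0
    for case let relativePath as String in enumerator {
        let fullPath = directory.appendingPathComponent(relativePath).path
        guard let attributes = try? fileManager.attributesOfItem(atPath: fullPath),
              (attributes[.type] as? FileAttributeType) == .typeRegular,
              let size = attributes[.size] as? NSNumber else { continue }
        total += size.int64Value
    }
    return total
}

/// Hypothetical cache location: Application Support/Models.
/// Check where your library actually stores its weights.
func modelCacheDirectory() -> URL? {
    FileManager.default
        .urls(for: .applicationSupportDirectory, in: .userDomainMask)
        .first?
        .appendingPathComponent("Models", isDirectory: true)
}
```

Pair directorySize(at:) with ByteCountFormatter if you want to show the result in a settings screen.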

When to Use kokoro-ios

Use this when you want a simple TTS integration in a Swift app. One dependency, one model, one focused API. The tradeoff: common Kokoro workflows use preset voices rather than voice cloning, and you are limited to the voices the model ships with.

Option 2: speech-swift — Swift-Native Speech Toolkit

speech-swift by Soniqo is a Swift speech library. It supports multiple TTS models and related speech tasks such as ASR, voice activity detection, speaker diarization, forced alignment, and speech-to-speech workflows.

It is Swift/MLX-based. Models can download from Hugging Face on first use and cache locally.

Adding speech-swift

dependencies: [
    .package(url: "https://github.com/soniqo/speech-swift", from: "0.1.0")
]

TTS with Qwen3-TTS

One TTS model used in speech-swift examples is Qwen3-TTS. Check the current documentation for available model variants, voice-cloning behavior, and language support.

import AVFoundation
import SpeechSwift

let tts = SpeechSwift.TTS()

// Load Qwen3-TTS 0.6B (4-bit quantized)
try await tts.loadModel(.qwen3TTS_0_6B)

// Generate speech
let audio = try await tts.generate(
    "Qwen3-TTS running locally via speech-swift.",
    voice: .default
)

// Play
let player = try AVAudioPlayer(data: audio)
player.play()

Voice Cloning

// Clone voice from a reference audio file
let referenceAudio = try Data(contentsOf: URL(filePath: "reference.wav"))

let clonedSpeech = try await tts.generate(
    "This voice was cloned from a three-second reference.",
    voice: .cloned(from: referenceAudio)
)

Multilingual TTS

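// Locale has no static members such as .en, so build each one from its identifier
let languages = ["en", "ja", "fr", "de", "zh"].map(Locale.init(identifier:))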

for lang in languages {
    let audio = try await tts.generate(
        "Hello in \(lang.identifier).",
        voice: .default,
        language: lang
    )
    // Save language sample
}

Streaming Audio

speech-swift supports streaming generation for real-time applications:

let stream = tts.generateStream(
    "Long text that should start playing as soon as the first chunk is ready."
)

for try await chunk in stream {
    // Play chunks as they arrive
    playAudioChunk(chunk)
}
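The playAudioChunk call above is left undefined. One way to implement it is with AVAudioEngine, which can queue buffers back to back. This is a sketch under stated assumptions: each chunk arrives as mono Float samples at 24 kHz, which may not match what the library actually yields. ChunkPlayer and makeBuffer are hypothetical names; the AVFoundation calls are standard.

```swift
import AVFoundation

/// Copies mono Float samples into a PCM buffer that the engine can schedule.
/// Returns nil for empty input or if the buffer cannot be allocated.
func makeBuffer(from samples: [Float], format: AVAudioFormat) -> AVAudioPCMBuffer? {
    guard !samples.isEmpty,
          let buffer = AVAudioPCMBuffer(pcmFormat: format,
                                        frameCapacity: AVAudioFrameCount(samples.count)),
          let channel = buffer.floatChannelData?[0] else { return nil }
    buffer.frameLength = AVAudioFrameCount(samples.count)
    for (index, sample) in samples.enumerated() {
        channel[index] = sample
    }
    return buffer
}

/// Hypothetical helper that plays chunks in the order they are enqueued.
final class ChunkPlayer {
    private let engine = AVAudioEngine()
    private let node = AVAudioPlayerNode()
    private let format: AVAudioFormat

    /// The 24 kHz mono default is an assumption; match your model's output.
    init?(sampleRate: Double = 24_000) {
        guard let format = AVAudioFormat(standardFormatWithSampleRate: sampleRate,
                                         channels: 1) else { return nil }
        self.format = format
        engine.attach(node)
        engine.connect(node, to: engine.mainMixerNode, format: format)
    }

    /// Starts the engine; call once before enqueueing chunks.
    func start() throws {
        try engine.start()
        node.play()
    }

    /// Schedules a chunk to play after any chunks already queued.
    func enqueue(_ samples: [Float]) {
        guard let buffer = makeBuffer(from: samples, format: format) else { return }
        node.scheduleBuffer(buffer, completionHandler: nil)
    }
}
```

Inside the streaming loop you would call player.enqueue(chunk) after a single player.start(). On iOS you would also need to configure and activate an AVAudioSession first.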

Full-Duplex Speech-to-Speech

speech-swift also includes PersonaPlex 7B, a full-duplex model that can listen and speak simultaneously — useful for voice assistants and conversational AI:

let conversation = SpeechSwift.Conversation(model: .personaPlex7B)

// The model handles both ASR and TTS in a single pipeline
let response = try await conversation.process(audio: userAudio)
// response contains generated speech audio

When to Use speech-swift

speech-swift is worth evaluating when you need more than TTS — voice cloning, multilingual support, streaming, or speech-to-speech. The tradeoff is complexity: more models, more configuration, and larger initial downloads depending on the model.

Option 3: mlx-audio — Broader Model Support

mlx-audio by Blaizzy began as a Python library and now has Swift package work around the same ecosystem. It is useful for experimenting with multiple TTS model architectures such as Kokoro, Qwen3-TTS, CosyVoice, CSM-1B, Fish Speech, Orpheus, and MOSS TTS. Review current documentation for supported models and quantization levels.

Adding mlx-audio Swift

dependencies: [
    .package(url: "https://github.com/Blaizzy/mlx-audio-swift", from: "0.1.0")
]

Generating Speech with Kokoro

import MLXAudio

let tts = try await MLXAudio.TTS(model: .kokoro82M)

let audio = try await tts.generate(
    "MLX Audio running Kokoro on Apple Silicon.",
    voice: "af_heart"
)

Using Multiple Models

// Switch between models at runtime
let models: [MLXAudio.ModelType] = [
    .kokoro82M,
    .qwen3TTS_0_6B,
    .cosyVoice,
    .csm1B,
]

for modelType in models {
    let tts = try await MLXAudio.TTS(model: modelType)
    let audio = try await tts.generate(
        "Testing \(modelType) on Apple Silicon.",
        voice: .default
    )
    try audio.write(to: URL(filePath: "\(modelType).wav"))
}

REST API Mode

mlx-audio also ships an OpenAI-compatible server. This is useful if you want to use the models from a non-Swift app while keeping inference local:

mlx_audio.tts.serve --model mlx-community/Kokoro-82M-bf16

Then from any HTTP client:

curl http://localhost:8000/v1/audio/speech \
  -H "Content-Type: application/json" \
  -d '{
    "model": "kokoro",
    "input": "Hello from the MLX Audio server.",
    "voice": "af_heart"
  }' \
  --output speech.wav
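The same request can be made from Swift with URLSession, for example from a small developer tool that talks to the local server. The sketch below mirrors the curl call's endpoint and JSON fields; SpeechRequest, makeSpeechRequest, and fetchSpeech are hypothetical names, and the response is assumed to be the audio file bytes, as the --output flag above implies.

```swift
import Foundation
#if canImport(FoundationNetworking)
import FoundationNetworking  // URLRequest and URLSession live here on Linux
#endif

/// JSON body matching the fields used in the curl example.
struct SpeechRequest: Codable, Equatable {
    let model: String
    let input: String
    let voice: String
}

/// Builds the POST request; the default base URL matches the curl example.
func makeSpeechRequest(
    _ body: SpeechRequest,
    baseURL: URL = URL(string: "http://localhost:8000")!
) throws -> URLRequest {
    var request = URLRequest(url: baseURL.appendingPathComponent("v1/audio/speech"))
    request.httpMethod = "POST"
    request.setValue("application/json", forHTTPHeaderField: "Content-Type")
    request.httpBody = try JSONEncoder().encode(body)
    return request
}

/// Sends the request and returns the response bytes (assumed to be audio data).
func fetchSpeech(_ body: SpeechRequest) async throws -> Data {
    let (data, response) = try await URLSession.shared.data(for: makeSpeechRequest(body))
    guard let http = response as? HTTPURLResponse, http.statusCode == 200 else {
        throw URLError(.badServerResponse)
    }
    return data
}
```

A sandboxed macOS app needs the outgoing network connections entitlement before it can reach the local server.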

When to Use mlx-audio

Use mlx-audio when you want to experiment with multiple TTS models, or when you need an OpenAI-compatible API for local development. The tradeoff: Swift support and documentation can lag behind Python examples, and model switching requires downloading weights for each architecture.

Performance Considerations on Apple Silicon

Performance depends on device, memory pressure, model size, quantization, cache state, and runtime version. Use this as a measurement checklist rather than a fixed benchmark:

| Library | Model | What to Measure |
|---|---|---|
| kokoro-ios | Kokoro-82M | First load, warm generation, memory, voice quality |
| speech-swift | Qwen3-TTS variants | Reference audio behavior, language support, memory |
| mlx-audio | Kokoro-82M | Runtime setup, warm generation, package stability |
| mlx-audio | Qwen3-TTS variants | Model download size, quantization behavior, memory |
| mlx-audio | CSM / other models | Compatibility, speed, and voice quality |

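If you also record a real-time factor (RTF), state the convention you use. Here RTF = audio duration / wall time, so values above 1.0 mean faster than real time. Some papers and tools report the inverse (wall time / audio duration), where lower is better.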

First-load measurements should separate model download time from model initialization time.
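To make these measurements repeatable, a small helper can time a generation call and derive RTF from the number of samples produced. This is a sketch: GenerationMetrics, metrics, and measureSeconds are hypothetical names, and RTF follows this article's convention of audio duration divided by wall time. The sample count and sample rate have to come from whatever your library reports.

```swift
import Foundation

/// Timing result for one generation call.
struct GenerationMetrics {
    let audioSeconds: Double
    let wallSeconds: Double

    /// Audio duration / wall time. Above 1.0 means faster than real time.
    var realTimeFactor: Double { audioSeconds / wallSeconds }
}

/// Derives metrics from the number of samples generated and the elapsed time.
func metrics(sampleCount: Int, sampleRate: Double, wallSeconds: Double) -> GenerationMetrics {
    GenerationMetrics(audioSeconds: Double(sampleCount) / sampleRate,
                      wallSeconds: wallSeconds)
}

/// Runs async work and returns its result with elapsed seconds on a monotonic clock.
func measureSeconds<T>(_ work: () async throws -> T) async rethrows -> (value: T, seconds: Double) {
    let start = DispatchTime.now().uptimeNanoseconds
    let value = try await work()
    let elapsed = DispatchTime.now().uptimeNanoseconds - start
    return (value, Double(elapsed) / 1_000_000_000)
}
```

Wrap the model load and the generate call in separate measureSeconds calls so that load time, first generation, and warm generation are recorded independently.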

Building a SwiftUI TTS App

Here is a minimal SwiftUI view that ties everything together:

import SwiftUI
import AVFoundation
import Kokoro

struct TTSView: View {
    @State private var text = ""
    @State private var isGenerating = false
    @State private var kokoro: Kokoro?
    @State private var audioPlayer: AVAudioPlayer?

    var body: some View {
        VStack(spacing: 16) {
            TextEditor(text: $text)
                .frame(height: 200)
                .border(.gray.opacity(0.2))

            Button(isGenerating ? "Generating..." : "Generate Speech") {
                generate()
            }
            .disabled(text.isEmpty || isGenerating)
        }
        .padding()
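        .task {
            kokoro = Kokoro()
            do {
                try await kokoro?.loadModel()
            } catch {
                // Surface load failures instead of silently discarding them
                print("Model load failed: \(error)")
            }
        }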
    }

    func generate() {
        guard let kokoro else { return }
        isGenerating = true

        Task {
            do {
                let audio = try await kokoro.generate(text, voice: "af_heart")
                audioPlayer = try AVAudioPlayer(data: audio)
                audioPlayer?.play()
            } catch {
                print("TTS failed: \(error)")
            }
            isGenerating = false
        }
    }
}

With speech-swift, the same pattern can be adapted for voice cloning and multilingual workflows depending on the model you choose.

Choosing Your Approach

| Library | Good Fit | Models | Complexity |
|---|---|---|---|
| kokoro-ios | Simple TTS in any Swift app | Kokoro-82M | Minimal |
| speech-swift | Voice cloning, multilingual, streaming, ASR | Qwen3-TTS, CosyVoice, PersonaPlex 7B | Moderate |
| mlx-audio | Experimenting with many models, API server | Kokoro, Qwen3, CosyVoice, CSM, Orpheus, Fish, MOSS | Moderate |

Going Further

If you want to understand the lower-level details, read the speech-swift documentation for architecture notes and latency guidance. The MLX Swift repository has examples for running custom models if you want to go beyond the packaged libraries.

For Mac users who need local TTS without managing Swift packages, models, or audio pipelines, Spokio provides a native app powered by Chatterbox Turbo. It runs on Apple Silicon and Intel Macs, supports local voice cloning and batch export, exports MP3, WAV, AIFF, and M4A, and does not upload text, audio, or voice samples to cloud services.
