Apple's Voice & TTS Stack: A Technical Introduction

A technical introduction to Apple's voice and text-to-speech stack on Mac: AVSpeechSynthesizer API, Personal Voice accessibility speech, Apple Neural Engine architecture, and the MLX framework for running AI models on Apple Silicon.

Updated on May 21, 2026 (10 min read)

Apple’s voice stack spans four distinct layers, each with different capabilities, performance characteristics, and API restrictions. For developers building local TTS apps on Mac — or evaluating which parts of the platform to use — understanding these layers explains what is possible, what is fast, and what Apple does not expose.


The Four Layers

┌─────────────────────────────────────┐
│  AVSpeechSynthesizer                │  System TTS API (highest level)
│  (AVFoundation)                     │
├─────────────────────────────────────┤
│  Personal Voice                     │  On-device accessibility voice
│  (macOS Sonoma+)                    │
├─────────────────────────────────────┤
│  Apple Neural Engine (ANE)          │  Hardware inference engine (NPU)
│  (M1–M4+)                           │
├─────────────────────────────────────┤
│  MLX Framework                      │  ML framework for Apple Silicon
│  (Apple open source)                │  (GPU/CPU, most flexible)
└─────────────────────────────────────┘

Each layer is built on top of or alongside the others. AVSpeechSynthesizer uses system voices through Apple’s opaque system stack, which may use local hardware acceleration internally. MLX targets Apple Silicon GPU and CPU for custom model work. Personal Voice is an on-device accessibility feature managed by the system.


Layer 1: AVSpeechSynthesizer

Apple’s native TTS API, available on macOS, iOS, iPadOS, watchOS, and tvOS. It is the simplest way to produce speech from text — a few lines of code, system voices, no model files to manage.

Basic usage

import AVFoundation

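// Keep a strong reference to the synthesizer (for example, as a property);
// speech stops if it is deallocated mid-utterance.
let synthesizer = AVSpeechSynthesizer()
let utterance = AVSpeechUtterance(string: "Hello, this is a system voice.")
utterance.voice = AVSpeechSynthesisVoice(language: "en-US")
utterance.rate = 0.5             // range 0.0–1.0; 0.5 is the default rate
utterance.pitchMultiplier = 1.0  // range 0.5–2.0
utterance.volume = 1.0           // range 0.0–1.0

synthesizer.speak(utterance)     // asynchronous: returns before speech finishes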

Voice selection

Available voice identifiers cover over 60 languages and regional variants:

// List all voices
let voices = AVSpeechSynthesisVoice.speechVoices()

// Specific voices
let voice = AVSpeechSynthesisVoice(identifier: "com.apple.voice.compact.en-US.Samantha")

Voice categories:

  • Compact voices — Smaller footprint, lower quality, shipped with the OS
  • Enhanced voices — Higher quality, downloaded on demand (~200-500 MB each)
  • Personal Voice — System-managed accessibility voice (see Layer 2)

Enhanced voices use higher-quality system speech models. Apple does not expose enough detail to know exactly which hardware path each voice uses during inference.

Utterance control

AVSpeechUtterance provides basic prosody parameters:

Parameter            Range      Effect
rate                 0.0–1.0    Speaking speed (default 0.5, AVSpeechUtteranceDefaultSpeechRate)
pitchMultiplier      0.5–2.0    Relative pitch shift
volume               0.0–1.0    Output volume
preUtteranceDelay    seconds    Pause before utterance
postUtteranceDelay   seconds    Pause after utterance
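
If settings come from user input or a config file, it can help to clamp them to the ranges in the table before handing them to the speech API. The following is a minimal Python sketch of that idea; `UtteranceSettings`, `clamp`, and `sanitize` are hypothetical names for illustration and are not part of any Apple API.

```python
from dataclasses import dataclass

# Ranges copied from the table above. All names here are hypothetical helpers.
RATE_RANGE = (0.0, 1.0)
PITCH_RANGE = (0.5, 2.0)
VOLUME_RANGE = (0.0, 1.0)


@dataclass(frozen=True)
class UtteranceSettings:
    rate: float = 0.5
    pitch_multiplier: float = 1.0
    volume: float = 1.0


def clamp(value: float, bounds: tuple) -> float:
    """Pull value into the closed interval described by bounds."""
    low, high = bounds
    return max(low, min(high, value))


def sanitize(settings: UtteranceSettings) -> UtteranceSettings:
    """Return a copy with every field inside its documented range."""
    return UtteranceSettings(
        rate=clamp(settings.rate, RATE_RANGE),
        pitch_multiplier=clamp(settings.pitch_multiplier, PITCH_RANGE),
        volume=clamp(settings.volume, VOLUME_RANGE),
    )
```

Clamping silently is a design choice; an app that wants to surface bad input could raise an error instead.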

Callback delegate

AVSpeechSynthesizerDelegate provides lifecycle callbacks:

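// The delegate protocol requires an NSObject subclass.
// Assign an instance to synthesizer.delegate to receive these callbacks.
final class SpeechDelegate: NSObject, AVSpeechSynthesizerDelegate {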
    func speechSynthesizer(_ synthesizer: AVSpeechSynthesizer,
                          didStart utterance: AVSpeechUtterance) { }
    func speechSynthesizer(_ synthesizer: AVSpeechSynthesizer,
                          didFinish utterance: AVSpeechUtterance) { }
    func speechSynthesizer(_ synthesizer: AVSpeechSynthesizer,
                          willSpeakRangeOfSpeechString characterRange: NSRange,
                          utterance: AVSpeechUtterance) {
        // Called for each word as it is spoken
    }
    func speechSynthesizer(_ synthesizer: AVSpeechSynthesizer,
                          didCancel utterance: AVSpeechUtterance) { }
}
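
One detail worth knowing when using willSpeakRangeOfSpeechString for word highlighting: NSRange counts UTF-16 code units, so its offsets do not always match character indices in other languages once emoji or other non-BMP characters appear. As an illustration, here is a small Python sketch that maps such a range back onto a Python string; `slice_utf16` is a hypothetical helper, useful for example when spoken ranges are logged and post-processed outside Swift.

```python
def slice_utf16(text: str, location: int, length: int) -> str:
    """Return the substring covered by a UTF-16 (location, length) range."""
    encoded = text.encode("utf-16-le")  # exactly 2 bytes per UTF-16 code unit
    start = 2 * location
    end = 2 * (location + length)
    return encoded[start:end].decode("utf-16-le")


text = "Say 👋 hello"
# The emoji occupies two UTF-16 code units, so "hello" starts at UTF-16
# offset 7 even though Python indexes it at position 6.
word = slice_utf16(text, location=7, length=5)
```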

SSML support

AVSpeechSynthesizer includes an SSML initializer through AVSpeechUtterance(ssmlRepresentation:), but supported tags and behavior are limited and platform-dependent. Treat SSML as a convenience for simple markup, not a full cloud-style speech control layer.

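let ssml = """
<speak>
  Welcome to <say-as interpret-as="spell-out">TTS</say-as>.
  <break time="500ms"/>
  <prosody rate="slow">This sentence is spoken slowly.</prosody>
</speak>
"""
// The initializer is failable: it returns nil if the SSML cannot be parsed.
if let utterance = AVSpeechUtterance(ssmlRepresentation: ssml) {
    synthesizer.speak(utterance)  // reuses the synthesizer from Basic usage
}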
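
Because SSML is XML, any user-supplied text has to be escaped before it is embedded, or characters such as & and < will break parsing. The sketch below shows one way to assemble and sanity-check a minimal SSML string in Python using only the standard library. `build_ssml` is a hypothetical helper, and which tags a given platform actually honors is a separate question this code does not answer.

```python
import xml.etree.ElementTree as ET
from typing import Optional
from xml.sax.saxutils import escape, quoteattr


def build_ssml(text: str, pause_ms: int = 0, rate: Optional[str] = None) -> str:
    """Wrap plain text in a minimal <speak> document with escaped content."""
    body = escape(text)  # turns &, <, > into XML entities
    if rate is not None:
        body = f"<prosody rate={quoteattr(rate)}>{body}</prosody>"
    if pause_ms > 0:
        body = f'<break time="{int(pause_ms)}ms"/>{body}'
    ssml = f"<speak>{body}</speak>"
    ET.fromstring(ssml)  # raises ParseError if the markup is not well-formed
    return ssml


ssml = build_ssml("Q&A < 5 minutes", pause_ms=500, rate="slow")
```

The well-formedness check only catches XML errors; it does not validate the result against the SSML specification.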

Limitations

  • No custom model loading: you cannot plug your own TTS model into AVSpeechSynthesizer
  • No voice cloning API: Personal Voice is a separate system-managed accessibility feature
  • Limited raw audio access: write(_:toBufferCallback:) delivers synthesized audio buffers instead of playing them, but you only get what the system voice produces
  • No streaming control: you cannot control chunk boundaries or request per-frame audio
  • Quality depends on downloads: enhanced voices must be downloaded first; the compact voices that ship with the OS are lower quality

For apps that need custom voices, custom models, or fine-grained control over how audio is generated, AVSpeechSynthesizer is a dead end. It is the right choice when a system voice is sufficient and minimal code is the priority.


Layer 2: Personal Voice

Introduced in macOS Sonoma (2023) and iOS 17, Personal Voice allows users to create a synthetic voice that sounds like themselves for supported accessibility workflows, running entirely on device.

How it works

Enrollment:

  1. The user records Apple-provided phrases in the system Personal Voice flow
  2. Audio is processed on-device
  3. The resulting voice is stored locally and managed by the operating system

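Inference: Personal Voice is exposed through system accessibility features such as Live Speech. Third-party apps can also ask the user for access with AVSpeechSynthesizer.requestPersonalVoiceAuthorization; if the user grants it, the voice appears in AVSpeechSynthesisVoice.speechVoices() and can be used for playback. Beyond that, it is not a general-purpose voice cloning API: developers cannot enroll, export, or otherwise control the voice.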

Technical architecture

Apple does not publish the full Personal Voice training pipeline. At a high level, the system appears to combine enrollment audio, speaker identity modeling, and an on-device speech synthesis pipeline, but developers should treat the implementation as opaque.

The important practical point is not the internal architecture. It is that Apple manages the model, enrollment, storage, and access boundaries.

API restrictions

  • No programmatic enrollment: the enrollment UI is system-driven; apps cannot trigger or automate it
  • No model export: the resulting voice model is not accessible to third-party apps
  • System-managed voices: availability and limits are controlled by Apple and user settings
  • No API for retraining: you cannot submit additional audio to refine the voice
  • Accessibility-first: designed for users at risk of speech loss (for example, from ALS)

For TTS app developers

Personal Voice is not a platform you can build on. It is a user-facing accessibility feature with tightly restricted APIs. You cannot:

  • Create a Personal Voice programmatically
  • Access the underlying voice model
  • Use a Personal Voice with a custom TTS engine
  • Enroll voices from a recording library

If your app needs voice cloning as a feature, you must implement it yourself using your own model stack — Personal Voice is not extensible.


Layer 3: Apple Neural Engine (ANE)

The ANE is Apple Silicon’s dedicated neural processing unit. Understanding its architecture is essential for performance optimization when running TTS models locally.

Architecture

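Chip      ANE Cores   Peak TOPS (Apple-quoted)   Memory Bandwidth (base chip)
M1        16          11 TOPS                    68 GB/s
M2        16          15.8 TOPS                  100 GB/s
M3        16          18 TOPS                    100 GB/s
M4        16          38 TOPS                    120 GB/s
A17 Pro   16          35 TOPS                    Shared

Apple does not always state the numeric precision behind these TOPS figures, so cross-generation comparisons are approximate.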

Each ANE core contains:

  • Neural processing units — Matrix multiply-accumulate arrays optimized for INT8 and FP16
  • Local memory — Per-core SRAM for activations and weights
  • DMA engines — Direct memory access to system RAM via Unified Memory

How the ANE differs from GPU

Aspect                ANE                                     GPU
Precision             INT8, FP16 (limited FP32)               FP16, FP32 (no FP64 in Metal)
Programming model     Fixed-function graph                    General compute (Metal)
Latency per op        Higher                                  Lower
Throughput per watt   Higher                                  Lower
Best for              Sustained inference                     Training + variable-shape
Supported ops         Convolution, matmul, norm, activation   Any GPU compute

Running TTS models with Apple hardware acceleration

Not all TTS model architectures map cleanly to Apple’s acceleration stack. Core ML may place supported operations on ANE, GPU, or CPU depending on the model and system heuristics. In general, Apple’s acceleration stack is strongest for:

  • Convolutional layers — commonly efficient on fixed-function accelerators
  • Matrix multiplies — Efficient for linear layers (transformer projections)
  • Layer normalization — commonly supported in optimized inference runtimes
  • ReLU, GELU, SiLU activations — Hardware accelerated

Common challenges:

  • Variable-length sequences — fixed-shape graphs are easier to optimize than dynamic text lengths
  • Attention mechanisms — transformer attention often fits GPU execution better than fixed-function accelerators
  • Iterative sampling — flow matching, diffusion, and autoregressive loops require repeated model invocations
  • Non-standard operations — unsupported ops fall back to GPU or CPU depending on the runtime
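
A common workaround for the variable-length problem is shape bucketing: pad each input up to one of a few fixed lengths so that a small set of fixed-shape graphs can be reused. The following Python sketch illustrates the idea under stated assumptions; the bucket sizes and padding id are hypothetical values, not anything prescribed by Core ML.

```python
from bisect import bisect_left

BUCKETS = (32, 64, 128, 256)  # hypothetical fixed sequence lengths
PAD_ID = 0                    # hypothetical padding token id


def pad_to_bucket(token_ids: list) -> tuple:
    """Pad token_ids to the smallest bucket that fits.

    Returns (padded_ids, true_length) so later stages can mask the padding.
    """
    n = len(token_ids)
    index = bisect_left(BUCKETS, n)  # first bucket with size >= n
    if index == len(BUCKETS):
        raise ValueError(f"{n} tokens exceed the largest bucket ({BUCKETS[-1]})")
    size = BUCKETS[index]
    return token_ids + [PAD_ID] * (size - n), n


padded, true_length = pad_to_bucket(list(range(1, 41)))  # 40 tokens
```

The trade-off is wasted compute on padding versus the cost of recompiling or falling back for every new length; fewer, larger buckets waste more compute but need fewer compiled variants.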

Practical implications for TTS

Vocoder acceleration:

  • Convolution-heavy vocoders can benefit from Core ML hardware acceleration
  • Fixed-shape inference is easier to optimize than variable-length generation
  • Actual placement and performance depend on model conversion and runtime decisions

Transformer backbone:

  • Transformer-heavy models often run most practically on GPU
  • Attention implementations, token length, and cache layout matter more than headline TOPS
  • Dynamic text lengths can add overhead in runtimes that prefer fixed shapes
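
A back-of-the-envelope calculation shows why headline TOPS can be misleading for autoregressive decoding. The sketch below assumes a hypothetical 100M-parameter model stored in FP16, one decoding step at batch size 1 (so each weight is read once and used for one multiply-add), and the 15.8 TOPS and 100 GB/s figures from the M2 row of the table above. It is a rough lower-bound estimate, not a benchmark.

```python
def step_time_bounds_ms(params: float, bytes_per_param: float,
                        peak_tops: float, bandwidth_gb_s: float) -> tuple:
    """Lower bounds for one forward step: (compute_ms, memory_ms)."""
    ops = 2.0 * params                        # one multiply and one add per weight
    compute_ms = ops / (peak_tops * 1e12) * 1e3
    weight_bytes = params * bytes_per_param   # every weight is read once
    memory_ms = weight_bytes / (bandwidth_gb_s * 1e9) * 1e3
    return compute_ms, memory_ms


# Hypothetical model size; chip figures taken from the M2 row above.
compute_ms, memory_ms = step_time_bounds_ms(
    params=100e6, bytes_per_param=2, peak_tops=15.8, bandwidth_gb_s=100
)
ratio = memory_ms / compute_ms
```

Under these assumptions the step is limited by memory traffic by roughly two orders of magnitude, which is why cache layout and weight precision tend to matter more than peak compute.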

For TTS apps: Core ML can be valuable when a model converts cleanly to fixed or predictable shapes. For variable-length, variable-shape generation, the GPU is often more practical despite higher power consumption.


Layer 4: MLX Framework

MLX is Apple’s open-source machine learning framework for Apple Silicon, released in December 2023 and actively developed. It is the lowest-level and most flexible layer in Apple’s voice stack.

What makes MLX different

Unified memory model: Unlike PyTorch or TensorFlow, which often require explicit movement between CPU and accelerator memory, MLX is designed around Apple’s Unified Memory. Arrays live in shared memory accessible by CPU and GPU without the same copy-heavy workflow.

# PyTorch: explicit device management
import torch
tensor = torch.ones(3)
tensor = tensor.to("mps")  # copies to GPU
tensor = tensor.to("cpu")  # copies back

# MLX: no device management
import mlx.core as mx
array = mx.array([1, 2, 3])  # lives in unified memory, usable from CPU and GPU

Lazy evaluation: Operations build a computation graph that is evaluated only when a result is needed (for example, when you call mx.eval or print an array). This lets MLX skip work whose result is never used and schedule the graph efficiently on the selected device.
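
To make the idea concrete, here is a toy lazy-evaluation graph in plain Python. It is only an illustration of the concept and does not reflect how MLX is implemented; the `Lazy` class and `leaf` helper are hypothetical.

```python
class Lazy:
    """A deferred value: records how to compute itself, runs only on eval()."""

    def __init__(self, fn, *inputs):
        self.fn = fn
        self.inputs = inputs
        self._value = None
        self._done = False

    def __add__(self, other):
        return Lazy(lambda a, b: a + b, self, other)

    def __mul__(self, other):
        return Lazy(lambda a, b: a * b, self, other)

    def eval(self):
        if not self._done:  # compute once, then reuse the cached result
            args = [node.eval() for node in self.inputs]
            self._value = self.fn(*args)
            self._done = True
        return self._value


calls = []  # records when leaf values are actually produced


def leaf(value):
    """A graph input whose evaluation is logged in calls."""
    return Lazy(lambda: calls.append(value) or value)


x = leaf(3)
y = (x + leaf(4)) * x  # builds a graph; no arithmetic has run yet
pending = len(calls)   # still 0 at this point
result = y.eval()      # walks the graph; x is computed once and cached
```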

NumPy-compatible API: MLX’s array API mirrors NumPy, reducing the learning curve for Python developers.

import mlx.core as mx

# Create and operate on arrays
a = mx.ones((4, 256))
b = mx.random.normal((256, 128))
c = a @ b   # matrix multiply, recorded lazily on the default device
mx.eval(c)  # forces the computation to run

Running TTS models with MLX

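Many open TTS models have been ported to MLX and can be run directly. The flow looks roughly like this:

import mlx.core as mx

# Conceptual sketch: load_tts_model, tokenize, and vocoder are placeholders
# for whatever your model port provides. They are not MLX functions.
model = load_tts_model("kokoro-82m-mlx")
text_tokens = mx.array(tokenize("Hello world"))

# Generate audio features on the GPU stream
with mx.stream(mx.gpu):
    mel = model.generate(text_tokens)

# Mel → waveform via MLX vocoder
waveform = vocoder.decode(mel)
mx.eval(waveform)  # MLX is lazy; force the computation before using the audio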

Model conversion:

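  • Hugging Face language models can be converted with mlx_lm.convert; speech models usually need a model-specific port or conversion script
  • ONNX import is not part of MLX itself and depends on community tooling
  • Compiled Core ML models cannot be loaded into MLX; convert from the original weights instead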

MLX vs CoreML

Aspect           MLX                                       CoreML
Open source      Yes (MIT)                                 No
Apple Silicon    Native (GPU + CPU)                        Native (ANE + GPU + CPU via Core ML runtime)
Model import     Safetensors, NPZ, or GGUF weights         PyTorch, TensorFlow (via coremltools)
                 plus model code ported to MLX
Dynamic shapes   Supported (streaming-friendly)            Limited (prefer fixed shapes)
Training         Yes                                       Limited (on-device fine-tuning)
Ecosystem        Growing community models                  Apple’s model zoo
ANE access       Not a public MLX execution target         Indirect through Core ML runtime

Why MLX matters for local TTS

  1. Fewer memory copies: TTS pipelines that pass data between stages (encoder → decoder → vocoder) avoid much of the CPU↔GPU transfer overhead that PyTorch MPS can incur
  2. Streaming-friendly: MLX’s lazy evaluation and dynamic shape support make it practical for chunk-by-chunk TTS generation
  3. Apple Silicon optimization: MLX gives developers direct, practical access to GPU acceleration with a Python-first workflow
  4. Growing model library: community ports of speech and audio models exist in MLX format
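
A small example of running both stages on the GPU stream (the layers are arbitrary stand-ins, not a real TTS model):

import mlx.core as mx
import mlx.nn as nn

attention_layer = nn.MultiHeadAttention(dims=256, num_heads=4)
conv_decoder = nn.Conv1d(in_channels=256, out_channels=1, kernel_size=3, padding=1)
hidden_states = mx.random.normal((1, 50, 256))  # (batch, time, channels)

# Transformer-heavy stages run on the Apple Silicon GPU
with mx.stream(mx.gpu):
    attention_output = attention_layer(hidden_states, hidden_states, hidden_states)
    vocoder_output = conv_decoder(attention_output)
mx.eval(vocoder_output)  # MLX is lazy; this triggers the actual computation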
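
Chunk-by-chunk generation, as mentioned in the list above, usually starts with splitting the input text at sentence boundaries so each chunk can be synthesized and played while the next one is generated. Here is a minimal Python sketch of such a splitter. The regular expression is a simple assumption that will mishandle abbreviations like "e.g.", and `chunk_text` is a hypothetical helper name.

```python
import re
from typing import Iterator

# Split on whitespace that follows sentence-ending punctuation.
_SENTENCE_END = re.compile(r"(?<=[.!?])\s+")


def chunk_text(text: str, max_chars: int = 200) -> Iterator[str]:
    """Yield chunks of whole sentences, each at most max_chars when possible.

    A single sentence longer than max_chars is yielded on its own, unsplit.
    """
    current = ""
    for sentence in _SENTENCE_END.split(text.strip()):
        if not sentence:
            continue
        candidate = f"{current} {sentence}".strip()
        if current and len(candidate) > max_chars:
            yield current          # emit what we have, start a new chunk
            current = sentence
        else:
            current = candidate
    if current:
        yield current


chunks = list(chunk_text("One. Two! Three? Four.", max_chars=10))
```

Each yielded chunk would then be passed to whatever synthesis function the model port exposes; smaller chunks lower time-to-first-audio at the cost of less context for prosody.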

How the Layers Relate

Application (your TTS app)
        │
        ├── AVSpeechSynthesizer ─── System voices (compact/enhanced/Personal)
        │         │
        │         └── System inference (ANE/GPU, opaque to developer)
        │
        ├── CoreML / Personal Voice ─── Accessibility speech
        │         │
        │         └── On-device training + inference (Apple managed)
        │
        └── MLX ─── Custom models and model ports
                  │
                  ├── GPU ─── Dynamic shapes, attention, variable length
                  └── CPU ─── Fallback, control flow, text processing

Practical Implications for Mac TTS Apps

When to use AVSpeechSynthesizer

  • You need a quick voice preview with minimal code
  • System voices are good enough for your use case
  • You do not need custom models, voice cloning, or fine control over the generated audio
  • You are building a proof-of-concept or accessibility tool

When to use MLX + custom models

  • You need to run specific open TTS models or custom model ports
  • You need offline inference without downloading enhanced voices
  • You need full control over raw audio output for further processing
  • You want to control chunking, streaming, and batch generation
  • You want voice cloning beyond Personal Voice

Hardware optimization strategy

  • Core ML-compatible fixed-shape models: Consider Core ML when conversion is clean
  • Transformer-heavy models: Target GPU for lowest latency
  • Hybrid pipelines: Use the runtime that best matches each stage rather than assuming one accelerator fits all
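
The strategy above can be written down as a simple per-stage decision rule. The Python sketch below encodes it as a starting heuristic only; the `Stage` fields and the `pick_runtime` function are hypothetical, and real placement should be confirmed by profiling on target hardware.

```python
from dataclasses import dataclass
from enum import Enum


class Runtime(Enum):
    CORE_ML = "Core ML"
    MLX_GPU = "MLX (GPU)"
    CPU = "CPU"


@dataclass(frozen=True)
class Stage:
    name: str
    fixed_shape: bool
    uses_attention: bool = False
    control_flow_heavy: bool = False


def pick_runtime(stage: Stage) -> Runtime:
    """Suggest a runtime for one pipeline stage (heuristic, not a rule)."""
    if stage.control_flow_heavy:
        return Runtime.CPU      # text processing and branching logic
    if stage.fixed_shape and not stage.uses_attention:
        return Runtime.CORE_ML  # clean fixed-shape conversion candidate
    return Runtime.MLX_GPU      # dynamic shapes and attention


pipeline = [
    Stage("text normalization", fixed_shape=False, control_flow_heavy=True),
    Stage("transformer decoder", fixed_shape=False, uses_attention=True),
    Stage("conv vocoder", fixed_shape=True),
]
plan = {stage.name: pick_runtime(stage) for stage in pipeline}
```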

What Apple does not provide

  • No public access to Apple’s own system TTS models at the Metal or Core ML level
  • No access to Personal Voice model data
  • No fine-grained audio streaming control from AVSpeechSynthesizer beyond the buffers delivered by write(_:toBufferCallback:)
  • No built-in support for third-party TTS model formats
  • No direct public ANE programming model for custom TTS engines: you get what Core ML can schedule

Summary

Layer                 Access Level                       Best For                     Limitations
AVSpeechSynthesizer   Public API                         System voice playback        No custom models, limited audio control
Personal Voice        System-managed, user-authorized    Accessibility speech         No general cloning API, Apple-managed
ANE                   Hardware (indirect, via Core ML)   Efficient conv inference     Fixed-shape, no streaming-friendly programming model
MLX                   Open source framework              Custom TTS models, control   Requires model conversion, smaller ecosystem than PyTorch

For a Mac TTS app like Spokio, the practical lesson is that Apple’s built-in speech APIs are useful but limited, while custom local TTS requires a separate model stack. Spokio uses Chatterbox Turbo for offline voice generation rather than relying on AVSpeechSynthesizer or Personal Voice. Personal Voice remains a user-facing accessibility feature that coexists with, but does not extend, the TTS API surface Apple exposes.
