Speech to Text on Mac: Local ASR with Whisper, MLX, and Open-Source Models

If you have used Otter.ai, rev.com, or Google’s speech APIs, you have experienced cloud-based automatic speech recognition (ASR). Your audio leaves your machine, gets processed on remote servers, and text comes back.

There is another path. You can run ASR locally on your Mac — no upload to an ASR API, no API key, no per-minute billing. OpenAI’s Whisper made this practical in late 2022, and the ecosystem has matured rapidly since. MLX Whisper can run much faster than the original PyTorch implementation on Apple Silicon. whisper.cpp runs across a wide range of hardware. CoreML-optimized models can support near-real-time workflows on modern Macs.

This guide covers everything you need to build a local transcription pipeline on a Mac: model options, code, performance numbers, and tradeoffs.

What ASR Is — and How It Differs from TTS

Automatic speech recognition (ASR) converts audio into text. It is the inverse of text-to-speech (TTS), which converts text into audio.

	ASR	TTS
Input	Audio (spoken language)	Text
Output	Text (transcribed words)	Audio (synthetic speech)
Core challenge	Handling noise, accents, overlapping speech	Handling prosody, emotion, naturalness
Latency tolerance	Near-real-time preferred for live captioning	Often batch-processed
Typical architecture	Encoder-decoder with audio embeddings	Transformer or diffusion-based vocoder

The practical difference matters when you are building Mac workflows. TTS is compute-heavy on the decoding side (generating waveforms). ASR is compute-heavy on the encoding side (parsing spectrograms into tokens). On Apple Silicon, Metal, CoreML, MLX, and ANE-friendly operators can make certain ASR models fast — especially when the model and runtime are optimized for the hardware.

OpenAI Whisper: The Model That Changed Everything

Before Whisper, open-source ASR was fragmented. There was DeepSpeech (Mozilla, now archived), Kaldi (powerful but brutally complex), and various academic models that required dataset-specific training. None of them generalized well across domains.

Whisper changed this. Released in September 2022 under an MIT license, Whisper is a Transformer-based encoder-decoder model trained on 680,000 hours of multilingual audio. The training data covers roughly 100 languages, so Whisper handles non-English transcription, and to some degree code-switching, without separate language models.

Architecture Overview

Whisper’s architecture is simpler than it looks:

Input: A log-Mel spectrogram with 80 channels (128 for large-v3), computed from 30-second audio chunks.
Encoder: A stack of Transformer encoder blocks (varies by model size) that processes the spectrogram into latent representations.
Decoder: A Transformer decoder that autoregressively predicts text tokens, conditioned on the encoder output.
Output: Token IDs that map to text via a multilingual tokenizer — or special tokens like <|startoftranscript|>, language tags, and timestamp markers.
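To make the input step concrete, here is a minimal sketch of the framing arithmetic in NumPy. The constants (16 kHz audio, a hop of 160 samples, a 30-second window) follow the reference implementation; the pad_or_trim helper below is a simplified stand-in written for this guide, not the library function.

```python
import numpy as np

SAMPLE_RATE = 16_000      # Whisper expects 16 kHz mono audio
HOP_LENGTH = 160          # samples between spectrogram frames (10 ms)
CHUNK_SECONDS = 30        # fixed window the encoder sees
N_MELS = 80               # Mel channels (large-v3 uses 128)

N_SAMPLES = SAMPLE_RATE * CHUNK_SECONDS   # samples per 30-second window
N_FRAMES = N_SAMPLES // HOP_LENGTH        # spectrogram frames per window


def pad_or_trim(audio: np.ndarray, length: int = N_SAMPLES) -> np.ndarray:
    """Zero-pad or cut a 1-D waveform to exactly `length` samples."""
    if audio.shape[0] >= length:
        return audio[:length]
    return np.pad(audio, (0, length - audio.shape[0]))


# A 12-second clip is padded with silence up to the 30-second window.
clip = np.zeros(12 * SAMPLE_RATE, dtype=np.float32)
window = pad_or_trim(clip)
encoder_input_shape = (N_MELS, N_FRAMES)
print(window.shape, encoder_input_shape)
```

The fixed window is why very short clips cost about as much encoder time as a full 30 seconds: the encoder always sees the same input shape.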

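The model predicts everything in one decoding pass: the spoken language, the transcription, segment-level timestamps, and a no-speech signal for silent stretches. Word-level timestamps are not predicted directly; they are derived afterwards by aligning the decoder's cross-attention with the audio, and they are available at every model size. There is no separate language model or lexicon. Whisper is end-to-end.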

Model Sizes and Tradeoffs

Whisper comes in five sizes. Larger models often outperform smaller ones, but the exact tradeoff depends on audio quality, language, hardware, and runtime.

Model	Parameters	VRAM (fp16)	Relative Speed (GPU)	Word Error Rate (LibriSpeech)
tiny	39M	~1 GB	32x	7.7%
base	74M	~1 GB	16x	5.8%
small	244M	~1.5 GB	6x	4.8%
medium	769M	~3 GB	2x	4.0%
large-v3	1.55B	~5 GB	1x	3.5%

The relative speed column is approximate and depends on hardware. On Apple Silicon with MLX, the gap narrows significantly — more on that below.

The important takeaway: for clean English audio in a quiet room, small or base is often good enough. For accented speech, noisy environments, or non-English languages, you want medium or large-v3.
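If you want to compare models on your own recordings, word error rate is easy to compute yourself. The sketch below is a simplified version: it lowercases and splits on whitespace only, whereas published benchmarks usually apply heavier text normalization, so its numbers will not match the table exactly. The function name is illustrative.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / reference word count."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # Word-level edit distance via dynamic programming, one row at a time.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,         # deletion
                curr[j - 1] + 1,     # insertion
                prev[j - 1] + cost,  # match or substitution
            )
        prev = curr
    return prev[-1] / len(ref)


# One substitution ("sat" -> "sit") and one deletion ("the") over six words.
score = word_error_rate("the cat sat on the mat", "the cat sit on mat")
print(f"WER: {score:.1%}")
```

Transcribe a few minutes of your own audio by hand, run each model size against it, and pick the smallest model whose WER you can live with.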

large-v3 vs large-v2

Whisper large-v3 (released November 2023) improved on v2 primarily through more training data, including a large portion of audio pseudo-labeled with large-v2. The WER improvement is modest across many benchmarks. The architecture is essentially unchanged: the visible differences are that the input uses 128 Mel channels instead of 80 and that the tokenizer gains a language token for Cantonese.

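For most practical purposes, v2 and v3 are interchangeable in code. Both use the same Hugging Face pipeline and the same MLX loading code, and the runtime reads the Mel channel count from the checkpoint. The two checkpoints are nearly the same size on disk (roughly 3 GB each), so disk space is rarely a reason to prefer one over the other. Pick v3 if you want every fraction of a percent of accuracy, and keep v2 if your pipeline is already validated against it.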

Running Whisper Locally: Four Approaches

There are four serious ways to run Whisper on a Mac, each with different tradeoffs.

1. Official OpenAI Whisper (PyTorch)

The original implementation. Works on any platform with PyTorch, but is the slowest option on Apple Silicon because PyTorch’s MPS backend (Metal Performance Shaders) does not accelerate every operator Whisper needs.

pip install openai-whisper torch torchaudio

import whisper

model = whisper.load_model("small")

result = model.transcribe("recording.mp3")
print(result["text"])

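That is all the code you need. whisper.load_model downloads the model weights on first use, caches them, and returns a model object ready for transcription. Decoding MP3 and other compressed formats relies on the ffmpeg command-line tool, so install it first (for example with Homebrew).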

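The transcribe method accepts a file path and returns a dictionary with text, segments, and language. Each segment carries start and end times. Per-word timestamps are added only when you pass word_timestamps=True, and that option works with any model size.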

result = model.transcribe("meeting.wav", language="en", verbose=True)
for segment in result["segments"]:
    print(f"[{segment['start']:.2f} → {segment['end']:.2f}] {segment['text']}")

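What to watch for: The first run downloads and loads the model, which may take 10–30 seconds or longer on a slow connection. Later transcriptions in the same process reuse the loaded weights. On a Mac, load_model places the model on the CPU unless you pass a device, and on the CPU transcribe falls back to fp32 with a warning, so calling .half() on the model is not a reliable speedup there. You can try PyTorch's MPS backend instead, but treat it as experimental, since whether it works depends on your PyTorch and Whisper versions:

model = whisper.load_model("small", device="mps")

Even on MPS, PyTorch-based Whisper leaves performance on the table. Some ops are not implemented in the MPS backend, and the resulting errors or CPU fallbacks hurt throughput.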

Verdict: Works, easy to install, but the slowest option. Good for prototyping. Bad for batch processing.

2. faster-whisper (CTranslate2)

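faster-whisper reimplements Whisper's inference on top of CTranslate2, a C++ inference engine optimized for Transformer models. It uses the same model weights, converted to CTranslate2 format, with a different runtime. The project reports speedups of up to 4x over the PyTorch implementation at the same accuracy, with lower memory use. The actual gain depends on hardware and quantization.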

pip install faster-whisper

from faster_whisper import WhisperModel

model = WhisperModel("small", device="cpu", compute_type="int8")

segments, info = model.transcribe("recording.mp3", beam_size=5)

print(f"Detected language: {info.language} (p={info.language_probability:.2f})")

for segment in segments:
    print(f"[{segment.start:.2f} → {segment.end:.2f}] {segment.text}")

Key parameters:

device: "cpu", "cuda", or "auto". CTranslate2 has no Metal backend and CUDA is not available on Mac, so the model runs on the CPU. "cpu" with int8 is the sweet spot.
compute_type: "int8_float16", "int8", "float16", or "float32". Integer quantization gives 2–3x speedup over fp32 with negligible accuracy loss.
beam_size: Number of beams for beam search. 5 is a good default. Higher beams improve accuracy slightly but cost linearly more compute.
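One detail that surprises people: faster-whisper returns segments as a lazy generator, so the actual transcription work happens while you iterate, and the generator can only be consumed once. A small sketch that materializes it into plain dictionaries follows. collect_segments is a hypothetical helper name, and the stand-in objects only mimic the start, end, and text attributes used in the example above so the code runs without a model.

```python
from types import SimpleNamespace
from typing import Iterable


def collect_segments(segments: Iterable) -> list[dict]:
    """Consume a segment iterator once and keep plain, reusable records."""
    return [
        {"start": float(s.start), "end": float(s.end), "text": s.text.strip()}
        for s in segments
    ]


# Stand-in generator so the sketch runs without downloading a model.
def fake_segments():
    yield SimpleNamespace(start=0.0, end=2.4, text=" Hello there.")
    yield SimpleNamespace(start=2.4, end=5.1, text=" Welcome back.")


records = collect_segments(fake_segments())
print(len(records), records[0]["text"])
```

With a real model you would pass the first value returned by model.transcribe(...) instead of fake_segments(). Timing the call to collect_segments, rather than the call to transcribe, gives you the real transcription time.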

faster-whisper also supports word-level timestamps via the word_timestamps=True flag:

segments, _ = model.transcribe("recording.mp3", word_timestamps=True)
for segment in segments:
    for word in segment.words:
        print(f"  {word.word} [{word.start:.2f} → {word.end:.2f}]")

Verdict: The best option if you need CPU inference or cannot use MLX. CTranslate2 is mature, well-tested, and handles quantization gracefully.

3. MLX Whisper — Optimized for Apple Silicon

MLX Whisper is Apple’s MLX framework applied to Whisper inference. MLX is a NumPy-compatible array framework designed specifically for Apple Silicon’s unified memory architecture. It uses the Metal GPU directly, bypassing the overhead of PyTorch’s MPS bridge.

The result: MLX Whisper can run substantially faster than PyTorch Whisper on the same Apple Silicon hardware, while using less memory.

pip install mlx-whisper

This installs MLX as a dependency. The source code lives in the whisper directory of https://github.com/ml-explore/mlx-examples if you want to read or modify the inference code.

import mlx_whisper

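result = mlx_whisper.transcribe(
    "recording.mp3",
    path_or_hf_repo="mlx-community/whisper-large-v3-mlx",
)
print(result["text"])

The API mirrors OpenAI's. The key difference is that you pass path_or_hf_repo, a local directory or Hugging Face repo name containing MLX-format weights, instead of creating a model object first. If you omit it, a small default model is used.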

Why it is faster: Apple Silicon's unified memory means the CPU and GPU share the same memory pool, and MLX is designed around that. Arrays live in shared memory, and operations run on the GPU through Metal without explicit transfers. PyTorch's MPS backend runs on the same hardware, but it was added to a framework built around discrete GPUs, and ops it does not support can fall back to the CPU. That overhead hurts most in the token-by-token loop of Whisper's decoder.

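MLX also supports fp16 and quantized inference. Precision is a property of the converted weights rather than a transcribe flag, so you choose it by pointing at a different model directory or repo:

# fp16 weights from the mlx-community organization on Hugging Face
result = mlx_whisper.transcribe(
    "recording.mp3", path_or_hf_repo="mlx-community/whisper-large-v3-mlx"
)

# 4-bit weights: convert once with the convert.py script in mlx-examples/whisper
# (its -q and --q-bits options), then point at the output directory.
# "mlx_models/large-v3-4bit" is a hypothetical local path.
result = mlx_whisper.transcribe(
    "recording.mp3", path_or_hf_repo="mlx_models/large-v3-4bit"
)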

Quantization can reduce memory and improve throughput, with an accuracy tradeoff that depends on audio quality and model size.
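The memory side of that tradeoff is simple arithmetic: weights take roughly parameter count times bits per weight. The sketch below applies it to large-v3. It estimates the weights only. Quantized files also store per-group scales, and activations and decoder caches add to peak RAM, so real usage is higher than these figures.

```python
def weight_size_gb(n_params: float, bits_per_weight: int) -> float:
    """Approximate size of the weights alone, in decimal gigabytes."""
    return n_params * bits_per_weight / 8 / 1e9


LARGE_V3_PARAMS = 1.55e9  # parameter count from the model size table

for label, bits in [("fp32", 32), ("fp16", 16), ("int8", 8), ("4-bit", 4)]:
    print(f"{label:>5}: {weight_size_gb(LARGE_V3_PARAMS, bits):.2f} GB")
```

The fp16 estimate lines up with the roughly 3 GB download size for large-v3, which is a useful sanity check on the formula.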

Real-world numbers (transcribing a 60-minute podcast on M2 Max, large-v3):

Implementation	Wall Time	Peak RAM	Speed Factor
PyTorch Whisper (fp32)	~22 min	12 GB	1x
PyTorch Whisper (fp16)	~14 min	7 GB	1.6x
faster-whisper (int8, CPU)	~8 min	5 GB	2.75x
MLX Whisper (fp16)	~5 min	6 GB	4.4x
MLX Whisper (4-bit)	~4 min	2.5 GB	5.5x

These are approximate and vary by Mac model, thermal conditions, and audio complexity. The general pattern is that MLX is among the fastest options on Apple Silicon, but verify it on your own hardware.
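To reproduce this kind of comparison on your own Mac, you only need wall-clock timing and two ratios. The helpers below are a minimal sketch with illustrative names. The sample figures at the bottom are taken from the table above.

```python
import time
from typing import Callable


def speed_factor(baseline_minutes: float, candidate_minutes: float) -> float:
    """How many times faster the candidate is than the baseline."""
    return baseline_minutes / candidate_minutes


def real_time_factor(audio_minutes: float, wall_minutes: float) -> float:
    """Minutes of audio processed per minute of wall-clock time."""
    return audio_minutes / wall_minutes


def time_call(fn: Callable[[], object]) -> float:
    """Run `fn` once and return elapsed wall-clock seconds."""
    start = time.perf_counter()
    fn()
    return time.perf_counter() - start


# Table rows: 22 min for the fp32 baseline, 5 min for MLX fp16, 60 min of audio.
print(f"{speed_factor(22, 5):.1f}x vs baseline")
print(f"{real_time_factor(60, 5):.0f}x real time")
```

Wrap your transcription call in a lambda and pass it to time_call. Run each configuration more than once, since the first run often includes model download and warm-up time.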

Verdict: A strong option for Apple Silicon Macs, especially for local batch transcription.

4. whisper.cpp (C++ / CoreML)

whisper.cpp is a C/C++ implementation by Georgi Gerganov (the same author as llama.cpp). It runs on almost anything: Mac, Linux, Windows, iOS, Android, even a Raspberry Pi. On Mac, it can use Metal for GPU acceleration and CoreML to run the encoder on the Apple Neural Engine.

git clone https://github.com/ggerganov/whisper.cpp.git
cd whisper.cpp
make
# Download model weights
bash models/download-ggml-model.sh small.en
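# Transcribe (the README recommends 16 kHz, 16-bit WAV input)
# Recent releases build with CMake and name the binary build/bin/whisper-cli instead of ./main
./main -m models/ggml-small.en.bin -f recording.wav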

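For CoreML acceleration, generate a CoreML version of the encoder and rebuild with CoreML support enabled. With the Makefile-based build this looks like:

pip install ane_transformers openai-whisper coremltools
./models/generate-coreml-model.sh small.en
make clean
WHISPER_COREML=1 make -j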

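The CoreML path runs the encoder through CoreML, which can schedule it on the Apple Neural Engine, while the decoder stays in GGML. The project reports encoder speedups of more than 3x over CPU-only execution. End-to-end gains are smaller and vary by model and Mac. Expect the first run to be slow, because CoreML compiles the model for your device.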

whisper.cpp also supports streaming:

./stream -m models/ggml-small.en.bin -t 8 --step 500 --length 5000

This captures microphone input in 500ms chunks with 5-second windows, printing live transcriptions.
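The step and length options describe a sliding window: every step, the most recent length of audio is transcribed again. The NumPy sketch below illustrates that idea only. It is not a port of the whisper.cpp stream example, and the function name is made up for this guide.

```python
import numpy as np

SAMPLE_RATE = 16_000


def sliding_windows(audio: np.ndarray, step_ms: int = 500, length_ms: int = 5000):
    """Yield (end_sample, window): every `step_ms`, the last `length_ms` of audio."""
    step = SAMPLE_RATE * step_ms // 1000
    length = SAMPLE_RATE * length_ms // 1000
    for end in range(step, len(audio) + 1, step):
        start = max(0, end - length)
        yield end, audio[start:end]


# Seven seconds of silence stands in for microphone input.
audio = np.zeros(7 * SAMPLE_RATE, dtype=np.float32)
windows = list(sliding_windows(audio))
print(len(windows), len(windows[0][1]), len(windows[-1][1]))
```

Because consecutive windows overlap heavily, the same audio is decoded many times. That is the price of low latency, and it is why streaming setups usually pair with small models.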

The tradeoff against MLX: whisper.cpp uses the GGML weight format with its own quantization schemes (q4_0, q5_1, q8_0). The quantization is good, but with the CoreML path only the encoder is offloaded, while the decoder runs through GGML. MLX runs both the encoder and the decoder on the GPU. whisper.cpp's Metal backend narrows this gap, so compare both on your own hardware.

Verdict: Strong for cross-platform deployments and embedded devices. On Mac specifically, MLX Whisper may be faster and simpler for many users.

CoreML Optimizations for Real-Time Transcription

Beyond whisper.cpp, you can export a Whisper encoder to CoreML directly using Apple's coremltools. This converts the PyTorch module into a .mlpackage that CoreML can schedule on the ANE (Apple Neural Engine), GPU, or CPU.

The workflow:

pip install coremltools torch openai-whisper

import torch
import whisper
import coremltools as ct

# Load the PyTorch model on the CPU and switch to inference mode
model = whisper.load_model("small", device="cpu").eval()
encoder = model.encoder

# Dummy input: batch of 1, 80 Mel channels, 3000 frames (30 seconds)
dummy_input = torch.randn(1, 80, 3000)

# Trace the encoder so coremltools can read its graph
with torch.no_grad():
    traced_encoder = torch.jit.trace(encoder, dummy_input)

# Convert to a CoreML ML Program and save it
mlmodel = ct.convert(
    traced_encoder,
    inputs=[ct.TensorType(shape=dummy_input.shape, name="audio")],
    outputs=[ct.TensorType(name="features")],
    convert_to="mlprogram",
    minimum_deployment_target=ct.target.macOS13,
)

mlmodel.save("WhisperEncoder.mlpackage")

The decoder is harder to convert because it is autoregressive — each token depends on the previous one. CoreML does not handle dynamic loops well. The practical approach is to run the encoder on CoreML/ANE and the decoder on CPU.
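The data-dependent loop is easier to see in code. The sketch below is a generic greedy decoding loop with a toy stand-in for the decoder. It is not Whisper's actual decoding code, and the names and the length cap are illustrative.

```python
from typing import Callable, Sequence


def greedy_decode(
    next_token: Callable[[Sequence[int]], int],
    start_tokens: Sequence[int],
    end_token: int,
    max_tokens: int = 224,  # arbitrary cap for this sketch
) -> list[int]:
    """Append one token at a time until `end_token` or the length cap."""
    tokens = list(start_tokens)
    for _ in range(max_tokens):
        token = next_token(tokens)   # one decoder forward pass per step
        if token == end_token:
            break
        tokens.append(token)
    return tokens[len(start_tokens):]


# Toy stand-in for the decoder: counts upward, then emits the end token (0).
def toy_next_token(tokens: Sequence[int]) -> int:
    return tokens[-1] + 1 if tokens[-1] < 5 else 0


print(greedy_decode(toy_next_token, start_tokens=[1], end_token=0))
```

The number of iterations is unknown until the end token appears, and each step's input grows by one token. A fixed-shape traced graph handles the encoder well, but this loop has to live in host code around the model.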

Apple's own Speech framework (SFSpeechRecognizer) is the most tightly integrated option for live captioning on Mac, but it has limitations: on-device recognition is available only for some languages and has to be requested explicitly (otherwise audio may be sent to Apple's servers), it supports fewer languages than Whisper, and you cannot control the model version.

When CoreML matters: If you are building a Mac app that needs real-time or near-real-time transcription (e.g., live captions, voice typing), running a small model's encoder through CoreML and its decoder on the CPU can be power-efficient. MLX Whisper often wins on batch throughput, while CoreML can be attractive for app integration.

Accuracy: Whisper large-v3 vs Cloud APIs

Cloud ASR APIs have a head start in hosted features, language-specific tuning, diarization, and custom post-processing.

Whisper large-v3 can be competitive, depending on the dataset:

Test Set	Whisper large-v3	Google Chirp	Azure Speech	Notes
LibriSpeech clean	3.5% WER	2.8% WER	3.1% WER	Studio-recorded English
LibriSpeech other	6.8% WER	5.9% WER	6.2% WER	Accented/noisy English
Common Voice (en)	8.1% WER	7.0% WER	7.5% WER	Crowdsourced, various conditions
TED-LIUM 3	5.2% WER	4.1% WER	4.5% WER	Lecture-style English
Multilingual (FLEURS)	12.4% WER	10.8% WER	11.5% WER	102 languages

These numbers are illustrative benchmark snapshots. Check current vendor and model documentation before making accuracy claims.

The pattern in many comparisons is that cloud APIs can lead on difficult audio and advanced hosted features. But “better” depends on context. For clean, studio-quality audio, the gap may be small enough that workflow and privacy matter more.

The real difference is not accuracy — it is features. Cloud APIs offer:

Speaker diarization: “Who said what” for multi-speaker audio.
Custom vocabularies: Domain-specific terms (medical, legal, technical).
Real-time streaming: True streaming with sub-200ms latency.
Punctuation restoration: Trained specifically for punctuation, not the general LLM-style prediction Whisper uses.

Whisper’s counter-advantages:

Low marginal cost at scale: Once you have the hardware, additional transcription avoids per-minute API billing.
Privacy: Audio never leaves your machine.
Latency: No network round-trip. For batch transcription, local processing can be faster than upload, queueing, processing, and download.
Offline: Works on a plane, in a remote cabin, or in an air-gapped environment.

Bottom line: If you need hosted diarization, custom vocabularies, or managed real-time infrastructure, a cloud API may be the right choice. If you value privacy, low marginal cost, or offline work, local Whisper-based workflows are worth considering.

Language Support

Whisper supports roughly 100 languages out of the box. The model detects the spoken language automatically, or you can hard-code it:

# Auto-detect
result = model.transcribe("spanish_audio.mp3")
print(result["language"])  # "es"

# Hard-code for reliability
result = model.transcribe("spanish_audio.mp3", language="es")

The quality varies by language. Whisper performs best on European languages with large representation in the training data (English, Spanish, French, German, Italian, Portuguese, Dutch). It performs adequately on East and Southeast Asian languages (Japanese, Chinese, Korean, Thai) with somewhat higher error rates. It struggles on low-resource languages, including many indigenous languages and regional dialects.

With word_timestamps=True, which works at any model size, timing is usually accurate enough to generate subtitles with proper sync. The example below writes one SRT cue per segment and uses the format_timestamp_srt helper defined in the workflow section later in this guide:

result = model.transcribe("podcast.mp3", word_timestamps=True)

# Generate SRT subtitles (SRT timestamps need milliseconds: HH:MM:SS,mmm)
with open("subtitles.srt", "w", encoding="utf-8") as f:
    for i, segment in enumerate(result["segments"], 1):
        start = format_timestamp_srt(segment["start"])
        end = format_timestamp_srt(segment["end"])
        text = segment["text"].strip()
        f.write(f"{i}\n")
        f.write(f"{start} --> {end}\n")
        f.write(f"{text}\n\n")
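Segments can be too long for comfortable subtitles. If you have word timings, one option is to regroup them into shorter cues. The sketch below assumes each word is a dictionary with word, start, and end keys, as in the words list that openai-whisper attaches to each segment when word_timestamps=True. The 42-character limit and the 0.8-second pause threshold are arbitrary assumptions, and joining with spaces only suits languages that separate words with spaces.

```python
def words_to_cues(words: list[dict], max_chars: int = 42, max_gap: float = 0.8) -> list[dict]:
    """Group word timings into short subtitle cues.

    A new cue starts when adding a word would exceed `max_chars`
    or when the pause since the previous word is longer than `max_gap` seconds.
    """
    cues: list[dict] = []
    for w in words:
        text = w["word"].strip()
        if not text:
            continue
        last = cues[-1] if cues else None
        fits = last is not None and len(last["text"]) + 1 + len(text) <= max_chars
        close = last is not None and w["start"] - last["end"] <= max_gap
        if fits and close:
            last["text"] += " " + text
            last["end"] = w["end"]
        else:
            cues.append({"start": w["start"], "end": w["end"], "text": text})
    return cues


# Hand-written sample timings for illustration.
words = [
    {"word": " Local", "start": 0.00, "end": 0.40},
    {"word": " transcription", "start": 0.40, "end": 1.10},
    {"word": " works.", "start": 1.10, "end": 1.60},
    {"word": " Next", "start": 3.00, "end": 3.30},
    {"word": " topic.", "start": 3.30, "end": 3.80},
]
for cue in words_to_cues(words):
    print(cue)
```

Each resulting cue has the same start, end, and text keys as a segment, so it can be written out with the same SRT loop.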

Language detection is useful but not perfect. For code-switched audio, test your own recordings before relying on automatic detection.
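One defensive pattern is to trust detection only when it is confident and otherwise fall back to a language you choose. The sketch below works on a dictionary of language probabilities. The 0.6 threshold and the function name are assumptions for illustration, and the sample probabilities are made up.

```python
def pick_language(probs: dict[str, float], default: str = "en", min_confidence: float = 0.6) -> str:
    """Return the most likely language code, or `default` when confidence is low."""
    if not probs:
        return default
    code, confidence = max(probs.items(), key=lambda item: item[1])
    return code if confidence >= min_confidence else default


# Made-up probabilities; a real run would supply its own.
print(pick_language({"es": 0.91, "pt": 0.06, "en": 0.03}))
print(pick_language({"es": 0.48, "en": 0.45, "pt": 0.07}))
```

With faster-whisper, for example, info.language and info.language_probability give you the top entry directly, so the same threshold check applies without building the dictionary.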

Practical Mac Workflow: Record → Transcribe → Export

Here is a complete workflow you can build in a single Python script. It records from the Mac microphone, transcribes with MLX Whisper, and exports to Markdown and SRT.

Step 1: Record Audio

On Mac, you can install sox for command-line recording, or use Python’s sounddevice:

pip install sounddevice scipy numpy mlx-whisper

import sounddevice as sd
import numpy as np
import scipy.io.wavfile as wav

def record_audio(filename="recording.wav", duration=60, samplerate=16000):
    print(f"Recording for {duration} seconds...")
    audio = sd.rec(
        int(duration * samplerate),
        samplerate=samplerate,
        channels=1,
        dtype="float32",
    )
    sd.wait()
    wav.write(filename, samplerate, np.int16(audio * 32767))
    print(f"Saved {filename}")
    return filename

Step 2: Transcribe with MLX Whisper

import mlx_whisper

def transcribe_audio(filename):
    result = mlx_whisper.transcribe(
        filename,
        path_or_hf_repo="mlx-community/whisper-large-v3-mlx",
        language="en",
        word_timestamps=True,
    )
    return result

Step 3: Export Formats

def export_markdown(result, output="transcript.md"):
    with open(output, "w") as f:
        f.write("# Transcript\n\n")
        for segment in result["segments"]:
            start = format_timestamp(segment["start"])
            end = format_timestamp(segment["end"])
            f.write(f"[{start} → {end}]  \n")
            f.write(f"{segment['text'].strip()}\n\n")
    print(f"Markdown saved to {output}")

def export_srt(result, output="subtitles.srt"):
    with open(output, "w") as f:
        for i, segment in enumerate(result["segments"], 1):
            start = format_timestamp_srt(segment["start"])
            end = format_timestamp_srt(segment["end"])
            f.write(f"{i}\n")
            f.write(f"{start} --> {end}\n")
            f.write(f"{segment['text'].strip()}\n\n")
    print(f"SRT saved to {output}")

def format_timestamp(seconds):
    h = int(seconds // 3600)
    m = int((seconds % 3600) // 60)
    s = int(seconds % 60)
    return f"{h:02d}:{m:02d}:{s:02d}"

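def format_timestamp_srt(seconds):
    # Work in whole milliseconds to avoid float truncation (e.g. 1.001 s losing its last millisecond)
    total_ms = round(seconds * 1000)
    h, rem = divmod(total_ms, 3_600_000)
    m, rem = divmod(rem, 60_000)
    s, ms = divmod(rem, 1000)
    return f"{h:02d}:{m:02d}:{s:02d},{ms:03d}"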

Step 4: Run It

filename = record_audio(duration=120)
result = transcribe_audio(filename)
export_markdown(result)
export_srt(result)
print("Done. Full transcript:")
print(result["text"])

This pattern scales to batch processing entire podcast directories:

import glob

for audio_file in glob.glob("podcasts/*.mp3"):
    print(f"Transcribing {audio_file}...")
    result = transcribe_audio(audio_file)
    export_markdown(result, audio_file.replace(".mp3", ".md"))
    export_srt(result, audio_file.replace(".mp3", ".srt"))

Pro Tip: Split Long Audio

Whisper processes 30-second chunks internally, but long audio files can cause memory issues with the large model. Split before transcribing:

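from pydub import AudioSegment  # needs ffmpeg installed to read MP3

audio = AudioSegment.from_file("long_podcast.mp3")
chunk_length_ms = 10 * 60 * 1000  # 10 minutes

# Note: fixed-length cuts can split a word; cutting at silences is safer.
combined_segments = []
for i, start_ms in enumerate(range(0, len(audio), chunk_length_ms)):
    chunk_path = f"chunk_{i:03d}.wav"
    audio[start_ms : start_ms + chunk_length_ms].export(chunk_path, format="wav")
    result = transcribe_audio(chunk_path)
    # Shift chunk-relative timestamps back onto the full recording's timeline
    offset = start_ms / 1000
    for seg in result["segments"]:
        combined_segments.append({**seg, "start": seg["start"] + offset, "end": seg["end"] + offset})

combined_text = " ".join(seg["text"].strip() for seg in combined_segments)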

On Apple Silicon with large-v3 and MLX, throughput varies widely by Mac model, memory, thermal conditions, and audio complexity. Benchmark your own workload before choosing chunk size.

Model Size vs Speed vs Accuracy vs RAM

Here is the reference table for Mac users choosing a model:

Model	Params	Download	Peak RAM	WER (En)	Speed (M2 Max)	Best For
tiny	39M	75 MB	1 GB	7.7%	200x	Real-time, low-resource
base	74M	142 MB	1.2 GB	5.8%	100x	Live captions
small	244M	466 MB	2 GB	4.8%	40x	General transcription
medium	769M	1.5 GB	3.5 GB	4.0%	15x	Accuracy-focused batch
large-v3	1.55B	3.1 GB	6 GB	3.5%	8x	Maximum accuracy

The speed column is best read as a multiple of real time, taken from hardware-specific MLX Whisper fp16 snapshots. Treat the values as rough guidance, not guarantees.

The memory figures matter most for the large-v3 model. On a Mac with 8GB RAM, large-v3 will push into swap, slowing everything down. With 16GB or more, it fits comfortably.

Practical recommendations by Mac model:

MacBook Air (M1/M2/M3, 8GB RAM): Use small or medium. large-v3 will work but the system will swap heavily.
MacBook Pro (M2/M3/M4, 16GB+): Use large-v3 with MLX. It fits in memory and transcribes fast enough for interactive use.
Mac Mini / Mac Studio: Consider large-v3 if you have enough memory. These machines often have better thermal headroom for sustained transcription.
Any Mac, low power priority: tiny or base with CoreML encoder path. Good enough for dictation and live captions.
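The same recommendations can be written as a small rule of thumb. The sketch below picks the largest model whose peak RAM from the table above fits in a share of total memory. The 50% headroom is an assumption, not a measured figure, so adjust it to how much else you run alongside transcription.

```python
# Peak RAM estimates (GB) copied from the table above, smallest to largest.
PEAK_RAM_GB = {
    "tiny": 1.0,
    "base": 1.2,
    "small": 2.0,
    "medium": 3.5,
    "large-v3": 6.0,
}


def choose_model(total_ram_gb: float, headroom: float = 0.5) -> str:
    """Pick the largest model whose peak RAM fits in a fraction of total memory.

    `headroom` is the assumed share of RAM available to the model; the rest
    is left for macOS and other apps.
    """
    budget = total_ram_gb * headroom
    fitting = [name for name, ram in PEAK_RAM_GB.items() if ram <= budget]
    return fitting[-1] if fitting else "tiny"


for ram in (8, 16, 32):
    print(f"{ram} GB -> {choose_model(ram)}")
```

This only accounts for memory. If latency or battery life matters more than accuracy, step down a size or two from whatever it suggests.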

Privacy Benefits of Local Transcription

Cloud ASR sends your audio to servers you do not control. For a podcast transcription this might be fine. For a doctor’s dictation, legal deposition, or internal company meeting, it may require policy, legal, or compliance review.

Local transcription reduces these risks:

No audio upload: The WAV file stays on your disk.
No ASR API logging: A local workflow avoids creating ASR-provider logs for the audio you transcribe.
No network requirement: Transcribe on a plane, in a remote area, or on an air-gapped machine.
No per-minute API billing: You pay for the hardware and local runtime instead of a per-minute cloud meter.

The HIPAA and GDPR implications still depend on your organization, jurisdiction, storage, sharing, access controls, and retention policy. Local transcription can reduce cloud exposure, but it does not replace compliance review.

For journalists, local transcription can reduce third-party exposure for source interviews. The transcript file on your encrypted Mac may have a smaller external attack surface than audio processed through a third-party API.

The tradeoff depends on the audio and workflow: local transcription may be less accurate on difficult audio and slower for very large batches. For many Mac users transcribing meetings, podcasts, interviews, or voice notes, MLX Whisper on an M-series Mac can be practical enough.

Building Toward a Complete Mac Audio Workflow

ASR is one half of the audio pipeline. The other half — text-to-speech — is equally important for Mac users who want a complete local voice workflow. You record or import audio, transcribe it to text, edit the text, then generate new audio from the edited text.

Spokio is an offline text-to-speech app for Mac, not a speech-to-text app. For now, tools like MLX Whisper give you a solid local ASR pipeline without uploading audio to cloud servers. For the TTS side, Spokio is powered by Chatterbox Turbo for English voice generation, with local voice cloning, batch export, MP3/WAV/AIFF/M4A export, and no cloud uploads for text, audio, or voice samples.

The ecosystem is moving fast. Apple’s MLX framework, Whisper fine-tunes, and CoreML tooling continue to evolve. If you are building on Mac, local ASR is already practical for many workflows.