Voice Isolator on Mac: AI-Powered Source Separation for Clean Audio

Every audio editor runs into the same problem sooner or later: a great take ruined by background noise, a podcast guest recorded on a cheap mic in a noisy room, or an old interview where the voices bleed into each other.

For years, fixing this meant either re-recording or spending hours with spectral editing tools, cutting out noise by hand. AI source separation changed that. Models like Meta’s Demucs and Deezer’s Spleeter can now split a mixed audio file into stems: vocals, drums, bass, and other instruments. Related models isolate speech from background noise. The results are often useful as a first pass in post-production workflows.

This guide covers what source separation is, how the models work under the hood, how to run them on a Mac with Apple Silicon acceleration, and which tools belong in your pipeline.

What Is Audio Source Separation?

Source separation is the task of decomposing a mixed audio signal into its constituent parts. Given a single recording — say, a podcast with two people talking over Skype and a fan humming in the background — a separation model outputs individual tracks: Speaker A, Speaker B, and the room noise, each as a separate audio file.

The problem is fundamentally ill-posed. A mixed waveform is a single stream of amplitude values over time. If two sounds occupy the same frequency at the same moment, the model cannot directly measure how much energy belongs to each source. It has to infer that from learned patterns: the statistical structure of speech versus music, the harmonic profile of a piano versus a guitar, the temporal smoothness of a voice versus the percussive transients of a drum hit.
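A minimal illustration of why the problem is ill-posed, using made-up sample values rather than real audio: two different pairs of sources add up to exactly the same mixture, so the mixture alone cannot say which pair produced it.

```python
# Toy illustration (made-up sample values): mixing is sample-wise addition,
# so different source pairs can produce the identical mixture.

def mix(source_a, source_b):
    """Sum two equal-length mono signals sample by sample."""
    return [a + b for a, b in zip(source_a, source_b)]

# Candidate explanation 1: a louder voice plus quieter noise.
voice_1 = [0.50, -0.25, 0.75, 0.00]
noise_1 = [0.00, 0.25, -0.25, 0.25]

# Candidate explanation 2: a quieter voice plus louder noise.
voice_2 = [0.25, 0.00, 0.25, 0.25]
noise_2 = [0.25, 0.00, 0.25, 0.00]

mixture_1 = mix(voice_1, noise_1)
mixture_2 = mix(voice_2, noise_2)

# Both explanations are consistent with the same recorded waveform.
print(mixture_1 == mixture_2)
```

A trained model resolves the tie by preferring the explanation that looks most like the voices and noises it saw during training.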

That is why source separation took so long to reach usable quality. Traditional approaches — ICA (independent component analysis), NMF (non-negative matrix factorization), beamforming with microphone arrays — all required either multiple input channels or strong assumptions about the sources. Deep learning removed those constraints.

Modern separation models learn a mapping from a mixture to its individual sources, trained on large collections of isolated tracks that were artificially mixed together. The model never sees the “true” clean sources at inference time, but it has seen enough combinations during training to estimate them well.

Separation Tasks

Most models tackle one or more of these tasks:

Vocals / accompaniment separation — extracting the singing voice from a full music mix
Music source separation — splitting into drums, bass, vocals, and other (the “musdb18” four-stem standard)
Speech / noise separation — isolating a single voice from environmental noise
Diarization-style separation — separating multiple speakers in a conversation
Audio restoration — removing clicks, pops, hum, and hiss from archival recordings

Similar architectures can be adapted to several of these tasks. The main differences are the training data and the output configuration.

Key Models in 2026

Meta’s Demucs (HT-Demucs)

Demucs is one of the most widely used open-source separation models. Originally published by Meta’s FAIR lab in 2019, it has gone through several revisions. The current version, Hybrid Transformer Demucs (HT-Demucs), combines convolutional layers with a transformer and serves as a strong open-source baseline.

Demucs v4 operates on raw waveforms and spectrograms simultaneously. The hybrid design runs two convolutional U-Net branches in parallel, one on the time-domain signal and one on its spectrogram, and joins them in the middle with a transformer encoder that attends across both domains. This gives the model both precise temporal localization (from the waveform path) and rich frequency context (from the spectrogram path).

The pretrained models include:

htdemucs — 4 stems: vocals, drums, bass, other
htdemucs_ft — fine-tuned version with slightly better separation quality
htdemucs_6s — 6 stems: adds piano and guitar as separate outputs
hdemucs_mmi — the older Hybrid Demucs v3 architecture, retrained on MUSDB plus about 800 additional songs

Demucs runs on CPU, CUDA, and Apple’s MPS backend. Actual speed depends on model variant, audio length, memory, PyTorch version, and Mac hardware.

Deezer’s Spleeter

Spleeter was released by Deezer in 2019 and became the mainstream gateway into source separation. It uses a convolutional U-Net that estimates a spectrogram mask for each stem, and it produces 2, 4, or 5 stems depending on the model variant.

Spleeter was the first model that made separation accessible to non-researchers. Install two pip packages and run a single CLI command. The quality is noticeably behind Demucs v4 — especially on the “other” stem, where Spleeter tends to leak vocal artifacts — but it uses significantly less memory and runs faster on CPU.

One commonly cited difference is training data. Spleeter’s pretrained models were trained on Deezer’s internal dataset rather than on MUSDB18, the public benchmark, which contains only 150 full-length songs. The released HT-Demucs models were trained on MUSDB18-HQ plus roughly 800 additional songs.

RNNoise

RNNoise is a lightweight recurrent neural network for real-time noise suppression. It is not a full separation model — it does not produce separate stems — but it is excellent at one specific task: removing continuous background noise from speech while keeping the voice intact.

RNNoise uses a GRU (gated recurrent unit) architecture operating on a small set of spectral features computed frame by frame. It was trained on a large corpus of noisy speech and learns to estimate a gain for each frequency band. The model is small enough to run on a Raspberry Pi or inside a browser via WebAssembly.

Because RNNoise is extremely efficient and real-time capable, it is embedded in many audio tools under the hood, including OBS Studio’s noise suppression filter, the noise-suppression-for-voice plugin, FFmpeg’s arnndn filter, and some browser-based voice chat apps.

Meta’s Voicebox (Research Precursor)

Voicebox is Meta’s text-conditioned speech generation model that can also remove noise as a by-product of how it was trained. It was trained with flow matching to infill masked speech given the surrounding audio and a transcript, and it can perform infilling, denoising, and style transfer, including removing background noise from a recording by regenerating a clean version of the noisy segment.

Voicebox was never released as open weights, and Meta has not shipped a production version. But its architecture influenced the current generation of conditional speech models, and many of its ideas appear in more recent work.

Emerging Models: Open-Unmix, MossFormer, and Bandit

Beyond the models above, several other architectures deserve attention:

Open-Unmix — a simple bidirectional LSTM-based model that serves as a strong baseline. It is easy to train, modify, and deploy, and is a good starting point if you want to experiment with separation on your own data.
MossFormer — a gated single-head transformer for single-channel speech separation that combines local and global attention. Its reported results are on speech benchmarks such as WSJ0-2mix, so it is more relevant to the multi-speaker case than to music stems.
Bandit — a band-split network that processes different frequency ranges separately before merging them. It was introduced for cinematic audio (dialogue, music, and effects) and is too new for broad adoption, but it shows promise for multi-stem separation with fewer parameters.

How These Models Work Under the Hood

If you are the kind of person who just wants to run the tool and move on, skip this section. If you want to understand why Demucs sounds better than Spleeter, or why RNNoise can run in real-time while Demucs cannot, read on.

Spectrograms and the Time-Frequency Tradeoff

Many separation models operate on spectrograms rather than raw waveforms (hybrid models such as Demucs v4 use both). A spectrogram is a 2D representation of audio: time on the x-axis, frequency on the y-axis, and amplitude represented as color (or pixel intensity). It is computed with the Short-Time Fourier Transform (STFT), which chops the audio into overlapping windows and computes the frequency content of each window. Longer windows resolve frequency more finely but blur timing; shorter windows do the opposite.
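The time-frequency tradeoff in this section’s title comes down to simple arithmetic on the STFT window size. The sketch below assumes 44.1 kHz audio and a hop of one quarter of the window; the window sizes are illustrative choices, not the settings of any particular model.

```python
def stft_resolution(sample_rate, n_fft, hop_length):
    """Return (Hz per frequency bin, window length in ms, hop in ms)."""
    freq_resolution_hz = sample_rate / n_fft
    window_ms = 1000.0 * n_fft / sample_rate
    hop_ms = 1000.0 * hop_length / sample_rate
    return freq_resolution_hz, window_ms, hop_ms

# Illustrative window sizes: larger n_fft gives narrower bins but longer windows.
for n_fft in (512, 2048, 4096):
    hz, window_ms, hop_ms = stft_resolution(44100, n_fft, n_fft // 4)
    print(f"n_fft={n_fft}: {hz:.1f} Hz per bin, "
          f"{window_ms:.1f} ms window, {hop_ms:.1f} ms hop")
```

A long window can tell two close pitches apart but smears a drum hit across its whole duration, which is one reason some models combine several resolutions or add a waveform branch.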

The separation task becomes an image-to-image translation problem: given a mixed spectrogram, predict the spectrogram of each source. The model outputs a mask — a matrix of values between 0 and 1 for each time-frequency bin — that indicates how much of that bin belongs to each source. This is called spectrogram masking.
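Spectrogram masking can be sketched with tiny hand-written magnitude grids. The values below are invented, and the mask is an “oracle” mask computed from sources that are already known; a trained model has to predict a similar mask from the mixture alone. The sketch also assumes magnitudes simply add in the mixture, which ignores phase.

```python
def ratio_masks(source_magnitudes, eps=1e-12):
    """Given magnitude grids indexed [source][frame][bin], return one soft mask
    per source. Mask values lie in [0, 1] and sum to about 1 in every bin."""
    n_frames = len(source_magnitudes[0])
    n_bins = len(source_magnitudes[0][0])
    masks = []
    for src in source_magnitudes:
        mask = []
        for t in range(n_frames):
            row = []
            for f in range(n_bins):
                total = sum(s[t][f] for s in source_magnitudes) + eps
                row.append(src[t][f] / total)
            mask.append(row)
        masks.append(mask)
    return masks

def apply_mask(mixture_magnitude, mask):
    """Scale every time-frequency bin of the mixture by the mask value."""
    return [[m * w for m, w in zip(m_row, w_row)]
            for m_row, w_row in zip(mixture_magnitude, mask)]

# Invented 2-frame x 3-bin magnitudes: a voice in the lower bins, a hum higher up.
voice = [[4.0, 1.0, 0.0], [3.0, 1.0, 0.0]]
hum = [[0.0, 1.0, 2.0], [1.0, 1.0, 2.0]]
mixture = [[v + h for v, h in zip(v_row, h_row)] for v_row, h_row in zip(voice, hum)]

voice_mask, hum_mask = ratio_masks([voice, hum])
voice_estimate = apply_mask(mixture, voice_mask)
print(voice_mask)
```

Bins where only one source is active get a mask of 0 or 1; the shared middle bin gets 0.5, which is exactly the ambiguous case described in the next paragraphs.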

Masking works because sounds rarely occupy the exact same time-frequency bins with equal energy. Speech has a characteristic harmonic structure with formants concentrated in specific frequency ranges. A dog bark is broadband but brief. A bass guitar is concentrated at low frequencies with a steady harmonic series. The model learns to recognize these patterns and assign energy accordingly.

The limitation is that overlapping sounds in the same bin cannot be perfectly separated. Two voices in the same frequency range at the same moment produce an irreducible ambiguity. That is why speaker separation (two people talking at once) is harder than voice-from-music separation — the overlapping is more extensive.

U-Net Architectures

The most common architecture for separation is the U-Net, originally developed for biomedical image segmentation. It consists of:

An encoder path that progressively downsamples the input spectrogram through convolutional layers, extracting higher-level features at each stage
A bottleneck at the lowest spatial resolution where the model captures global structure
A decoder path that upsamples back to the original resolution, with skip connections from the encoder that preserve fine-grained detail

The U-Net is effective for separation because the encoder captures “what” is in the audio (which sources, which instruments), and the decoder maps that back to “where” in the time-frequency grid each source belongs. The skip connections ensure that high-frequency transients — like the attack of a snare drum or a plosive in speech — are not lost during downsampling.

Spleeter uses a vanilla U-Net with 12 layers. Demucs v4 uses a hybrid U-Net where some layers operate on the waveform directly while others operate on the spectrogram.

Hybrid Transformer Models

HT-Demucs adds a transformer encoder on top of the convolutional U-Net. The transformer processes the bottleneck features with multi-headed self-attention, which lets the model reason about long-range dependencies in the audio — patterns that span multiple seconds rather than local milliseconds.

Convolutional layers have a fixed receptive field (e.g., 200ms). If a guitar riff repeats every 2 seconds and a vocal phrase is 3 seconds long, the conv layers see only local slices. The transformer, using positional embeddings and attention across the full time dimension, can correlate a guitar recurrence in bar 4 with the same pattern in bar 8. This global reasoning improves separation quality significantly on music, where structural repetition is the norm.
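The receptive field of a stack of strided convolutions can be computed directly. The layer configuration below (four layers, kernel 8, stride 4) is a hypothetical example chosen for illustration, not a statement of any model’s exact settings.

```python
def receptive_field(layers):
    """layers: list of (kernel_size, stride) for stacked 1-D convolutions.
    Returns how many input samples influence a single output step."""
    rf, jump = 1, 1
    for kernel_size, stride in layers:
        rf += (kernel_size - 1) * jump  # each layer widens the view
        jump *= stride                  # spacing between neighbouring outputs
    return rf

# Hypothetical encoder: four layers, kernel size 8, stride 4.
encoder = [(8, 4)] * 4
rf_samples = receptive_field(encoder)
rf_ms = 1000.0 * rf_samples / 44100  # assuming 44.1 kHz audio

print(f"{rf_samples} samples, about {rf_ms:.1f} ms")
```

Even with aggressive striding, the view of a single output step stays far shorter than a musical phrase, which is the gap attention is meant to close.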

The tradeoff is computational cost. Transformer self-attention scales quadratically with sequence length. HT-Demucs handles this by operating on a downsampled feature map at the bottleneck rather than the full spectrogram, keeping the sequence length manageable.
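The quadratic cost is easy to see with numbers. The sketch below counts attention scores per head for one layer, using assumed figures: an 8-second chunk at 44.1 kHz, one sequence step per 1024 samples at full resolution, and a further 8x downsampling at the bottleneck. None of these are taken from the HT-Demucs configuration.

```python
def attention_cost(duration_s, sample_rate, samples_per_step):
    """Return (sequence length, number of pairwise attention scores per head)."""
    steps = int(duration_s * sample_rate) // samples_per_step
    return steps, steps * steps

# Assumed figures for illustration only.
full_resolution = attention_cost(8, 44100, 1024)
bottleneck = attention_cost(8, 44100, 1024 * 8)

print("full resolution:", full_resolution)
print("bottleneck:", bottleneck)
print("score reduction:", full_resolution[1] // bottleneck[1])
```

Shrinking the sequence by a factor of 8 cuts the number of attention scores by a factor of 64, which is why attention is applied to the compressed features rather than to the raw spectrogram.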

Perceptual and Adversarial Losses

A model trained only to minimize spectrogram reconstruction error tends to produce blurry outputs — it averages over possible clean versions rather than picking one. To get sharp, natural-sounding separations, modern models use additional loss terms:

Multi-resolution STFT loss — computes reconstruction error at multiple FFT window sizes, forcing the model to be accurate at both fine and coarse time scales
Adversarial loss — a discriminator network trained to distinguish model outputs from real clean recordings. The separator learns to fool the discriminator, which produces more realistic (and less artifact-ridden) results
SDR-based losses — directly optimizing the signal-to-distortion ratio, the standard evaluation metric for separation

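Not every model uses all three. Going by the published training recipes, Demucs v4 relies mainly on a reconstruction loss computed on the output waveforms, and Spleeter on an L1 loss between spectrograms. Much of the audible gap between them comes from the hybrid architecture, the transformer layers, and the additional training data rather than from the loss alone.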
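As a rough illustration of the first loss term in the list above, here is a simplified multi-resolution STFT loss in plain Python. It uses a naive DFT, so it is only suitable for tiny inputs, and it keeps just the mean absolute magnitude error; published variants usually add further terms. The FFT sizes are arbitrary small values, and this is not the training code of any specific model.

```python
import cmath
import math

def stft_magnitude(signal, n_fft, hop):
    """Magnitude STFT with a Hann window, computed with a naive DFT."""
    window = [0.5 - 0.5 * math.cos(2 * math.pi * n / n_fft) for n in range(n_fft)]
    frames = []
    for start in range(0, len(signal) - n_fft + 1, hop):
        frame = [signal[start + n] * window[n] for n in range(n_fft)]
        bins = []
        for k in range(n_fft // 2 + 1):
            acc = sum(frame[n] * cmath.exp(-2j * math.pi * k * n / n_fft)
                      for n in range(n_fft))
            bins.append(abs(acc))
        frames.append(bins)
    return frames

def multi_resolution_stft_loss(estimate, target, fft_sizes=(16, 32, 64)):
    """Mean absolute magnitude error, averaged over several FFT sizes."""
    total = 0.0
    for n_fft in fft_sizes:
        est = stft_magnitude(estimate, n_fft, n_fft // 4)
        ref = stft_magnitude(target, n_fft, n_fft // 4)
        diffs = [abs(e - r) for e_row, r_row in zip(est, ref)
                 for e, r in zip(e_row, r_row)]
        total += sum(diffs) / len(diffs)
    return total / len(fft_sizes)

# Synthetic test tone and a slightly quieter "estimate" of it.
target = [math.sin(2 * math.pi * 5 * n / 128) for n in range(128)]
estimate = [0.8 * v for v in target]

print(multi_resolution_stft_loss(estimate, target))
```

Because every resolution contributes to the total, an estimate that only looks right at one window size is still penalized at the others.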

Running Demucs Locally on Mac

Setting up Demucs on macOS is straightforward, especially if you are already using Python for audio work. Here is the full workflow.

Installation

pip install torch torchaudio
pip install demucs

The stable PyTorch wheels for macOS on Apple Silicon already include the MPS (Metal Performance Shaders) backend, so a nightly build is not required. Nightly builds may carry newer MPS kernels, but they may also be incompatible with the torchaudio version Demucs expects.

Checking MPS Support

python -c "import torch; print(torch.backends.mps.is_available())"

If this prints True, the MPS backend is usable. Demucs picks CUDA when it is available and otherwise falls back to CPU, so pass -d mps on the command line to run on the Apple GPU.
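Because device defaults can differ between versions, it can be safer to name the device explicitly. The helper below is a small sketch: pick_device and demucs_command are hypothetical names introduced here, and the only Demucs option assumed is the -d device flag.

```python
def pick_device():
    """Return "mps", "cuda", or "cpu" depending on what local PyTorch supports."""
    try:
        import torch
    except ImportError:
        return "cpu"  # PyTorch is not installed, so there is nothing to accelerate
    mps = getattr(torch.backends, "mps", None)
    if mps is not None and mps.is_available():
        return "mps"
    if torch.cuda.is_available():
        return "cuda"
    return "cpu"

def demucs_command(track, device):
    """Build a Demucs command line that names the device explicitly."""
    return ["demucs", "-d", device, "--two-stems=vocals", track]

print(" ".join(demucs_command("input_file.mp3", pick_device())))
```

The resulting list can be passed to subprocess.run, or the printed line can be pasted into a terminal.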

Separating a File

demucs --two-stems=vocals input_file.mp3

This runs the default htdemucs model and extracts vocals and accompaniment into separate files. The output lands in ./separated/htdemucs/input_file/.
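For scripted pipelines it helps to know where the files will land. The sketch below assumes Demucs’s default layout and WAV output, with the two-stem files named vocals.wav and no_vocals.wav; expected_outputs is a hypothetical helper, and the paths will differ if you change the output directory or filename options.

```python
from pathlib import Path

def expected_outputs(track, model="htdemucs", stem="vocals",
                     out_dir="separated", ext="wav"):
    """Return the (stem, complement) paths a two-stem run is expected to write."""
    folder = Path(out_dir) / model / Path(track).stem
    return folder / f"{stem}.{ext}", folder / f"no_{stem}.{ext}"

vocals_path, backing_path = expected_outputs("input_file.mp3")
print(vocals_path)
print(backing_path)
```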

For best quality:

demucs --two-stems=vocals --shifts=5 --overlap=0.25 --clip-mode=rescale input_file.wav

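The --shifts flag enables the “shift trick”: Demucs runs several predictions on randomly time-shifted copies of the input and averages them. This reduces artifacts at the cost of roughly 5x processing time with --shifts=5. --overlap controls how much consecutive prediction windows (the chunks the audio is split into) overlap before they are blended; 0.25 is the default, and higher values reduce boundary artifacts but take longer.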
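The idea behind --shifts can be sketched in a few lines. This mirrors the concept only, not the Demucs implementation: shift_trick is a hypothetical function, and toy_model is a stand-in that simply halves its input so the example runs without any audio libraries.

```python
import random

def shift_trick(model, mix, shifts=5, max_shift=4, seed=0):
    """Average predictions over randomly delayed copies of the input."""
    rng = random.Random(seed)
    total = [0.0] * len(mix)
    for _ in range(shifts):
        offset = rng.randint(0, max_shift)
        # Delay the input by `offset` samples using zero padding.
        padded = [0.0] * offset + list(mix)
        estimate = model(padded)
        # Undo the delay so every estimate lines up with the original input.
        aligned = estimate[offset:offset + len(mix)]
        total = [t + a for t, a in zip(total, aligned)]
    return [t / shifts for t in total]

def toy_model(signal):
    """Stand-in for a separator: returns the input at half amplitude."""
    return [0.5 * v for v in signal]

print(shift_trick(toy_model, [1.0, 2.0, -1.0, 0.5]))
```

With a real separator, each shifted pass makes slightly different errors, and averaging the aligned outputs smooths those errors out. The cost is one full model pass per shift.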

MPS Performance on Mac Hardware

MPS acceleration can work well, but exact timing varies by chip, memory, model variant, PyTorch version, and settings such as --shifts. Very long audio files may trigger memory pressure on lower-memory machines. The workaround is to split the file into segments before separation. Certain torch operations may fall back to CPU if no MPS kernel exists, causing a performance penalty on some model variants.
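If you split a long file yourself, overlapping the segments slightly leaves room to crossfade the seams afterwards. The helper below is a hypothetical sketch that only computes sample ranges; cutting and rejoining the audio is left to your tool of choice.

```python
def segment_bounds(total_samples, segment_samples, overlap=0.25):
    """Return (start, end) sample ranges for overlapping segments covering a file."""
    if not 0 <= overlap < 1:
        raise ValueError("overlap must be in [0, 1)")
    stride = max(1, int(segment_samples * (1 - overlap)))
    bounds = []
    start = 0
    while start < total_samples:
        end = min(start + segment_samples, total_samples)
        bounds.append((start, end))
        if end == total_samples:
            break
        start += stride
    return bounds

# Example: a 2-hour file at 44.1 kHz cut into 10-minute segments.
sample_rate = 44100
ranges = segment_bounds(2 * 60 * 60 * sample_rate, 10 * 60 * sample_rate)
print(len(ranges), "segments")
```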

Using the 6-Stem Model

demucs --name=htdemucs_6s input_file.wav

This splits into vocals, drums, bass, piano, guitar, and other. Each source gets its own WAV file. The 6-stem model takes somewhat longer to run, and the Demucs documentation notes that the piano stem in particular is prone to bleed and artifacts, but it is useful if you need isolated piano or guitar tracks for remixing.

Batch Processing

for f in *.wav; do
  demucs --two-stems=vocals "$f"
done

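For large batches, you can call Demucs from Python. The demucs.separate.main() function accepts the same arguments as the command line, and passing every file in one call loads the model only once:

import demucs.separate

files = ["recording1.wav", "recording2.wav", "recording3.wav"]

# Same options as the CLI: vocals/no_vocals split, htdemucs model, Apple GPU.
# Use "-d", "cpu" instead if MPS is unavailable on your machine.
demucs.separate.main(["--two-stems", "vocals", "-n", "htdemucs", "-d", "mps", *files])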

Real-Time vs Offline Processing

A critical distinction in source separation tools: real-time versus offline.

Offline Processing

Demucs and Spleeter are offline models. They work on recorded files rather than live streams, so when the model estimates the vocal at a given moment it can use audio from both before and after that moment. Demucs processes the file in overlapping chunks several seconds long, so a vocal at timestamp 10s is estimated with the following seconds already in view. That forward information helps it distinguish between a voice that is about to start singing (needs separation) and room rumble that stays constant (needs suppression).

Offline processing is the right choice for:

Podcast post-production
Music mastering and remixing
Archival audio restoration
Batch processing of recorded material

Real-Time Processing

RNNoise and lightweight noise suppression models can operate in real time with limited lookahead. They cannot use much future context, which can make their separation quality lower than offline models, but latency can be low enough for live workflows.
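The latency gap follows from how much audio a model must collect before it can produce output. The figures below are assumptions for illustration: a 10 ms frame with 10 ms of lookahead at 48 kHz for a real-time suppressor, and an 8-second chunk at 44.1 kHz for an offline separator. Compute time comes on top of both.

```python
def algorithmic_latency_ms(frame_samples, lookahead_samples, sample_rate):
    """Minimum delay before output can start, ignoring compute time."""
    return 1000.0 * (frame_samples + lookahead_samples) / sample_rate

# Assumed figures, for illustration only.
realtime_ms = algorithmic_latency_ms(480, 480, 48000)
offline_chunk_ms = algorithmic_latency_ms(8 * 44100, 0, 44100)

print(f"real-time suppressor: {realtime_ms:.0f} ms")
print(f"chunk-based separator: {offline_chunk_ms:.0f} ms")
```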

Real-time separation is useful for:

Live streaming and broadcasting
Video conferencing
Voice chat in games
In-ear monitoring for recording studios
Live transcription with noisy feeds

The gap between real-time and offline quality is narrowing, but offline models still tend to be stronger on challenging material with heavy overlap.

Practical Applications for Mac Creators

Podcast Editing: Removing Background Noise

The most common use case for source separation on a Mac is cleaning up podcast recordings. A typical scenario: you recorded an interview over Zoom, the guest had a laptop fan running, and there is a persistent HVAC hum in the background.

The workflow:

Run Demucs with --two-stems=vocals to extract the voice track
The second stem (no_vocals.wav) contains everything that is not voice
In your DAW (Logic Pro, Reaper, or Audacity), mix the clean vocal stem with a low level of the original to preserve natural room tone
Apply a high-pass filter at 80Hz to remove any residual rumble

The result can be a cleaner voice track without the hollow “underwater” sound that aggressive noise gating sometimes produces.
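The last two steps of this workflow can be sketched on plain lists of samples. The -18 dB blend level is an arbitrary starting point, the filter is a basic first-order design rather than what a DAW plugin would use, and the sketch assumes the clean stem and the original are the same length and time-aligned.

```python
import math

def db_to_gain(db):
    """Convert a level in decibels to a linear amplitude factor."""
    return 10 ** (db / 20)

def blend(clean, original, original_db=-18.0):
    """Add a quiet copy of the original under the clean stem for room tone."""
    gain = db_to_gain(original_db)
    return [c + gain * o for c, o in zip(clean, original)]

def high_pass(signal, cutoff_hz=80.0, sample_rate=44100):
    """First-order (one-pole) high-pass filter to reduce low-frequency rumble."""
    rc = 1.0 / (2 * math.pi * cutoff_hz)
    dt = 1.0 / sample_rate
    alpha = rc / (rc + dt)
    out = []
    prev_in = prev_out = 0.0
    for x in signal:
        y = alpha * (prev_out + x - prev_in)
        out.append(y)
        prev_in, prev_out = x, y
    return out

# Synthetic example: a constant offset stands in for very low-frequency rumble.
rumble = [1.0] * 4410
print(abs(high_pass(rumble)[-1]))
```

In practice a DAW’s EQ will use a steeper filter, but the principle is the same: content well below the cutoff decays away while the voice band passes through.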

For extreme cases — a recording with construction noise, wind, or traffic — run the file through twice: first Demucs for broadband separation, then RNNoise as a second pass to catch any remaining stationary noise. The two models catch different types of artifacts.

Extracting Vocals for Remixes and Samples

Music producers on Mac frequently need isolated vocals from commercial tracks. Demucs handles this well, but there are important caveats:

Bitrate matters. A 320kbps MP3 will separate noticeably cleaner than a 128kbps one. The compression artifacts in low-bitrate audio confuse the model and produce warbly-sounding vocal tracks.
Reverb leaks into the instrumental stem. HT-Demucs often leaves the reverb tail of a vocal in the accompaniment stem. You can recover some of it by mixing a small amount of the accompaniment stem back into the vocal track.
Doubled vocals. If a recording has layered vocal takes (common in pop production), the model will output them as a single stem. There is no built-in way to separate individual takes from the same singer.

Despite these limitations, Demucs vocal extraction can be good enough for remixing, sampling, and karaoke track creation, depending on the source material and licensing rights. The 6-stem variant adds piano and guitar isolation, which is useful for extracting instrumental parts for legitimate remix or practice workflows.

Cleaning Interview Recordings

Multi-speaker separation is harder than voice-from-music separation. When two people are speaking — especially in the same frequency range — the model has less information to distinguish them.

Demucs does not have a built-in speaker separation model, but the “vocals” stem can serve as a combined speech track that removes the background. To work out who spoke when (speaker diarization), specialized tools like pyannote.audio or cloud-enhanced pipelines like Rev are more appropriate. Note that diarization labels speaker turns; it does not untangle two people talking at the same moment.

Combining source separation with diarization can improve results: run Demucs to clean the audio, then run pyannote to assign segments to speakers, and transcribe each speaker’s turn separately.

Restoring Archival Audio

Old recordings — cassette tapes, vinyl rips, field recordings from the 1970s — present a specific set of problems: tape hiss, wow and flutter, broadband crackle, and limited frequency range.

Demucs was trained on music, not archival noise. For straight noise removal, RNNoise or dedicated restoration plugins like iZotope RX are more effective. But Demucs can help by separating speech from music beds in old radio broadcasts or isolating voice from mixed-content archival footage.

The key insight: if the archival recording has distinct sources (announcer voice versus backing music), the model may separate them reasonably well. If it is a single source with environmental noise (a field recording of someone speaking with wind noise), Demucs may be less effective.

Quality Comparison: Demucs v4 vs Spleeter vs Cloud APIs

Objective Metrics

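The standard evaluation metric for source separation is SDR (Signal-to-Distortion Ratio), measured in dB. Higher is better. The figures below are indicative values for the MUSDB18 test set; published numbers differ with the evaluation protocol, the model checkpoint, and whether extra training data was used, so check each project’s own reported results before relying on them: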

Model	Vocals SDR	Drums SDR	Bass SDR	Other SDR	Average
HT-Demucs (v4)	8.2	7.6	6.8	5.1	6.9
Demucs v3	7.3	6.9	5.9	4.4	6.1
Spleeter 5-stem	5.9	5.1	4.2	3.3	4.6
Open-Unmix	4.8	4.5	3.8	2.9	4.0
Bandit (2025)	8.0	7.4	6.5	4.9	6.7

On many public benchmarks, HT-Demucs-style models are strong across stems. Results still vary by dataset, source material, and model version.
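For intuition about what an SDR figure measures, here is the basic definition in plain Python. Benchmark numbers are normally produced with dedicated evaluation toolkits that apply extra steps (such as per-segment scoring and aggregation), so this simplified version will not reproduce published values.

```python
import math

def sdr_db(reference, estimate, eps=1e-12):
    """Basic signal-to-distortion ratio in dB: reference energy over error energy."""
    signal_energy = sum(r * r for r in reference)
    error_energy = sum((r - e) ** 2 for r, e in zip(reference, estimate))
    return 10 * math.log10((signal_energy + eps) / (error_energy + eps))

# Synthetic example: the estimate is the reference at 90% amplitude.
reference = [1.0, -1.0, 1.0, -1.0]
estimate = [0.9 * r for r in reference]

print(f"{sdr_db(reference, estimate):.1f} dB")
```

Because the scale is logarithmic, a 1 dB difference between two models is a modest but real change in the ratio of wanted signal to error.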

Subjective Listening

Metrics tell part of the story but do not capture artifact type. Listeners consistently report:

Spleeter produces a “swirly” chorus effect on vocals, especially during instrumental-heavy sections. The voice sounds phasey and slightly detuned.
Demucs v3 eliminated the phase issues but introduces occasional “digital gurgling” on sustained vocal notes.
Demucs v4 (HT-Demucs) is largely free of both. The primary remaining artifact is a slight reduction in high-frequency air above 12kHz. The vocal stem sounds slightly darker than the original.
Cloud APIs (like LALAL.AI and Acapella Extractor) use proprietary models that often match or slightly exceed HT-Demucs on vocals but trail on drums and bass.

Demucs vs Cloud: When to Choose Which

Cloud APIs have two advantages: zero setup and occasional quality wins on specific material. LALAL.AI, for instance, uses a custom Phoenix model that handles heavily distorted electric guitars better than Demucs.

But cloud services have meaningful downsides:

Upload time for large files (a 60-minute podcast WAV is ~600MB)
Privacy concerns — your audio leaves your machine
Recurring or usage-based costs
No batch processing without API integration
Vendor lock-in if you depend on a specific model’s output characteristics

Demucs sidesteps many of these. Once installed, separation can run privately and offline. On many Apple Silicon Macs, local processing is fast enough to be practical.
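The upload-size figure above is straightforward to check. The sketch assumes uncompressed 16-bit stereo audio at 44.1 kHz, ignores the small WAV header, and uses a 20 Mbit/s uplink purely as an example.

```python
def wav_size_bytes(duration_s, sample_rate=44100, bit_depth=16, channels=2):
    """Approximate size of uncompressed PCM audio, ignoring the file header."""
    return duration_s * sample_rate * (bit_depth // 8) * channels

def upload_seconds(size_bytes, uplink_mbit_per_s):
    """Ideal transfer time at a given uplink speed, ignoring protocol overhead."""
    return size_bytes * 8 / (uplink_mbit_per_s * 1_000_000)

size = wav_size_bytes(60 * 60)  # a 60-minute recording
print(f"{size / 1_000_000:.0f} MB")
print(f"{upload_seconds(size, 20) / 60:.1f} minutes at 20 Mbit/s")
```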

GUI Apps for Source Separation on Mac

Not everyone wants to use the command line. Several Mac applications wrap Demucs and other models in a visual interface:

App	Engine	Price	Notes
STEVE AI	Demucs + custom	Free / paid tiers	Batch processing, cloud + local hybrid
LALAL.AI	Phoenix (proprietary)	Free / paid tiers	Cloud separation, Mac desktop app
Vocal Remover	MDX-Net	Free	Web-based, limited to 10 min files
Acapella Extractor	Demucs + MDX	Paid	Native Mac app, supports MPS
Audacity + RNNoise plugin	RNNoise	Free	Noise suppression only, not full separation
UVR (Ultimate Vocal Remover)	Demucs + MDX + VR	Free	Powerful GUI with many models
Logic Pro (built-in)	Apple stem separation	Paid	Built-in stem splitter

Ultimate Vocal Remover (UVR) deserves special mention. It is a free, open-source Python GUI that bundles many separation models and lets you chain them — run Demucs first, then apply MDX-Net as a second pass for artifact reduction.

CLI Tools for Power Users

If you prefer the terminal or need to integrate separation into a scripted pipeline:

Tool	Install	Notes
Demucs	`pip install demucs`	Best quality, MPS support
Spleeter	`pip install spleeter`	Faster, lower quality
Open-Unmix	`pip install openunmix`	Good baseline, easy to customize
FFmpeg (filter)	`brew install ffmpeg`	Built-in `pan` filter for center-channel extraction (basic, no AI)
Noise Suppression for Voice	Download from the project’s GitHub releases	Real-time RNNoise plugin (VST, LV2, LADSPA, AU)

FFmpeg’s pan filter is worth understanding because it highlights what AI models actually improve upon. It can only add or subtract channels, so it relies on the assumption that centered sounds (usually vocals) are identical in both channels. Summing the two channels gives a mono mix in which center-panned content is emphasized relative to side-panned content:

ffmpeg -i stereo_input.wav -af "pan=mono|c0=0.5*FL+0.5*FR" center_output.wav

The better-known “karaoke trick” is the reverse: subtracting one channel from the other (c0=0.5*FL-0.5*FR) cancels whatever is identical in both channels, which removes a center-panned vocal and leaves the sides. Both tricks work tolerably for pop music where vocals are panned center. They fail on any material where instruments also sit in the center, or where the recording uses stereo effects like reverb or delay on the vocal send. AI models handle those cases better by learning what a voice actually sounds like, rather than assuming a pan position.
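The arithmetic behind these channel tricks fits in a few lines. The sample values are made up: a vocal panned dead center (identical in both channels) and a guitar panned hard left.

```python
def mid_side(left, right):
    """Split a stereo signal into mid (sum) and side (difference) components."""
    mid = [(l + r) / 2 for l, r in zip(left, right)]
    side = [(l - r) / 2 for l, r in zip(left, right)]
    return mid, side

# Made-up samples: the vocal is in both channels, the guitar only in the left.
vocal = [0.5, -0.5, 0.25, 0.0]
guitar = [0.25, 0.25, -0.25, 0.5]
left = [v + g for v, g in zip(vocal, guitar)]
right = list(vocal)

mid, side = mid_side(left, right)
print("mid: ", mid)    # vocal at full level plus guitar at half level
print("side:", side)   # guitar at half level, vocal cancelled completely
```

The side signal removes the centered vocal perfectly, but the mid signal never isolates it: anything else that is not hard-panned stays in the mix, which is the limitation a learned model avoids.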

Putting It Together: A Mac Audio Production Pipeline

Here is how source separation fits into a complete Mac-based audio workflow that runs entirely offline:

Record or receive audio. Capture in Logic Pro, Audacity, or directly via a USB interface.
Separate stems with Demucs. Run demucs --two-stems=vocals to extract the voice track. Processing time depends on model, file length, and hardware.
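Apply secondary noise suppression. For persistent background noise, run RNNoise on the vocal stem with FFmpeg’s arnndn filter, which needs the path to an RNNoise model file (shown here as the placeholder rnnoise-model.rnnn):
```
ffmpeg -i vocal_stem.wav -af "arnndn=m=rnnoise-model.rnnn" clean_vocal.wav
```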
Generate replacement narration with Spokio when appropriate. If a section is too damaged to salvage and you have prepared replacement text, Spokio can generate English speech locally with Chatterbox Turbo.
Mix and master in your DAW. Use the RNNoise-cleaned vocal as the main track and blend in the original recording at a low level for natural room tone. Add compression, EQ, and limiting.

Voice isolation is a natural complement to TTS in audio production workflows. While Spokio generates English speech from text locally with voice cloning, batch export, and MP3/WAV/AIFF/M4A export, tools like Demucs clean up existing recordings.

Privacy Considerations

One reason Mac creators are adopting local separation tools is privacy. When you upload a recording to a cloud separation service, you are sending potentially sensitive audio — client meetings, unreleased music, private interviews — through someone else’s server. The terms of service often grant the service broad rights to process and store your audio.

Local separation with Demucs changes that calculus:

The audio can stay on your machine
No cloud account is required for local processing
No provider-side logging is involved in the local run
No cloud usage meter applies to the local run
You keep control of the local output files

This matters for podcasters interviewing whistleblowers, musicians sharing unreleased material with collaborators, and editors working on confidential corporate content. Local AI separation makes it possible to clean audio without compromising privacy.

The Future of Source Separation

Source separation quality has improved dramatically in the last five years — from barely usable to nearly transparent. The trajectory suggests:

Better separation on difficult material. Newer models should keep improving on overlap, reverb, and noisy recordings.
Lower-latency separation. Hardware acceleration and model distillation should keep reducing latency for live workflows.
More robust training data. Unusual acoustics — outdoor recordings, phone calls, archival material — remain challenging.
Broader DAW integration. More audio and video tools are likely to include built-in stem separation over time.

For Mac creators today, local tools are already useful for many professional and semi-professional workflows. Demucs can run well on Apple Silicon, GUI apps are mature enough for practical use, and local processing is worth comparing when privacy and repeatability matter.

Install Demucs, grab a noisy recording, and see what happens. The first time you hear a clean voice stem pulled out of a messy mix, the effortlessness of it will change how you think about audio production.