Published Jun 02, 2026

Vocoder

A vocoder converts intermediate acoustic representations (typically mel-spectrograms) into raw audio waveforms. It is the final stage in a TTS pipeline and has a disproportionate impact on perceived naturalness.

The Problem It Solves

TTS models generate speech as a sequence of acoustic features: a representation of which frequencies are active at each moment. The vocoder turns that representation into sound waves that speakers or headphones can play. The quality of this conversion determines whether the output sounds clean or carries artifacts such as buzziness, metallic timbre, or warbling.
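
To give a sense of the scale of that conversion, here is a minimal sketch of the arithmetic. The sample rate, hop length, and number of mel bands are assumed values chosen for illustration, and the function names are hypothetical; real pipelines define their own settings.

```python
# Assumed values for illustration only; real pipelines set their own.
SAMPLE_RATE = 22_050   # audio samples per second
HOP_LENGTH = 256       # audio samples covered by one mel frame
N_MELS = 80            # frequency bands per frame


def samples_to_generate(n_frames: int, hop_length: int = HOP_LENGTH) -> int:
    """Number of waveform samples a vocoder emits for n_frames mel frames."""
    return n_frames * hop_length


def duration_seconds(n_frames: int, hop_length: int = HOP_LENGTH,
                     sample_rate: int = SAMPLE_RATE) -> float:
    """Playback length of the generated waveform."""
    return samples_to_generate(n_frames, hop_length) / sample_rate


n_frames = 200
inputs = n_frames * N_MELS                 # values in the mel-spectrogram
outputs = samples_to_generate(n_frames)    # values in the waveform
print(inputs, outputs, round(duration_seconds(n_frames), 2))
# prints: 16000 51200 2.32
```

Under these assumptions the vocoder receives 16,000 numbers and must produce 51,200, so it has to fill in fine detail that the mel-spectrogram does not contain. How well it does so is what shapes the artifacts described above.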

Types

Classic vocoders (WORLD, STRAIGHT) use signal-processing algorithms. They are fast and stable but produce a buzzy, synthetic quality, which makes them acceptable mainly for low-resource applications.
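
The signal-processing approach can be illustrated with a toy source-filter sketch: a periodic pulse train stands in for the voiced excitation, and a simple filter shapes it. This is not the WORLD or STRAIGHT algorithm; the function names and parameter values are hypothetical, and the one-pole filter is only a crude stand-in for a spectral envelope.

```python
def pulse_train(f0_hz, sample_rate, n_samples):
    """Excitation signal: one unit impulse per pitch period, zeros elsewhere."""
    period = round(sample_rate / f0_hz)
    return [1.0 if i % period == 0 else 0.0 for i in range(n_samples)]


def one_pole_lowpass(signal, alpha):
    """Crude stand-in for a spectral envelope filter (0 < alpha <= 1)."""
    out, prev = [], 0.0
    for x in signal:
        prev = alpha * x + (1.0 - alpha) * prev
        out.append(prev)
    return out


SAMPLE_RATE = 16_000  # assumed for illustration
excitation = pulse_train(f0_hz=100, sample_rate=SAMPLE_RATE, n_samples=1600)
voiced = one_pole_lowpass(excitation, alpha=0.2)
print(sum(excitation))  # pitch pulses in 0.1 s of audio
# prints: 10.0
```

A perfectly regular excitation like this is cheap to compute, and it is one commonly cited source of the buzzy quality in classic vocoders, since natural voicing is never this exactly periodic.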

Neural vocoders (WaveNet, HiFi-GAN, MelGAN) use neural networks to generate waveforms. They produce significantly more natural audio — cleaner consonants, smoother pitch transitions, and fewer artifacts. The tradeoff is higher computational cost.
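
Computational cost is often expressed as a real-time factor (RTF): synthesis time divided by the duration of the audio produced. The sketch below shows one way to measure it. The `dummy_vocode` function is a hypothetical placeholder that returns silence, not a real vocoder, and the hop length and sample rate are assumed values.

```python
import time


def real_time_factor(synthesis_seconds, audio_seconds):
    """RTF below 1.0 means audio is generated faster than it plays back."""
    if audio_seconds <= 0:
        raise ValueError("audio_seconds must be positive")
    return synthesis_seconds / audio_seconds


def timed_rtf(vocode, mel_frames, sample_rate):
    """Time a vocoder callable and return its RTF for this input."""
    start = time.perf_counter()
    waveform = vocode(mel_frames)
    elapsed = time.perf_counter() - start
    return real_time_factor(elapsed, len(waveform) / sample_rate)


def dummy_vocode(mel_frames, hop_length=256):
    """Hypothetical stand-in: emits silence of the right length."""
    return [0.0] * (len(mel_frames) * hop_length)


mel = [[0.0] * 80 for _ in range(10)]  # 10 frames of 80 mel bands (assumed)
print(real_time_factor(0.5, 2.0))      # 0.5 s to synthesize 2 s of audio
# prints: 0.25
rtf = timed_rtf(dummy_vocode, mel, sample_rate=22_050)
```

Swapping `dummy_vocode` for a real vocoder call would let the same helper compare candidates on the target hardware.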

Universal vocoders are trained on diverse data and can decode acoustic features from many different TTS models without retraining. Many modern TTS pipelines use a universal neural vocoder.

In Practice

Vocoder quality is often the difference between “clearly TTS” and “sounds almost human.” A strong acoustic model paired with a weak vocoder still sounds synthetic. A decent acoustic model paired with a strong vocoder can produce surprisingly natural output. For production voiceover work, the vocoder choice is as important as the model choice.
