A vocoder converts intermediate acoustic representations (typically mel-spectrograms) into raw audio waveforms. It is the final stage in a TTS pipeline and has a disproportionate impact on perceived naturalness.
TTS models generate speech as a sequence of acoustic features: a representation of which frequencies are active at each moment. The vocoder turns that representation into a waveform that speakers or headphones can play. The quality of this conversion determines whether the output sounds clean or carries artifacts such as buzziness, metallic timbre, or warbling.
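To make the features-to-waveform step concrete, here is a deliberately crude sketch of it in plain Python. It is not how any production vocoder works: the function name `toy_vocoder`, the three frequency bands, the 16 kHz sample rate, and the hop length of 160 samples are all assumptions chosen for illustration. Real systems typically take mel-spectrograms with far more bands and must also reconstruct phase.

```python
import math


def toy_vocoder(frames, band_freqs_hz, sample_rate=16000, hop_length=160):
    """Render per-frame band amplitudes as a sum of sinusoids.

    frames: list of frames, each a list of non-negative amplitudes,
            one amplitude per frequency band.
    band_freqs_hz: center frequency of each band (hypothetical values).
    Returns a list of float samples, hop_length samples per frame.
    """
    n_samples = len(frames) * hop_length
    waveform = [0.0] * n_samples
    for n in range(n_samples):
        frame_idx = n // hop_length  # which feature frame covers this sample
        t = n / sample_rate          # time of this sample in seconds
        sample = 0.0
        for amp, freq in zip(frames[frame_idx], band_freqs_hz):
            sample += amp * math.sin(2.0 * math.pi * freq * t)
        waveform[n] = sample
    return waveform


# 50 identical frames with energy in the two lowest bands (made-up data).
frames = [[0.5, 0.25, 0.0]] * 50
audio = toy_vocoder(frames, band_freqs_hz=[220.0, 440.0, 880.0])
print(len(audio))  # 50 frames * 160 samples per frame = 8000
```

The sketch holds each amplitude constant for a whole frame, so the output jumps abruptly whenever neighboring frames differ. Abrupt changes of that kind are one source of audible artifacts, and a real vocoder works to avoid them.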
Classic vocoders (WORLD, STRAIGHT) use signal-processing algorithms. They are fast and stable but produce a buzzy, synthetic quality, which makes them acceptable mainly for low-resource applications.
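Neural vocoders (WaveNet, HiFi-GAN, MelGAN) use neural networks to generate waveforms. They produce significantly more natural audio: cleaner consonants, smoother pitch transitions, and fewer artifacts. The tradeoff is higher computational cost, which varies widely by architecture. Autoregressive models such as WaveNet generate audio one sample at a time, while GAN-based models such as HiFi-GAN and MelGAN synthesize samples in parallel and run much faster.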
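Computational cost is commonly summarized as a real-time factor (RTF): the time spent synthesizing divided by the duration of the audio produced. The sketch below shows only the arithmetic. The helper name `real_time_factor` and the timing figures are hypothetical, not measurements of any particular vocoder.

```python
def real_time_factor(synthesis_seconds, n_samples, sample_rate):
    """Return synthesis time divided by the duration of the generated audio.

    Values below 1.0 mean the vocoder runs faster than real time.
    """
    if n_samples <= 0 or sample_rate <= 0:
        raise ValueError("n_samples and sample_rate must be positive")
    audio_seconds = n_samples / sample_rate
    return synthesis_seconds / audio_seconds


# Hypothetical example: 0.5 s of compute to produce 2 s of 22.05 kHz audio.
rtf = real_time_factor(synthesis_seconds=0.5, n_samples=44100, sample_rate=22050)
print(rtf)  # 0.25
```

An RTF above 1.0 rules out live streaming on that hardware, though it can still be acceptable for offline rendering.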
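Universal vocoders are trained on diverse data and can decode acoustic features from many different TTS models without retraining, provided those features match the format the vocoder was trained on (for example, the same sample rate and number of mel bands). Many modern TTS pipelines use a universal neural vocoder.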
Vocoder quality is often the difference between “clearly TTS” and “sounds almost human.” A strong acoustic model paired with a weak vocoder still sounds synthetic. A decent acoustic model paired with a strong vocoder can produce surprisingly natural output. For production voiceover work, the vocoder choice is as important as the model choice.