Published Jun 02, 2026

Mel-Spectrograms

Mel-spectrograms are the intermediate representation that many neural TTS models predict before generating audio. They are spectrograms whose frequency axis has been warped to the mel scale, which approximates human pitch perception.

What a Spectrogram Shows

A spectrogram is a visual representation of audio: time on the x-axis, frequency on the y-axis, and intensity as color. In the convention used here, dark regions mark low-energy (quiet) frequencies and bright regions mark high-energy (loud) ones.

A speech spectrogram reveals:

  • Formants: bright horizontal bands at the resonant frequencies of the vocal tract
  • Harmonics: evenly spaced horizontal lines at multiples of the fundamental frequency, showing the periodic structure of voiced sounds
  • Fricatives: noise-like patterns in high frequencies for “s,” “sh,” “f” sounds
  • Silence: low-energy (dark) regions between words and phrases
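To make the time, frequency, and intensity axes concrete, here is a minimal sketch of a magnitude spectrogram computed with a short-time Fourier transform. It uses only the Python standard library and a naive DFT for readability; the function name, frame length, hop, and sample rate are illustrative assumptions, and a real pipeline would use an FFT library instead.

```python
import cmath
import math

def magnitude_spectrogram(signal, frame_len=64, hop=32):
    """Return a list of frames; each frame holds one magnitude per frequency bin."""
    # A Hann window tapers each frame to reduce spectral leakage at its edges.
    window = [0.5 - 0.5 * math.cos(2 * math.pi * n / frame_len) for n in range(frame_len)]
    n_bins = frame_len // 2 + 1  # keep non-negative frequencies only
    frames = []
    for start in range(0, len(signal) - frame_len + 1, hop):
        chunk = [signal[start + n] * window[n] for n in range(frame_len)]
        # Naive DFT: acceptable for a small demo, far too slow for real audio.
        spectrum = [
            abs(sum(chunk[n] * cmath.exp(-2j * math.pi * k * n / frame_len)
                    for n in range(frame_len)))
            for k in range(n_bins)
        ]
        frames.append(spectrum)
    return frames

# Illustrative input: a pure 1000 Hz tone sampled at 8000 Hz.
SAMPLE_RATE = 8000
signal = [math.sin(2 * math.pi * 1000.0 * t / SAMPLE_RATE) for t in range(512)]

spec = magnitude_spectrogram(signal)
# Find the brightest frequency bin in the first frame and convert it to Hz.
peak_bin = max(range(len(spec[0])), key=lambda k: spec[0][k])
peak_hz = peak_bin * SAMPLE_RATE / 64
print(len(spec), "frames x", len(spec[0]), "bins; peak at", peak_hz, "Hz")
```

Each inner list is one column of the picture (a moment in time), and each value in it is the brightness of one frequency row. For the pure tone, the energy concentrates in the bin at 1000 Hz, which would appear as a single bright horizontal line.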

The Mel Scale

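The mel scale maps physical frequency (Hz) to perceived pitch. Humans discriminate small frequency differences more easily at low frequencies than at high ones. The mel scale reflects this by staying roughly linear below about 1 kHz and becoming roughly logarithmic above it, which compresses high frequencies. A common formula, with f in Hz, is: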

mel = 2595 * log10(1 + f / 700)
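As a small sketch, the formula and its algebraic inverse can be written as two helper functions. The names `hz_to_mel` and `mel_to_hz` are illustrative choices, not taken from any particular library.

```python
import math

def hz_to_mel(f_hz):
    """Convert frequency in Hz to mels using the formula above."""
    return 2595.0 * math.log10(1.0 + f_hz / 700.0)

def mel_to_hz(mel):
    """Invert the formula: solve mel = 2595 * log10(1 + f / 700) for f."""
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)

# The same 100 Hz step covers many more mels at low frequencies than at high ones.
low_step = hz_to_mel(200.0) - hz_to_mel(100.0)
high_step = hz_to_mel(8100.0) - hz_to_mel(8000.0)
print(round(low_step, 1), "mels vs", round(high_step, 1), "mels")

# With these constants, 1000 Hz maps to roughly 1000 mels.
print(round(hz_to_mel(1000.0)))
```

The comparison of the two 100 Hz steps shows the compression directly: the step near 8 kHz spans far fewer mels than the step near 100 Hz.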

By using mel-spaced frequency bins, the representation allocates more bins to low frequencies, where listeners resolve pitch differences most finely, and fewer to high frequencies. This makes it more compact for speech processing than a linear-frequency spectrogram.
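The sketch below shows what "mel-spaced" means in practice: band edges are placed at equal intervals on the mel scale and then converted back to Hz. The choice of 80 bands over 0 to 8000 Hz is an assumed example configuration, and the function names are illustrative. Returning `n_mels + 2` edges follows the usual triangular-filter construction, where each filter spans three consecutive edges.

```python
import math

def hz_to_mel(f_hz):
    return 2595.0 * math.log10(1.0 + f_hz / 700.0)

def mel_to_hz(mel):
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)

def mel_band_edges(n_mels, f_min, f_max):
    """Return n_mels + 2 edge frequencies in Hz, evenly spaced on the mel scale."""
    m_min, m_max = hz_to_mel(f_min), hz_to_mel(f_max)
    step = (m_max - m_min) / (n_mels + 1)
    return [mel_to_hz(m_min + i * step) for i in range(n_mels + 2)]

# Assumed example configuration: 80 bands covering 0 to 8000 Hz.
edges = mel_band_edges(n_mels=80, f_min=0.0, f_max=8000.0)

# Width in Hz between consecutive edges: narrow at the bottom, wide at the top.
widths = [hi - lo for lo, hi in zip(edges, edges[1:])]
print("lowest band step:", round(widths[0], 1), "Hz")
print("highest band step:", round(widths[-1], 1), "Hz")
```

Because the edges are uniform in mels, their spacing in Hz grows steadily with frequency, so low frequencies receive many narrow bands and high frequencies a few wide ones.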

Role in TTS

In a typical neural TTS pipeline, the acoustic model predicts mel-spectrograms from text. The vocoder then converts these mel-spectrograms into audio waveforms.

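The mel-spectrogram serves as a compact, information-rich intermediate representation that decouples the acoustic model from the vocoder. This modularity allows different vocoders to be swapped without retraining the acoustic model, provided each vocoder was trained on mel-spectrograms with the same configuration (sample rate, number of mel bands, hop size, and normalization).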
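The two-stage structure can be sketched as a pair of interfaces joined only by the mel-spectrogram. Everything here is hypothetical: the class names, method names, and placeholder outputs are invented for illustration and do not correspond to any real TTS library.

```python
from typing import List, Protocol

Mel = List[List[float]]  # frames x mel bands

class AcousticModel(Protocol):
    def text_to_mel(self, text: str) -> Mel: ...

class Vocoder(Protocol):
    def mel_to_audio(self, mel: Mel) -> List[float]: ...

class DummyAcousticModel:
    """Placeholder model: emits one silent 80-band frame per input character."""
    n_mels = 80

    def text_to_mel(self, text: str) -> Mel:
        return [[0.0] * self.n_mels for _ in text]

class DummyVocoderA:
    """Placeholder vocoder: emits 256 silent samples per mel frame."""
    hop_length = 256

    def mel_to_audio(self, mel: Mel) -> List[float]:
        return [0.0] * (len(mel) * self.hop_length)

class DummyVocoderB(DummyVocoderA):
    """A second placeholder standing in for a different vocoder architecture."""

def synthesize(text: str, acoustic_model: AcousticModel, vocoder: Vocoder) -> List[float]:
    mel = acoustic_model.text_to_mel(text)  # stage 1: text -> mel-spectrogram
    return vocoder.mel_to_audio(mel)        # stage 2: mel-spectrogram -> waveform

model = DummyAcousticModel()
audio_a = synthesize("hello", model, DummyVocoderA())
audio_b = synthesize("hello", model, DummyVocoderB())  # vocoder swapped, model untouched
print(len(audio_a), len(audio_b))
```

The acoustic model never sees which vocoder is in use. The only contract between the two stages is the shape and scaling of the mel-spectrogram.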

Resolution Tradeoffs

Higher-resolution mel-spectrograms (more frequency bins, shorter time steps) capture more detail but require more compute. Typical configurations use 80 to 128 mel bands with 12.5 ms time steps. For production voiceover, higher resolution preserves more vocal detail at the cost of slower generation.
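A rough way to see the cost is to count how many values the acoustic model must predict. The sketch below does this arithmetic for the configurations mentioned above; the helper name and the 10-second clip length are illustrative assumptions, and the count ignores model-specific costs.

```python
def mel_frame_budget(duration_s, hop_ms, n_mels):
    """Return (number of frames, total mel values) for a clip of the given length."""
    frames_per_second = 1000.0 / hop_ms
    n_frames = round(duration_s * frames_per_second)
    return n_frames, n_frames * n_mels

# A 10-second clip at a 12.5 ms time step (80 frames per second).
frames_80, values_80 = mel_frame_budget(10.0, 12.5, 80)
frames_128, values_128 = mel_frame_budget(10.0, 12.5, 128)

print(frames_80, values_80)    # 800 frames, 64000 values
print(frames_128, values_128)  # 800 frames, 102400 values
```

Moving from 80 to 128 bands raises the number of predicted values by 60 percent at the same time step, and halving the time step would double the frame count on top of that.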