Mel-spectrograms are the intermediate representation that most neural TTS models produce on the way from text to speech. A mel-spectrogram is a spectrogram whose frequency axis is warped to the mel scale, which approximates human pitch perception.
A spectrogram is a visual representation of audio: time on the x-axis, frequency on the y-axis, and intensity as color. Dark regions show quiet frequencies; bright regions show loud frequencies.
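To make the time/frequency/intensity layout concrete, here is a minimal sketch that computes a magnitude spectrogram for a synthetic 1 kHz tone. It uses a naive DFT from the standard library rather than the FFT that real pipelines rely on, and the frame length, hop length, and sample rate are illustrative assumptions, not values from this article.

```python
import cmath
import math


def magnitude_spectrogram(samples, frame_length=64, hop_length=32):
    """Return a list of frames (time axis); each frame is a list of
    magnitudes for frequency bins 0..frame_length // 2."""
    # Hann window to reduce spectral leakage at the frame edges.
    window = [0.5 - 0.5 * math.cos(2 * math.pi * n / frame_length)
              for n in range(frame_length)]
    frames = []
    for start in range(0, len(samples) - frame_length + 1, hop_length):
        frame = [samples[start + n] * window[n] for n in range(frame_length)]
        magnitudes = []
        for k in range(frame_length // 2 + 1):
            # Naive DFT of one bin; an FFT gives the same values faster.
            acc = sum(frame[n] * cmath.exp(-2j * math.pi * k * n / frame_length)
                      for n in range(frame_length))
            magnitudes.append(abs(acc))
        frames.append(magnitudes)
    return frames


# Assumed test signal: a pure 1 kHz sine sampled at 8 kHz.
sample_rate = 8000
tone_hz = 1000.0
samples = [math.sin(2 * math.pi * tone_hz * t / sample_rate) for t in range(256)]

spec = magnitude_spectrogram(samples)
# The brightest bin in the first frame, converted back to Hz.
peak_bin = max(range(len(spec[0])), key=lambda k: spec[0][k])
peak_hz = peak_bin * sample_rate / 64
```

Each inner list is one column of the image (a moment in time), and each value is the intensity that would be drawn as color. For this tone, the brightest bin sits at the tone's frequency.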
A speech spectrogram reveals:
The mel scale maps physical frequency (Hz) to perceived pitch. Humans distinguish small pitch differences more easily at low frequencies than at high frequencies. The mel scale compresses high frequencies to reflect this:
mel = 2595 * log10(1 + f / 700)

By using mel-spaced frequency bins, the representation allocates more detail to the frequencies where human hearing is most sensitive, making it more efficient for speech processing.
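The formula above can be sketched in a few lines of Python. The inverse mapping follows algebraically from the same formula. The helper name mel_spaced_frequencies and the 0 to 8000 Hz range are assumptions chosen for illustration.

```python
import math


def hz_to_mel(f_hz):
    """Convert frequency in Hz to mels using the formula above."""
    return 2595.0 * math.log10(1.0 + f_hz / 700.0)


def mel_to_hz(mel):
    """Inverse of hz_to_mel, obtained by solving the formula for f."""
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)


def mel_spaced_frequencies(n_points, f_min, f_max):
    """Return n_points frequencies (Hz) equally spaced on the mel scale."""
    mel_min = hz_to_mel(f_min)
    mel_max = hz_to_mel(f_max)
    step = (mel_max - mel_min) / (n_points - 1)
    return [mel_to_hz(mel_min + i * step) for i in range(n_points)]


# Assumed range for illustration: 0 Hz to 8000 Hz, 10 points.
freqs = mel_spaced_frequencies(10, 0.0, 8000.0)
low_gap = freqs[1] - freqs[0]     # spacing between the two lowest points
high_gap = freqs[-1] - freqs[-2]  # spacing between the two highest points
```

Points that are evenly spaced in mels are packed closely at low frequencies and spread widely at high frequencies, so low_gap is much smaller than high_gap. That is the sense in which mel-spaced bins give low frequencies more detail.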
In a typical neural TTS pipeline, the acoustic model predicts mel-spectrograms from text. The vocoder then converts these mel-spectrograms into audio waveforms.
The mel-spectrogram serves as a compact, information-rich intermediate representation that decouples the acoustic model from the vocoder. This modularity allows different vocoders to be swapped without retraining the acoustic model.
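The decoupling can be sketched as two interfaces that share only the mel-spectrogram type. Everything named here (AcousticModel, Vocoder, the Dummy classes, synthesize) is hypothetical and stands in for real models; the placeholders emit zeros so that the example stays runnable.

```python
from typing import List, Protocol

MelSpectrogram = List[List[float]]  # frames x mel bands
Waveform = List[float]              # audio samples


class AcousticModel(Protocol):
    def text_to_mel(self, text: str) -> MelSpectrogram: ...


class Vocoder(Protocol):
    def mel_to_audio(self, mel: MelSpectrogram) -> Waveform: ...


class DummyAcousticModel:
    """Placeholder: emits one all-zero mel frame per input character."""

    def __init__(self, n_mels: int = 80) -> None:
        self.n_mels = n_mels

    def text_to_mel(self, text: str) -> MelSpectrogram:
        return [[0.0] * self.n_mels for _ in text]


class DummyVocoder:
    """Placeholder: emits hop_length silent samples per mel frame."""

    def __init__(self, hop_length: int = 300) -> None:
        self.hop_length = hop_length

    def mel_to_audio(self, mel: MelSpectrogram) -> Waveform:
        return [0.0] * (len(mel) * self.hop_length)


def synthesize(text: str, acoustic_model: AcousticModel, vocoder: Vocoder) -> Waveform:
    """Run the two stages; the mel-spectrogram is the only shared contract."""
    mel = acoustic_model.text_to_mel(text)
    return vocoder.mel_to_audio(mel)


acoustic = DummyAcousticModel()
audio_a = synthesize("hello", acoustic, DummyVocoder(hop_length=300))
# Swapping the vocoder does not touch the acoustic model.
audio_b = synthesize("hello", acoustic, DummyVocoder(hop_length=256))
```

In practice the swap only works when both stages agree on the mel configuration (number of bands, hop length, frequency range, normalization), so the shared type above is a simplification.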
Higher-resolution mel-spectrograms (more frequency bins, shorter time steps) capture more detail but require more compute. Typical configurations use 80 to 128 mel bands with 12.5 ms time steps. For production voiceover, higher resolution preserves more vocal detail at the cost of slower generation.
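The size of the trade-off is easy to estimate. The sketch below counts how many values the acoustic model has to predict for a clip, using the 80 and 128 band figures and the 12.5 ms step from the text. The 24 kHz sample rate and the function name mel_dimensions are assumptions, and the exact frame count in a real system also depends on padding.

```python
import math


def mel_dimensions(duration_s, n_mels=80, hop_ms=12.5, sample_rate=24000):
    """Approximate mel-spectrogram size for a clip of duration_s seconds."""
    # Audio samples covered by one time step (assumed 24 kHz sample rate).
    hop_samples = round(sample_rate * hop_ms / 1000.0)
    # One frame per time step, ignoring padding at the clip edges.
    n_frames = math.ceil(duration_s * 1000.0 / hop_ms)
    return {
        "hop_samples": hop_samples,
        "n_frames": n_frames,
        "n_values": n_frames * n_mels,
    }


base = mel_dimensions(10.0)                # 80 bands
dense = mel_dimensions(10.0, n_mels=128)   # 128 bands
```

For a 10-second clip this gives 800 frames either way, but 64,000 values at 80 bands versus 102,400 at 128 bands, a 1.6x increase in what the model predicts and the vocoder consumes.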