Published Jun 02, 2026

Real-Time Factor (RTF)

Real-Time Factor is the standard metric for measuring TTS generation speed. It is the ratio of the time taken to generate audio to the duration of the generated audio.

RTF = generation time (seconds) / audio duration (seconds)
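
As a minimal sketch, the formula maps directly onto a small Python helper. The function name and the sample numbers are illustrative, not taken from any particular TTS library.

```python
def real_time_factor(generation_seconds: float, audio_seconds: float) -> float:
    """Return RTF = generation time / audio duration (both in seconds)."""
    if audio_seconds <= 0:
        raise ValueError("audio duration must be positive")
    return generation_seconds / audio_seconds


# Hypothetical run: 2.5 s of compute produced 10 s of audio.
print(real_time_factor(2.5, 10.0))  # 0.25
```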

What the Numbers Mean

  • RTF < 1 — Faster than real time. The model generates audio faster than it takes to play it. Suitable for real-time applications like voice assistants and live streaming.
  • RTF = 1 — Real-time speed. Generation keeps pace with playback.
  • RTF > 1 — Slower than real time. Audio takes longer to generate than to play back. Acceptable for batch processing where latency does not matter.
  • RTF < 0.1 — More than ten times faster than real time. Suitable for high-throughput batch generation of long content.
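
The ranges in this list can be expressed as a small classifier. This is only a sketch: the function name and the returned labels are illustrative, and a measured RTF will rarely be exactly 1.

```python
def describe_rtf(rtf: float) -> str:
    """Map an RTF value onto the ranges described in the list."""
    if rtf <= 0:
        raise ValueError("RTF must be positive")
    if rtf < 0.1:
        return "more than 10x faster than real time"
    if rtf < 1.0:
        return "faster than real time"
    if rtf == 1.0:
        return "real time"
    return "slower than real time"


for value in (0.05, 0.5, 1.0, 2.0):
    # 1 / RTF is the speed-up relative to playback.
    print(f"RTF {value}: {describe_rtf(value)} ({1 / value:.1f}x playback speed)")
```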

What Affects RTF

Model architecture — Autoregressive models generate output one step at a time and typically have RTF between 0.5 and 2.0, depending on hardware. Non-autoregressive models generate output in parallel and can achieve RTF below 0.1.

Model size — Models with more parameters are generally slower. For example, an 82M-parameter model typically runs faster than a 1.7B-parameter model on the same hardware.

Hardware — The same model can have vastly different RTF on the Apple Neural Engine (ANE), an NVIDIA GPU, and a CPU. Local TTS on Mac can benefit from the ANE for accelerated inference.

Batch size — Generating multiple audio samples in a batch can improve throughput, but batch size is limited by available memory.
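
One way to quantify the batching effect is an aggregate RTF: wall-clock time for the whole batch divided by the total audio produced. The sketch below uses made-up timings purely for illustration. Note that a lower aggregate RTF does not shorten the wait for any single clip in the batch.

```python
from typing import Sequence


def batch_rtf(wall_seconds: float, clip_durations: Sequence[float]) -> float:
    """Aggregate RTF for a batch: total wall time / total audio duration."""
    total_audio = sum(clip_durations)
    if total_audio <= 0:
        raise ValueError("batch must contain some audio")
    return wall_seconds / total_audio


# Hypothetical timings: one 10 s clip takes 4 s on its own,
# while a batch of four such clips takes 8 s in total.
print(batch_rtf(4.0, [10.0]))      # 0.4
print(batch_rtf(8.0, [10.0] * 4))  # 0.2
```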

Audio length — Some models have constant overhead per generation regardless of output length, making short clips relatively more expensive.
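
A simple way to see this effect is to model generation time as a fixed overhead plus a cost proportional to the audio length. The 0.3 s overhead and the 0.2 s of compute per second of audio below are assumed values for illustration, not measurements of any real model.

```python
def rtf_with_overhead(
    audio_seconds: float,
    overhead_seconds: float = 0.3,          # assumed fixed cost per generation
    seconds_per_audio_second: float = 0.2,  # assumed marginal cost
) -> float:
    """RTF under a 'fixed overhead + linear cost' timing model."""
    generation_seconds = overhead_seconds + seconds_per_audio_second * audio_seconds
    return generation_seconds / audio_seconds


for duration in (1.0, 5.0, 30.0):
    print(f"{duration:>4.1f} s clip: RTF {rtf_with_overhead(duration):.2f}")
# Under these assumptions the RTF falls from 0.50 (1 s clip) to 0.26 (5 s)
# and 0.21 (30 s), approaching the marginal cost of 0.2 as clips get longer.
```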

Measuring RTF

RTF should be measured on representative hardware with representative content. A model that achieves RTF 0.5 on an M3 Max may achieve RTF 3.0 on an Intel Mac. Always measure on the target deployment hardware.
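
A measurement harness might look like the following sketch. It assumes a hypothetical synthesize(text) callable that returns the samples and the sample rate. The fake_synthesize function is only a stand-in so the example runs, and would be replaced by a call into the actual TTS model. A warm-up run and the median of several timed runs are used here to reduce the influence of one-off costs such as model loading.

```python
import statistics
import time
from typing import Callable, Sequence, Tuple

Synthesizer = Callable[[str], Tuple[Sequence[float], int]]


def measure_rtf(synthesize: Synthesizer, text: str, runs: int = 5, warmup: int = 1) -> float:
    """Return the median RTF of `synthesize` over `runs` timed calls."""
    for _ in range(warmup):
        synthesize(text)  # untimed, absorbs one-off start-up cost

    rtfs = []
    for _ in range(runs):
        start = time.perf_counter()
        samples, sample_rate = synthesize(text)
        elapsed = time.perf_counter() - start
        audio_seconds = len(samples) / sample_rate
        rtfs.append(elapsed / audio_seconds)
    return statistics.median(rtfs)


def fake_synthesize(text: str) -> Tuple[Sequence[float], int]:
    """Placeholder: pretends each character yields 0.05 s of 24 kHz audio."""
    sample_rate = 24_000
    time.sleep(0.01)  # simulated compute time
    return [0.0] * int(len(text) * 0.05 * sample_rate), sample_rate


print(f"median RTF: {measure_rtf(fake_synthesize, 'Hello from a placeholder synthesizer.'):.3f}")
```

The text passed in should resemble production input in length and content, since both affect the result.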

In Practice

For local TTS on Mac, RTF determines whether you can generate audio interactively or must wait for batch processing. Apple Silicon Macs with Neural Engine acceleration typically achieve RTF below 0.5 for small to medium models, making real-time generation practical for most use cases.