Real-Time Factor (RTF) is the standard metric for measuring TTS generation speed. It is the ratio of the time taken to generate audio to the duration of the generated audio. An RTF below 1.0 means audio is generated faster than it plays back, so lower is better.
RTF = generation time (seconds) / audio duration (seconds)
Several factors affect RTF:
Model architecture — Autoregressive models typically have RTF between 0.5 and 2.0 depending on hardware. Non-autoregressive models achieve RTF below 0.1.
Model size — Models with more parameters are generally slower. An 82M-parameter model runs faster than a 1.7B-parameter model on the same hardware.
Hardware — The Apple Silicon Neural Engine (ANE), NVIDIA GPUs, and CPUs yield vastly different RTF for the same model. Local TTS on a Mac can use the ANE for accelerated inference.
Batch size — Generating multiple audio samples in a batch can improve throughput, but batch size is limited by available memory.
Audio length — Some models have constant overhead per generation regardless of output length, making short clips relatively more expensive.
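The effect of fixed per-generation overhead on short clips can be illustrated with a simple linear cost model. This is a sketch under stated assumptions, not a measured law: the function name, the 0.4 s overhead, and the 0.2 s of compute per second of audio are all hypothetical values chosen for illustration.

```python
def rtf_with_overhead(audio_seconds: float,
                      overhead_seconds: float,
                      per_second_cost: float) -> float:
    """RTF under an assumed linear cost model: fixed overhead plus a
    compute cost proportional to the length of the generated audio."""
    if audio_seconds <= 0:
        raise ValueError("audio_seconds must be positive")
    generation_seconds = overhead_seconds + per_second_cost * audio_seconds
    return generation_seconds / audio_seconds


# Hypothetical numbers: 0.4 s fixed overhead, 0.2 s compute per second of audio.
for duration in (1.0, 5.0, 30.0):
    rtf = rtf_with_overhead(duration, overhead_seconds=0.4, per_second_cost=0.2)
    print(f"{duration:5.1f} s clip -> RTF {rtf:.3f}")
```

Under these assumed numbers the 1-second clip has nearly three times the RTF of the 30-second clip, because the fixed overhead is spread over much less audio.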
RTF should be measured on representative hardware with representative content. A model that achieves RTF 0.5 on an M3 Max may achieve RTF 3.0 on an Intel Mac. Always measure on the target deployment hardware.
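A minimal measurement harness might look like the following sketch. It assumes a hypothetical `synthesize` callable that takes text and returns a sequence of mono audio samples; `fake_synthesize` is only a stand-in so the example runs without a real model, and the 16 kHz sample rate is likewise an assumption. Swap in the actual TTS call and representative texts when measuring on the target machine.

```python
import statistics
import time
from typing import Callable, List, Sequence


def measure_rtf(synthesize: Callable[[str], Sequence[float]],
                texts: Sequence[str],
                sample_rate: int,
                warmup_runs: int = 1) -> List[float]:
    """Return one RTF per text: wall-clock generation time / audio duration."""
    if not texts:
        raise ValueError("texts must not be empty")
    # Warm-up runs absorb one-time costs (model load, caches) that would
    # otherwise inflate the first measurement.
    for _ in range(warmup_runs):
        synthesize(texts[0])

    results = []
    for text in texts:
        start = time.perf_counter()
        samples = synthesize(text)
        elapsed = time.perf_counter() - start
        audio_seconds = len(samples) / sample_rate
        if audio_seconds == 0:
            raise ValueError(f"no audio produced for: {text!r}")
        results.append(elapsed / audio_seconds)
    return results


def fake_synthesize(text: str) -> List[float]:
    """Hypothetical stand-in for a TTS call: 0.1 s of silence per character
    at 16 kHz, with a small sleep to imitate compute time."""
    time.sleep(0.001 * len(text))
    return [0.0] * (1600 * len(text))


rtfs = measure_rtf(fake_synthesize,
                   ["Hello there.", "A longer, more representative sentence."],
                   sample_rate=16000)
print(f"median RTF: {statistics.median(rtfs):.3f}")
```

Reporting the median over several representative texts, rather than a single run, reduces the influence of outliers such as a slow first call.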
For local TTS on Mac, RTF determines whether you can generate audio interactively or must wait for batch processing. Apple Silicon Macs with Neural Engine acceleration typically achieve RTF below 0.5 for small to medium models, making real-time generation practical for most use cases.
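To make the interactive-versus-batch distinction concrete, the sketch below converts an RTF into the wait for a whole clip and compares it to a wait budget. The function names and the 2-second budget are hypothetical, and the model assumes the full clip is generated before playback starts.

```python
def full_clip_wait_seconds(rtf: float, audio_seconds: float) -> float:
    """Time to generate a whole clip, from the RTF definition:
    generation time = RTF * audio duration."""
    if rtf <= 0 or audio_seconds < 0:
        raise ValueError("rtf must be positive and audio_seconds non-negative")
    return rtf * audio_seconds


def suggest_mode(rtf: float, audio_seconds: float,
                 max_wait_seconds: float = 2.0) -> str:
    """Return 'interactive' if the clip fits an assumed wait budget, else 'batch'."""
    wait = full_clip_wait_seconds(rtf, audio_seconds)
    return "interactive" if wait <= max_wait_seconds else "batch"


# A 4-second clip at three RTF values.
for rtf in (0.1, 0.5, 3.0):
    wait = full_clip_wait_seconds(rtf, 4.0)
    print(f"RTF {rtf}: wait {wait:.1f} s -> {suggest_mode(rtf, 4.0)}")
```

With these assumed numbers, a 4-second clip at RTF 0.5 is ready in 2 seconds, while the same clip at RTF 3.0 takes 12 seconds and is better handled as batch work.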