Streaming TTS generates audio incrementally as the input text is being produced, rather than waiting for the complete text before starting synthesis. This is essential for real-time applications like voice assistants, live captioning, and interactive dialogue systems.
In a standard TTS pipeline, the model waits for the complete input text before generating any audio. For a 30-word sentence, this can mean a pause of roughly 1-2 seconds while the model processes the full input. In conversational contexts, a delay of that length feels unnatural.
Streaming TTS starts producing audio as soon as the first few words are available and plays it while later words are still being processed. The listener hears speech begin almost immediately, with little perceptible delay between text input and audio output.
Streaming TTS processes text in chunks, typically split at sentence or clause boundaries.
The challenge is balancing latency (how fast audio starts) against quality (chunk boundaries should not be audible).
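The chunking step can be sketched as a small generator. This is a minimal illustration rather than any particular engine's implementation: the name `chunk_stream`, the `BOUNDARY_CHARS` set, and the `min_words` parameter are all assumptions introduced here. Raising `min_words` trades a later start for fewer, longer chunks and therefore fewer seams.

```python
from typing import Iterable, Iterator

# Characters treated as sentence or clause boundaries (an assumption for this sketch).
BOUNDARY_CHARS = tuple(".!?;:,")


def chunk_stream(tokens: Iterable[str], min_words: int = 3) -> Iterator[str]:
    """Yield text chunks that end at a sentence or clause boundary.

    `tokens` is any iterable of words that arrive incrementally.
    `min_words` stops very short chunks from being emitted on their own.
    """
    buffer = []
    for token in tokens:
        buffer.append(token)
        # Emit only when the latest word closes a clause and the chunk is long enough.
        if token.endswith(BOUNDARY_CHARS) and len(buffer) >= min_words:
            yield " ".join(buffer)
            buffer = []
    if buffer:  # flush whatever is left when the input ends
        yield " ".join(buffer)


words = "Hello there, how are you today? I am fine.".split()
for chunk in chunk_stream(words):
    print(chunk)
```

With the default `min_words=3`, the comma after "there" is skipped because the buffer holds only two words at that point, so the first chunk runs to the question mark.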
Different applications have different latency tolerances:
| Application | Target Latency | Notes |
|---|---|---|
| Voice assistant | < 200ms | Must feel instantaneous |
| Live conversation | < 500ms | Natural turn-taking |
| Live captioning | < 1s | Matches subtitle timing |
| Streaming narration | < 2s | Acceptable for long-form |
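
To check a pipeline against these targets, one option is to time how long the first audio chunk takes to arrive. The sketch below assumes that "latency" means time to first audio and that the synthesizer exposes its output as an iterable of byte chunks. The names `time_to_first_audio_ms`, `meets_target`, and `TARGET_MS` are hypothetical, and the injectable `clock` exists only so the measurement can be tested deterministically.

```python
import time
from typing import Callable, Iterable, Optional

# Target latencies from the table above, in milliseconds.
TARGET_MS = {
    "voice assistant": 200,
    "live conversation": 500,
    "live captioning": 1000,
    "streaming narration": 2000,
}


def time_to_first_audio_ms(
    audio_chunks: Iterable[bytes],
    clock: Callable[[], float] = time.perf_counter,
) -> Optional[float]:
    """Return milliseconds until the first non-empty chunk arrives, or None."""
    start = clock()
    for chunk in audio_chunks:
        if chunk:  # ignore empty chunks; they carry no audio
            return (clock() - start) * 1000.0
    return None


def meets_target(application: str, latency_ms: float) -> bool:
    """Check a measured latency against the table's target (strictly below)."""
    return latency_ms < TARGET_MS[application]
```

In use, `audio_chunks` would be the generator returned by whatever streaming synthesizer is under test, measured from the moment the first text is submitted.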
Fixed window — Buffer N words or N seconds, then generate. Simple but may split sentences at awkward points, causing audible chunk boundaries.
Punctuation-based — Buffer until a sentence boundary is detected. Produces cleaner chunk boundaries but may introduce variable latency on long sentences.
Adaptive — Adjust the buffer size based on generation speed, network conditions, and content characteristics. Most complex but produces the best experience.
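The fixed-window and adaptive strategies can be contrasted in a few lines. This is a hedged sketch: `fixed_window` and `adapt_window` are hypothetical helpers, the adaptive rule uses only generation speed (expressed as a real-time factor) and ignores network conditions and content, and the thresholds are illustrative assumptions rather than tuned values.

```python
from typing import Iterable, Iterator


def fixed_window(words: Iterable[str], window: int) -> Iterator[str]:
    """Fixed-window buffering: emit a chunk every `window` words."""
    if window < 1:
        raise ValueError("window must be at least 1")
    buffer = []
    for word in words:
        buffer.append(word)
        if len(buffer) == window:
            yield " ".join(buffer)
            buffer = []
    if buffer:  # flush the final partial window
        yield " ".join(buffer)


def adapt_window(window: int, real_time_factor: float,
                 min_words: int = 3, max_words: int = 20) -> int:
    """Pick the next window size from the last chunk's real-time factor.

    real_time_factor = synthesis time / duration of the audio produced.
    The 0.8 and 0.4 thresholds are illustrative assumptions.
    """
    if real_time_factor > 0.8:    # synthesis barely keeps up: buffer more text
        window += 2
    elif real_time_factor < 0.4:  # plenty of headroom: favor lower latency
        window -= 1
    # Clamp so the window never collapses or grows without bound.
    return max(min_words, min(max_words, window))
```

For example, `list(fixed_window("a b c d e".split(), 2))` gives `["a b", "c d", "e"]`, which shows how a fixed window cuts text without regard to sentence structure.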
The output of each chunk must be stitched to the previous chunk without audible gaps or jumps.
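One commonly used technique is a short crossfade, in which the end of one chunk is blended into the start of the next. The sketch below is a minimal pure-Python version that assumes mono audio held as lists of float samples. The function name `crossfade` and the linear fade shape are choices made for illustration, and a production system would more likely operate on arrays.

```python
from typing import List


def crossfade(previous: List[float], current: List[float], overlap: int) -> List[float]:
    """Join two chunks of samples with a linear crossfade over `overlap` samples.

    The last `overlap` samples of `previous` are blended with the first
    `overlap` samples of `current`, so the result is `overlap` samples
    shorter than plain concatenation.
    """
    overlap = min(overlap, len(previous), len(current))
    if overlap <= 0:
        return previous + current
    head = previous[:-overlap]
    tail = previous[-overlap:]
    blended = []
    for i in range(overlap):
        fade_in = (i + 1) / (overlap + 1)  # weight of the new chunk rises linearly
        blended.append(tail[i] * (1.0 - fade_in) + current[i] * fade_in)
    return head + blended + current[overlap:]
```

Because the two weights always sum to 1, a signal that is constant across the seam passes through unchanged; only a mismatch between the chunks is smoothed.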
Streaming TTS is primarily relevant for interactive applications. For pre-recorded voiceover, audiobook, and batch generation workflows, standard non-streaming generation is simpler and generally produces better quality, since the model sees the full text at once and there are no chunk boundaries to hide. Most local TTS apps do not implement streaming because their use cases are batch-oriented. Cloud-based voice assistants and real-time captioning services are the primary deployment targets for streaming TTS.