Published Jun 02, 2026

Chunking

Chunking is the strategy of splitting long text into smaller segments before feeding it to a TTS model, then stitching the generated audio back together. It is necessary because most TTS models have practical input length limits.

Why It Is Hard

Naive chunking (splitting at a fixed character count) produces audible seams. Each chunk is generated in isolation, so prosody resets at every boundary. A phrase cut in two by a boundary sounds like two different takes, and pauses land wherever the split happens to fall.

Common Strategies

Sentence-boundary chunking splits only where a sentence ends, grouping consecutive sentences until the chunk approaches the model’s limit. This preserves natural speech rhythm because the model always starts and ends on a grammatically complete unit.
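A minimal sketch of the grouping step described above. The function name chunk_by_sentence, the max_chars default, and the regex-based sentence splitter are assumptions made for illustration; the regex will split incorrectly on abbreviations such as "Dr.", so a real pipeline would likely use a proper sentence tokenizer.

```python
import re


def chunk_by_sentence(text: str, max_chars: int = 400) -> list[str]:
    """Group whole sentences into chunks of at most max_chars characters.

    max_chars is a hypothetical stand-in for the model's input limit.
    A single sentence longer than max_chars is kept whole as its own chunk.
    """
    # Naive splitter: break after '.', '!' or '?' when followed by whitespace.
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

    chunks: list[str] = []
    current = ""
    for sentence in sentences:
        candidate = f"{current} {sentence}".strip()
        if current and len(candidate) > max_chars:
            # Adding this sentence would exceed the limit: close the chunk.
            chunks.append(current)
            current = sentence
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks
```

Measuring the limit in characters is a simplification. If the model's limit is expressed in tokens or phonemes, the length check would need to count those instead.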

Paragraph-boundary chunking splits at paragraph breaks. It works well for non-fiction, where paragraphs tend to be self-contained, and the natural pause between paragraphs masks the chunk boundary.
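As a rough illustration, paragraph splitting can be done by treating blank lines as breaks. The function name chunk_by_paragraph and the blank-line convention are assumptions; source text that marks paragraphs differently (indentation, HTML tags) would need a different rule.

```python
import re


def chunk_by_paragraph(text: str) -> list[str]:
    """Return one chunk per paragraph, where blank lines separate paragraphs."""
    # One or more blank (or whitespace-only) lines count as a paragraph break.
    paragraphs = re.split(r"\n\s*\n", text)
    # Collapse line wraps inside a paragraph into single spaces.
    return [" ".join(p.split()) for p in paragraphs if p.strip()]
```

This sketch does not enforce a length limit. A paragraph that exceeds the model's input limit would still have to be split further, for example at sentence boundaries.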

Sliding window with overlap generates overlapping chunks (each repeats the last 2-3 sentences of the previous one) and crossfades the overlapping region. It is the most computationally expensive strategy, because the overlapping sentences are synthesized twice, but it produces the smoothest output.
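The text side of the sliding window can be sketched as follows, assuming the input has already been split into sentences. The function name overlapping_windows and the window size of 8 sentences are hypothetical; only the 2-3 sentence overlap comes from the description above. Locating the matching region in the two generated audio clips before crossfading is a separate problem that this sketch does not address.

```python
def overlapping_windows(
    sentences: list[str], window: int = 8, overlap: int = 2
) -> list[list[str]]:
    """Split sentences into windows that share `overlap` sentences with the previous one."""
    if not 0 <= overlap < window:
        raise ValueError("overlap must be non-negative and smaller than window")
    if not sentences:
        return []

    step = window - overlap  # how far each new window advances
    windows: list[list[str]] = []
    start = 0
    while True:
        windows.append(sentences[start:start + window])
        if start + window >= len(sentences):
            break  # this window reached the end of the text
        start += step
    return windows
```

With window=4 and overlap=2, seven sentences yield three windows, and each window begins with the last two sentences of the one before it.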

Crossfade and Seam Repair

When stitching chunk A to chunk B, a short crossfade (10-20 ms, which is 240-480 samples at 24 kHz) smooths the boundary. Silence trimming removes the leading and trailing silence that models often insert, which would otherwise accumulate into noticeable dead air across hundreds of chunks.
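Both steps can be sketched in plain Python on lists of float samples in the range -1.0 to 1.0. The function names, the 0.01 amplitude threshold, and the linear fade shape are assumptions for illustration; a production pipeline would more likely use an array library, an energy-based silence detector, and possibly an equal-power fade.

```python
def trim_silence(samples: list[float], threshold: float = 0.01) -> list[float]:
    """Drop leading and trailing samples whose magnitude is below threshold."""
    loud = [i for i, s in enumerate(samples) if abs(s) >= threshold]
    if not loud:
        return []  # the whole clip is below the threshold
    return samples[loud[0]:loud[-1] + 1]


def crossfade(
    a: list[float], b: list[float], sample_rate: int = 24_000, fade_ms: float = 15.0
) -> list[float]:
    """Join a and b, linearly blending the tail of a into the head of b."""
    # Fade length in samples, capped by the length of either clip.
    n = min(int(sample_rate * fade_ms / 1000), len(a), len(b))
    if n == 0:
        return a + b

    blended = []
    for i in range(n):
        w = (i + 1) / (n + 1)  # weight of b rises across the fade region
        blended.append(a[len(a) - n + i] * (1 - w) + b[i] * w)
    return a[:len(a) - n] + blended + b[n:]
```

Because the fade region is shared by both clips, the joined audio is shorter than the two clips combined by the fade length. Trimming right up to the first loud sample can clip soft consonant onsets, so keeping a few milliseconds of padding is a reasonable refinement.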

In Practice

For short clips (under 30 seconds), chunking is irrelevant. For long-form content (audiobooks, training narration, podcast episodes), the chunking strategy determines whether the output sounds continuous or assembled. Sentence-boundary splitting with overlap is the recommended starting point.