Published Jun 02, 2026

Text Tokenization

Before a TTS model can process text, the raw string must be converted into a sequence of tokens — discrete units that the model can ingest. Tokenization strategy has a direct impact on pronunciation accuracy, language coverage, and model size.

Types of Tokenization

Character Tokenization

Each character becomes a token: “Hello” → [H, e, l, l, o].

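Pros: Small vocabulary (for English, the 26 letters plus case variants, digits, and punctuation), simple to implement, and no language-specific preprocessing. Any script can be covered by adding its characters, though the vocabulary grows considerably for scripts with large character sets such as Chinese.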

Cons: No built-in pronunciation knowledge — the model must learn spelling-to-sound rules from scratch. Long sequences (a 100-word paragraph becomes hundreds of tokens).
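As a rough illustration of the idea above, here is a minimal character tokenizer sketch. The function names and the "<unk>" fallback token are assumptions made for this example, not part of any particular TTS library.

```python
def build_char_vocab(texts):
    """Map every distinct character in the corpus to an integer id."""
    chars = sorted({ch for text in texts for ch in text})
    # Reserve id 0 for characters that were never seen while building the vocabulary.
    vocab = {"<unk>": 0}
    for ch in chars:
        vocab[ch] = len(vocab)
    return vocab


def char_tokenize(text, vocab):
    """Turn a string into a list of ids, one id per character."""
    return [vocab.get(ch, vocab["<unk>"]) for ch in text]


vocab = build_char_vocab(["Hello", "world"])
ids = char_tokenize("Hello", vocab)
print(ids)  # one id per character, so five ids for "Hello"
```

Note that the sequence length always equals the character count, which is where the long-sequence cost comes from.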

Phoneme Tokenization

Each phoneme becomes a token. The text is first converted to phonemes via G2P, then the phoneme sequence is tokenized: “Hello” → [h, ə, l, oʊ].

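Pros: Pronunciation knowledge is baked into the tokenization, so the model does not need to learn spelling rules. Sequences are usually somewhat shorter than character sequences in languages like English, where several letters often map to a single sound.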

Cons: Requires a G2P system for each language. Errors in G2P propagate to the model. Different languages need different phoneme sets.
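The two-step pipeline described above (G2P first, then tokenization) can be sketched as follows. The tiny lexicon stands in for a real G2P system; its entries, the "<sp>" word-boundary symbol, and all function names are illustrative assumptions.

```python
# Toy pronunciation lexicon standing in for a real G2P system.
# The entries and the "<sp>" word-boundary symbol are illustrative assumptions.
TOY_LEXICON = {
    "hello": ["h", "ə", "l", "oʊ"],
    "world": ["w", "ɜː", "l", "d"],
}


def toy_g2p(text):
    """Convert text to a flat phoneme list using the toy lexicon."""
    phonemes = []
    for i, raw in enumerate(text.lower().split()):
        word = raw.strip(".,!?")
        if word not in TOY_LEXICON:
            # A real G2P system would fall back to learned rules here.
            raise KeyError(f"no pronunciation for {word!r}")
        if i > 0:
            phonemes.append("<sp>")  # mark the boundary between words
        phonemes.extend(TOY_LEXICON[word])
    return phonemes


def build_phoneme_vocab(lexicon):
    """Assign an integer id to every phoneme that appears in the lexicon."""
    symbols = sorted({p for pron in lexicon.values() for p in pron})
    vocab = {"<sp>": 0}
    for s in symbols:
        vocab[s] = len(vocab)
    return vocab


phoneme_vocab = build_phoneme_vocab(TOY_LEXICON)
phonemes = toy_g2p("Hello world")
phoneme_ids = [phoneme_vocab[p] for p in phonemes]
print(phonemes)
print(phoneme_ids)
```

In this toy case "Hello world" is 11 characters but 9 phoneme tokens, including the boundary symbol. The KeyError branch also shows where G2P coverage gaps surface: a word the G2P system cannot handle never reaches the model in usable form.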

Byte-Pair Encoding (BPE)

Subword tokenization learned from data. Common letter sequences merge into single tokens: “Hello” → [“Hel”, “lo”] or [“He”, “llo”] depending on the trained merges.

Pros: Balances vocabulary size and sequence length. No language-specific phoneme knowledge required. Handles rare words by falling back to subword units.

Cons: May produce arbitrary splits that confuse pronunciation. Language-dependent — merges useful for English may be suboptimal for Japanese.
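To make "merges learned from data" concrete, here is a minimal sketch of BPE training and tokenization on a tiny word list. It is a simplified version of the classic merge loop, with no end-of-word markers or byte-level fallback; the corpus and function names are assumptions chosen for illustration.

```python
from collections import Counter


def merge_pair(symbols, pair):
    """Replace every adjacent occurrence of `pair` with one merged symbol."""
    out, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            out.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            out.append(symbols[i])
            i += 1
    return out


def learn_bpe(words, num_merges):
    """Learn up to `num_merges` merges, most frequent adjacent pair first."""
    # Each word starts as a sequence of single characters.
    corpus = Counter(tuple(w) for w in words)
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for symbols, freq in corpus.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break  # every word is already a single symbol
        best = max(pairs, key=pairs.get)
        merges.append(best)
        merged = Counter()
        for symbols, freq in corpus.items():
            merged[tuple(merge_pair(list(symbols), best))] += freq
        corpus = merged
    return merges


def bpe_tokenize(word, merges):
    """Apply the learned merges, in training order, to a new word."""
    symbols = list(word)
    for pair in merges:
        symbols = merge_pair(symbols, pair)
    return symbols


corpus = ["hello", "help", "held", "yellow", "mellow"]
merges = learn_bpe(corpus, num_merges=4)
tokens = bpe_tokenize("hello", merges)
print(merges)
print(tokens)
```

The resulting split depends entirely on pair frequencies in the training text, not on pronunciation, which is why a subword boundary can fall in the middle of a sound.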

Impact on TTS Quality

Pronunciation accuracy: Phoneme tokenization generally produces the most reliable pronunciation, provided the G2P output is correct, because the model is given the sounds directly instead of having to infer them from spelling. BPE and character tokenization require the model to learn G2P implicitly, which works well for common words but can fail on rare terms.

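Language coverage: Character tokenization extends to any language whose characters are in the vocabulary, with no extra tooling. Phoneme tokenization requires a phoneme inventory and a G2P system per language. BPE requires enough training text per language to learn useful merges.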

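Sequence length: Character tokenization typically produces the longest sequences, increasing compute cost. Phoneme sequences are usually a little shorter. BPE sequences are usually the shortest, since a single token can span several characters. Shorter input sequences reduce generation time.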

In Practice

Many modern TTS models use either phoneme tokenization with a strong G2P backend or BPE tokenization trained on large multilingual data. For English-only TTS, phoneme tokenization typically gives better pronunciation. For multilingual models, BPE or mixed approaches are more practical because they avoid maintaining a G2P system for every language.