G2P (grapheme-to-phoneme conversion) is the component that converts written text into a sequence of phonemes, the smallest units of sound that distinguish one word from another in a language. It is one of the first steps in a traditional TTS pipeline and determines how accurately words are pronounced.
English has deeply irregular spelling. “Through,” “though,” “tough,” and “thought” all use the same “ough” pattern but produce four different pronunciations. A G2P system must handle these exceptions alongside regular rules.
Beyond words, G2P handles:
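Rule-based G2P uses hand-written pronunciation rules. eSpeak NG is one of the most widely used open-source engines, supporting more than 100 languages and accents. It is fast and predictable, but hand-written rules miss irregular words, and eSpeak NG's own synthesized voice can sound robotic.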
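To make the rule-based idea concrete, here is a minimal sketch of a longest-match letter-to-sound converter. The rule table, the ARPAbet-style phoneme labels, and the function name rule_g2p are all illustrative assumptions; this is not how eSpeak NG is implemented, and the rules cover only a sliver of English.

```python
# Toy letter-to-sound rules (hypothetical, nowhere near full English coverage).
# Each grapheme maps to a list of ARPAbet-style phoneme labels.
RULES = {
    "tch": ["CH"], "sh": ["SH"], "ch": ["CH"], "th": ["TH"], "ck": ["K"],
    "ee": ["IY"], "oo": ["UW"],
    "a": ["AE"], "b": ["B"], "c": ["K"], "d": ["D"], "e": ["EH"],
    "f": ["F"], "g": ["G"], "h": ["HH"], "i": ["IH"], "j": ["JH"],
    "k": ["K"], "l": ["L"], "m": ["M"], "n": ["N"], "o": ["AA"],
    "p": ["P"], "q": ["K"], "r": ["R"], "s": ["S"], "t": ["T"],
    "u": ["AH"], "v": ["V"], "w": ["W"], "x": ["K", "S"], "y": ["Y"],
    "z": ["Z"],
}
MAX_LEN = max(len(grapheme) for grapheme in RULES)


def rule_g2p(word):
    """Convert a word to phonemes by greedy longest-match over RULES."""
    word = word.lower()
    phonemes = []
    i = 0
    while i < len(word):
        # Try the longest grapheme first so "sh" wins over "s" followed by "h".
        for size in range(min(MAX_LEN, len(word) - i), 0, -1):
            chunk = word[i:i + size]
            if chunk in RULES:
                phonemes.extend(RULES[chunk])
                i += size
                break
        else:
            # No rule matched this character (digit, punctuation): skip it.
            i += 1
    return phonemes


print(rule_g2p("sheep"))  # regular spelling: the rules work
print(rule_g2p("tough"))  # irregular spelling: the rules produce nonsense
```

The second call shows why pure rules struggle: with no rule for "ough", the sketch spells the word out letter by letter instead of producing the real pronunciation.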
Dictionary-based G2P looks up words in a pronunciation lexicon, such as the CMU Pronouncing Dictionary, which covers more than 134,000 English words, and falls back to rules for unknown words. It is more accurate than rules alone for common vocabulary.
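The lookup-then-fallback flow can be sketched as follows. The four-entry lexicon is a stand-in written in the ARPAbet style used by the CMU Pronouncing Dictionary, with stress digits left out for brevity; the one-sound-per-letter fallback and the names LEXICON, LETTER_SOUNDS, and dictionary_g2p are assumptions made for illustration, not part of any real toolkit.

```python
# Stand-in lexicon (ARPAbet-style labels, stress markers omitted).
# A real system would load a full pronunciation dictionary instead.
LEXICON = {
    "through": ["TH", "R", "UW"],
    "though": ["DH", "OW"],
    "tough": ["T", "AH", "F"],
    "thought": ["TH", "AO", "T"],
}

# Hypothetical fallback: one crude sound per letter.
LETTER_SOUNDS = {
    "a": "AE", "b": "B", "c": "K", "d": "D", "e": "EH", "f": "F",
    "g": "G", "h": "HH", "i": "IH", "j": "JH", "k": "K", "l": "L",
    "m": "M", "n": "N", "o": "AA", "p": "P", "q": "K", "r": "R",
    "s": "S", "t": "T", "u": "AH", "v": "V", "w": "W", "x": "K S",
    "y": "Y", "z": "Z",
}


def letter_fallback(word):
    """Spell out an unknown word letter by letter, skipping other characters."""
    phonemes = []
    for letter in word:
        if letter in LETTER_SOUNDS:
            phonemes.extend(LETTER_SOUNDS[letter].split())
    return phonemes


def dictionary_g2p(word):
    """Return (phonemes, source), where source is 'lexicon' or 'rules'."""
    key = word.lower()
    if key in LEXICON:
        return list(LEXICON[key]), "lexicon"
    return letter_fallback(key), "rules"


for w in ["through", "though", "tough", "thought", "zog"]:
    print(w, dictionary_g2p(w))
```

Returning the source alongside the phonemes is a design choice for this sketch: it makes it easy to log which words missed the lexicon and may need a manual entry.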
Neural G2P uses sequence-to-sequence models trained on pronunciation data. It is typically the most accurate approach for languages with complex or irregular spelling, but it requires training data and compute. Modern TTS models often bake G2P directly into the end-to-end network.
G2P quality directly affects whether a TTS model sounds like it knows how to read. A strong G2P backend handles heteronyms (“read” as /riːd/ vs “read” as /rɛd/, “lead” as /liːd/ vs “lead” as /lɛd/), proper nouns, and domain-specific terminology without manual intervention.
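Heteronyms are a case where spelling alone cannot decide the pronunciation, so the backend has to look at context. The sketch below uses a deliberately naive cue-word heuristic to pick a sense; the sense labels, cue lists, defaults, and the function name disambiguate are hypothetical. Production systems generally rely on part-of-speech tagging or a learned model rather than keyword spotting.

```python
import re

# Pronunciations per sense (ARPAbet-style labels, stress omitted).
HETERONYMS = {
    "read": {"present": ["R", "IY", "D"], "past": ["R", "EH", "D"]},
    "lead": {"guide": ["L", "IY", "D"], "metal": ["L", "EH", "D"]},
}

# Hypothetical context cues: if one of these words appears in the
# sentence, prefer the associated sense.
CUES = {
    "read": {"yesterday": "past", "already": "past", "had": "past",
             "has": "past", "have": "past"},
    "lead": {"pipe": "metal", "paint": "metal", "poisoning": "metal"},
}

# Sense to use when no cue is found (an arbitrary choice for this sketch).
DEFAULT_SENSE = {"read": "present", "lead": "guide"}


def disambiguate(sentence, target):
    """Pick a sense and pronunciation for `target` using cue words."""
    tokens = re.findall(r"[a-z']+", sentence.lower())
    sense = DEFAULT_SENSE[target]
    for token in tokens:
        if token in CUES[target]:
            sense = CUES[target][token]
            break
    return sense, HETERONYMS[target][sense]


print(disambiguate("I have read the book.", "read"))
print(disambiguate("They read every night.", "read"))
print(disambiguate("The lead pipe burst.", "lead"))
print(disambiguate("She will lead the team.", "lead"))
```

A heuristic this simple fails quickly (for example, "I have to read this" contains "have" but is present tense), which is the reason context-aware models matter for this part of G2P.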