Text-to-speech models are usually trained on narration data. Audiobooks, news articles, Wikipedia entries — text written to be read. That training makes them good at reading paragraphs aloud, but it also makes them sound like they are reading paragraphs aloud.
If you have ever heard a voice agent that sounds like it is reciting a script instead of having a conversation, you have experienced this mismatch.
ChatTTS is built for that gap. It is an open-source TTS model that targets conversational speech (dialogue, back-and-forth exchanges, natural speech patterns) rather than narration.
What ChatTTS does differently
ChatTTS was developed by 2noise. Its code is published under AGPLv3+ and its model weights under CC BY-NC 4.0, so check the terms before any commercial use. Its training data emphasizes conversational speech: interruptions, overlapping dialogue, varied pacing, natural filler sounds, and the rhythm of turn-taking.
| | Traditional TTS | ChatTTS |
|---|---|---|
| Training data | Audiobooks, articles, narration | Dialogue, conversation, speech |
| Output style | Read-aloud | Conversational |
| Prosody | Even, paragraph-optimized | Varied, dialogue-optimized |
| Best for | Voiceovers, narration | Chatbots, voice agents, dialogue |
| Punctuation handling | Follows written punctuation | Conversational pauses |
| License | Various | AGPLv3+ (code), CC BY-NC 4.0 (model) |
Key features
Dialogue-aware prosody
ChatTTS aims to generate speech with more conversational rhythm. Sentences can have varied pacing, questions can rise in pitch, and responses can carry more natural delivery than paragraph-oriented narration models.
This is difficult to achieve with narration-trained models because they learn that text should be read at an even pace. ChatTTS learns that conversation has starts, stops, interruptions, and changes in energy.
Laugh and filler sounds
ChatTTS can produce conversational filler sounds — small laughs, thinking pauses (“uh”), emphasis breaths — when the generation context supports them. This can make generated conversations sound less scripted.
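Filler sounds can also be requested explicitly. The ChatTTS README describes inline control tokens such as `[laugh]` and `[uv_break]` that are placed directly in the input text. The sketch below is plain string handling and does not call the model. The helper name `add_fillers` and its placement rules are assumptions made for illustration, and the supported token set may differ between ChatTTS versions.

```python
import re

# Inline control tokens described in the ChatTTS README. Treat the exact
# set as version-dependent and check it against the release you install.
LAUGH = "[laugh]"
BREAK = "[uv_break]"


def add_fillers(text: str, laugh_after: frozenset = frozenset()) -> str:
    """Insert a short break after each comma and a laugh after chosen words.

    Hypothetical helper: the placement rules are illustrative only and are
    not a recommendation from the ChatTTS authors.
    """
    # A comma in written text often corresponds to a brief spoken pause.
    text = re.sub(r",\s*", f", {BREAK} ", text)

    words = []
    for word in text.split():
        words.append(word)
        # Compare on the bare word so trailing punctuation does not matter.
        if word.strip(".,!?").lower() in laugh_after:
            words.append(LAUGH)
    return " ".join(words)
```

For example, `add_fillers("Well, that was funny.", frozenset({"funny"}))` returns `"Well, [uv_break] that was funny. [laugh]"`. The marked-up string would then be passed to the model as ordinary input text.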
Multi-speaker dialogue support
ChatTTS is designed for back-and-forth dialogue and multi-speaker-style generation. When configured well, this is useful for:
- Podcast-style AI content with multiple hosts
- Customer service simulations with agent and customer roles
- Interactive narrative with character dialogue
- Language learning exercises with conversation partners
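One way to organize a multi-speaker script, sketched here under stated assumptions: group the turns by speaker, synthesize each speaker's lines in a single batch so the voice only has to be configured once, then put the clips back in conversation order. The `Turn` type and the `synthesize(speaker, texts)` callable are hypothetical stand-ins, not part of ChatTTS. A real version would wrap whatever TTS call and speaker setup you use.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Turn:
    speaker: str  # role label, e.g. "agent" or "customer"
    text: str     # what that speaker says in this turn


def render_dialogue(
    turns: list[Turn],
    synthesize: Callable[[str, list[str]], list[bytes]],
) -> list[bytes]:
    """Synthesize each speaker's lines in one batch, then restore turn order.

    `synthesize(speaker, texts)` is a hypothetical callable standing in for
    a real TTS call; it must return one audio clip per input text.
    """
    # Group turn indices by speaker so each voice is set up only once.
    by_speaker: dict[str, list[int]] = {}
    for index, turn in enumerate(turns):
        by_speaker.setdefault(turn.speaker, []).append(index)

    clips: dict[int, bytes] = {}
    for speaker, indices in by_speaker.items():
        audio = synthesize(speaker, [turns[i].text for i in indices])
        if len(audio) != len(indices):
            raise ValueError(f"expected {len(indices)} clips for {speaker!r}")
        for i, clip in zip(indices, audio):
            clips[i] = clip

    # Reassemble in the original conversation order.
    return [clips[i] for i in range(len(turns))]
```

Batching by speaker is a design choice, not a requirement. It keeps each role's voice consistent across the conversation, at the cost of synthesizing turns without the audio context of the turn before them.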
Where ChatTTS fits
Voice agents and chatbots
If you are building a voice agent that talks to customers, ChatTTS is likely to sound more natural than a narration-trained model. The conversational prosody makes the agent feel less like an automated system reading a script and more like a person taking part in a conversation.
Dialogue prototyping
Game developers and interactive fiction writers can use ChatTTS to prototype dialogue between characters and hear how conversations flow before hiring voice actors.
Language learning
Conversational dialogue generated by ChatTTS works well for language learning applications, where learners need to hear natural speech patterns rather than carefully enunciated narration.
Current limitations
ChatTTS has some constraints worth noting:
- Narration quality: It is worse than Kokoro or Qwen3 for reading long-form content. Use it for dialogue, not voiceovers.
- Voice cloning: Cloning support is less developed than dedicated cloning models like Chatterbox or Voxtral.
- Language coverage: Strongest in English and Chinese. Other languages have more limited support.
- Hardware: Requires a GPU for real-time inference. Not as lightweight as Piper or NeuTTS Nano.
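The hardware point can be checked on your own machine by measuring the real-time factor: synthesis time divided by the duration of the audio produced. A value below 1.0 means audio is generated faster than it plays back. The sketch below is a minimal measurement helper. The `synthesize` callable is a hypothetical wrapper around your TTS call, and the 24 kHz default is an assumption based on the sample rate used in ChatTTS's published examples.

```python
import time
from typing import Callable, Sequence


def real_time_factor(
    synthesize: Callable[[str], Sequence[float]],
    text: str,
    sample_rate: int = 24_000,  # assumed output rate; adjust for your model
) -> float:
    """Return synthesis time divided by the duration of the audio produced.

    `synthesize` is a hypothetical callable that returns mono samples
    for the given text.
    """
    start = time.perf_counter()
    samples = synthesize(text)
    elapsed = time.perf_counter() - start

    if len(samples) == 0:
        raise ValueError("synthesize returned no audio")

    audio_seconds = len(samples) / sample_rate
    return elapsed / audio_seconds
```

A single call is a noisy measurement. Running it on several sentences of typical length, after one warm-up call, gives a more useful estimate for a voice agent.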
Why ChatTTS represents a category shift
ChatTTS is part of a broader trend in 2026: TTS models that specialize rather than generalize. Instead of one model doing everything adequately, the ecosystem is fragmenting into:
- Narration models (Kokoro, Piper): Strong for reading text aloud
- Conversational models (ChatTTS, Sesame CSM): Strong for dialogue
- Emotional models (Orpheus, Hume Octave): Strong for expressive range
- Cloning models (Chatterbox, Voxtral): Strong for voice replication
This specialization means choosing the right model depends more than ever on your specific use case. A model that sounds robotic in one context can be excellent in another.
Where Spokio fits
Spokio is a native Mac text-to-speech app powered by Chatterbox Turbo. It focuses on offline voice generation for creators, writers, educators, and privacy-conscious Mac users, with local voice cloning from short samples and batch export without cloud uploads for text, audio, or voice samples.
ChatTTS serves a different use case: real-time conversational AI. If your primary need is script-driven voice content with privacy and batch export, Spokio’s local TTS workflow on Mac is the practical choice.
