Many TTS models are designed for one thing: reading text aloud. Give them a paragraph and they generate audio. The result can work well for voiceovers, but conversation needs different timing and context.
Sesame CSM (Conversational Speech Model) is built for conversation instead. Rather than rendering pre-written text alone, it generates speech from both text and audio inputs, so earlier turns can inform its timing.
Released under Apache 2.0, CSM represents a different approach to speech synthesis.
What makes CSM different
Traditional TTS follows a pipeline: text → linguistic features → acoustic features → waveform. The model treats speech as text-to-read.
CSM treats speech as conversation-to-have. It is built as a model that generates RVQ audio codes from text and audio inputs. That gives it more context for conversational flow than a text-only TTS model.
| | Traditional TTS | Sesame CSM |
|---|---|---|
| Input | Text | Text + audio context |
| Output | Audio | Audio with conversational timing |
| Turn-taking | Usually external logic | Context-informed timing |
| Emotional range | Often via tags or SSML | Context-informed prosody |
| License | Various | Apache 2.0 |
| Parameters | 82M-4B | 1B |
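To make "RVQ audio codes" concrete, here is a minimal sketch of residual vector quantization (RVQ) in plain Python. It is a generic illustration of the technique, not CSM's actual codec: the two toy codebooks, their sizes, and the 2-dimensional frame are all assumptions chosen so the arithmetic is easy to follow.

```python
from typing import List, Sequence

Vector = List[float]


def nearest(codebook: Sequence[Vector], x: Vector) -> int:
    """Return the index of the codebook entry closest to x (squared L2 distance)."""
    def dist(entry: Vector) -> float:
        return sum((a - b) ** 2 for a, b in zip(entry, x))
    return min(range(len(codebook)), key=lambda i: dist(codebook[i]))


def rvq_encode(x: Vector, codebooks: Sequence[Sequence[Vector]]) -> List[int]:
    """Quantize x stage by stage; each stage encodes what the previous one missed."""
    residual = list(x)
    codes = []
    for codebook in codebooks:
        idx = nearest(codebook, residual)
        codes.append(idx)
        # Subtract the chosen entry so the next stage sees only the leftover error.
        residual = [r - c for r, c in zip(residual, codebook[idx])]
    return codes


def rvq_decode(codes: Sequence[int], codebooks: Sequence[Sequence[Vector]]) -> Vector:
    """Reconstruct the vector by summing the selected entry from every stage."""
    out = [0.0] * len(codebooks[0][0])
    for idx, codebook in zip(codes, codebooks):
        out = [o + c for o, c in zip(out, codebook[idx])]
    return out


# Toy codebooks (assumed values): a coarse stage followed by a finer stage.
CODEBOOKS = [
    [[0.0, 0.0], [1.0, 1.0], [-1.0, 1.0], [1.0, -1.0]],
    [[0.0, 0.0], [0.25, 0.0], [0.0, 0.25], [-0.25, -0.25]],
]

frame = [1.2, 0.9]                      # stand-in for one audio frame's features
codes = rvq_encode(frame, CODEBOOKS)    # one integer code per stage
approx = rvq_decode(codes, CODEBOOKS)   # coarse entry + fine correction
print(codes, approx)
```

Each audio frame becomes a short list of integers, one per codebook. A model that emits those integers is therefore generating audio in discrete steps, much as a language model generates text tokens.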
Key capabilities
Natural turn-taking
CSM is designed to model pauses and conversational rhythm as part of generation. That can make AI voice interactions feel less scripted when the surrounding system supports real dialogue.
Audio context awareness
CSM accepts audio input alongside text. With conversation history, it can generate speech that better fits the surrounding flow. This is different from text-only models that generate speech without acoustic context.
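One way to picture conversation history as input is as a list of prior turns, each pairing a speaker, a transcript, and that turn's audio. The sketch below is an illustration only: `Turn`, `select_context`, the 24 kHz sample rate, and the audio budget are hypothetical names and values, not part of any CSM API. It keeps the most recent turns that fit within a fixed audio budget, which is one simple way to bound how much context a model receives.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Turn:
    """One prior utterance (hypothetical type, for illustration only)."""
    speaker: int
    text: str
    audio: List[float]          # mono samples
    sample_rate: int = 24_000   # assumed value, not taken from the article

    @property
    def duration_s(self) -> float:
        return len(self.audio) / self.sample_rate


def select_context(history: List[Turn], max_seconds: float) -> List[Turn]:
    """Keep the most recent turns whose combined audio fits in max_seconds."""
    kept: List[Turn] = []
    total = 0.0
    for turn in reversed(history):          # walk from newest to oldest
        if total + turn.duration_s > max_seconds:
            break
        kept.append(turn)
        total += turn.duration_s
    kept.reverse()                          # restore chronological order
    return kept


def silence(seconds: int, sample_rate: int = 24_000) -> List[float]:
    """Placeholder audio so the example runs without real recordings."""
    return [0.0] * (seconds * sample_rate)


history = [
    Turn(speaker=0, text="Hi, how can I help?", audio=silence(2)),
    Turn(speaker=1, text="I need to move my appointment.", audio=silence(3)),
    Turn(speaker=0, text="Sure, which day?", audio=silence(1)),
]

context = select_context(history, max_seconds=4.5)
print([t.text for t in context])
```

With a 4.5-second budget, the oldest 2-second turn is dropped and the two most recent turns are kept. In a real system the selected turns would be passed, along with the new text, to whatever generation call the model exposes.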
Emotional prosody without tags
Many TTS models require explicit emotion tags like [laugh] or [whisper] to change delivery. CSM is designed to infer prosody from conversational context, though results depend on prompt, history, and model behavior.
Why CSM matters for voice agents
The voice agent market — AI receptionists, customer support bots, voice assistants — has been constrained by TTS quality. Users tolerate robotic delivery for short interactions but disengage when the voice feels scripted.
CSM targets this gap directly. For applications where the AI needs to sound like a natural conversation partner, it is a meaningful step forward. The Apache 2.0 license makes it attractive for commercial use, though teams should still review license terms and dependencies before product use.
Current limitations
CSM is not a replacement for traditional TTS in every context:
- Voiceover work: CSM is not optimized for reading pre-written scripts cleanly. Traditional TTS models are often better for narration.
- Voice cloning: CSM does not emphasize voice cloning as a primary feature. It is designed for consistent conversational voice, not matching a specific speaker.
- Hardware requirements: The 1B parameter model is more practical with GPU acceleration for interactive use.
- Language focus: Current CSM releases focus on English.
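For a rough sense of scale on the hardware point, the memory needed just to hold the weights of a 1B-parameter model can be estimated as parameters times bytes per parameter. This is back-of-the-envelope arithmetic, not a measured requirement: it assumes exactly one billion parameters and ignores activations, caches, the audio codec, and framework overhead.

```python
def weight_memory_gib(num_params: int, bytes_per_param: int) -> float:
    """Memory for the weights alone, in GiB (excludes activations and overhead)."""
    return num_params * bytes_per_param / 2**30


PARAMS = 1_000_000_000  # nominal "1B"; the exact count is an assumption

for dtype, size in [("float32", 4), ("float16 / bfloat16", 2)]:
    print(f"{dtype}: {weight_memory_gib(PARAMS, size):.2f} GiB")
# float32: 3.73 GiB
# float16 / bfloat16: 1.86 GiB
```

The weights alone are modest by current GPU standards. The case for acceleration in interactive use comes from latency, since speech has to be generated quickly enough to keep up with a live conversation.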
The significance of CSM
CSM matters because it represents an architectural shift in how TTS models think about speech. Instead of “convert this text to audio,” the model asks “how would a person say this in a conversation?”
This shift may influence the next generation of TTS models, even those designed for non-conversational use. Techniques such as audio context awareness and conversational prosody modeling are likely to keep moving into mainstream speech systems.
Where Spokio fits
Spokio focuses on scripted voice generation — the kind creators need for voiceovers, narration, podcasts, and video content. CSM’s conversational strengths are useful in a different context: voice agents, real-time dialogue, and interactive AI.
But the architectural lessons from CSM apply broadly. As TTS models become more aware of conversational flow, the line between “reading” and “speaking” may continue to blur.
For Mac creators who need private, script-driven voice generation today, Spokio is an offline text-to-speech app powered by Chatterbox Turbo, with English voice generation, local voice cloning, batch export, MP3/WAV/AIFF/M4A export, and no cloud uploads for text, audio, or voice samples.
