If you are building an app, service, or pipeline that generates speech at scale, per-minute TTS pricing is a trap you do not want to fall into.
ElevenLabs charges $0.09–0.10 per minute for its conversational AI agents. At 10,000 minutes per month (about 5.5 hours per day), that is $900–1,000. Deepgram’s voice agent API runs $0.075–0.163 per minute. PlayHT and Murf meter by characters with sharp overage rates. The costs add up fast when your app actually gets used.
The good news: you have better options. Here are five approaches to TTS that avoid the per-minute meter, ranked by effort-to-savings ratio.
1. One-Time Purchase Offline Apps
Best for: Individual developers, small teams, Mac or Windows users who want zero marginal cost with minimal setup.
A one-time-purchase offline app is the simplest escape from usage-based pricing. Buy once, generate forever. No meter, no characters counted, no minutes tracked.
| App | Platform | Price | Voice Cloning | Export Formats |
|---|---|---|---|---|
| Spokio | macOS | Free + $49.99 lifetime Pro | Yes (free tier) | MP3, WAV, AIFF, M4A |
| Balabolka | Windows | Free | No | MP3, WAV, OGG, WMA |
| macOS Spoken Content | macOS | Free (built-in) | No | No built-in export |
Spokio is a native macOS app that runs entirely offline. It is powered by Chatterbox Turbo and supports local voice cloning from short samples on the free plan. The lifetime Pro ($49.99) unlocks batch export, unlimited custom voices, and priority support — with no subscription and no per-character billing.
Balabolka is a free Windows application that uses your system’s SAPI voices. It reads PDFs, DOCX, EPUB, and more, and exports to audio files with no limits, ads, or subscriptions. The app is freeware; voice quality depends on what you have installed.
macOS Spoken Content is built into every Mac. Enable it in System Settings → Accessibility → Spoken Content. You get decent neural voices (downloadable for free) and system-wide text selection reading. No export built in, but for quick listening it is zero-cost and zero-setup.
Cost at volume: $0 after the initial purchase. For Spokio Pro ($49.99), the year-one amortized cost is under $0.001/min at 100 hours per month (about $0.0007/min). After year one, the marginal cost is zero.
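The amortization above is simple division. Here is a minimal sketch of it, assuming the $49.99 Spokio Pro price and the 100 hours per month used in this section; the function name is made up for illustration.

```python
def amortized_cost_per_minute(purchase_price: float,
                              hours_per_month: float,
                              months: int) -> float:
    """Spread a one-time purchase price over every minute of audio generated."""
    total_minutes = hours_per_month * 60 * months
    if total_minutes <= 0:
        raise ValueError("usage must be positive")
    return purchase_price / total_minutes


# Assumed inputs: $49.99 one-time price, 100 hours/month, first 12 months.
year_one = amortized_cost_per_minute(49.99, hours_per_month=100, months=12)
print(f"${year_one:.5f} per minute")  # 49.99 spread over 72,000 minutes
```

Extending `months` past 12 only pushes the figure lower, which is the whole appeal of a one-time purchase.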
2. Free Built-In and Browser Voices
Best for: Zero-budget projects, prototyping, personal use, and apps where voice quality is not the priority.
Before committing to any paid TTS solution, check what you already have.
| Source | Platform | Voice Quality | Offline | Export |
|---|---|---|---|---|
| Microsoft Edge Read Aloud | Windows, Mac, Linux, Android | Good (Microsoft Neural) | Partial | No (use edge-tts Python lib) |
| Web Speech API | Chrome, Edge, Firefox, Safari, Opera | OS-dependent | OS-dependent | No |
| macOS Spoken Content | macOS | Decent (Apple Neural) | Yes | No |
Microsoft Edge Read Aloud uses Microsoft’s neural TTS voices, which are genuinely good — comparable to many paid services. It works on any platform Edge runs on. For developers, the edge-tts Python library (MIT license) accesses the same voices programmatically without requiring the browser or an API key.
Web Speech API is the simplest path to TTS in a browser app. A single line of JavaScript (window.speechSynthesis.speak(new SpeechSynthesisUtterance(text))) gives you speech in any modern browser. Quality depends entirely on the OS: Windows 11 neural voices are good, macOS voices are decent, and Linux voices are robotic. There is no export and no voice cloning, and SSML is not supported in every implementation.
macOS Spoken Content is system-wide. Select text, press a keyboard shortcut, and the Mac reads it aloud. Voices are Apple’s neural TTS, which are free to download and work offline.
Cost at volume: $0 forever. But you get what you pay for — no voice cloning, limited language support, no emotion control, and no batch export without workarounds.
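For the edge-tts route mentioned above, a minimal sketch might look like the following. It assumes the third-party edge-tts package is installed (pip install edge-tts) and that network access is available. The voice name en-US-AriaNeural and the helper names split_text and synthesize are illustrative choices, not part of the article. Chunking is a design choice to keep each request small, not a documented requirement of the library.

```python
import asyncio


def split_text(text: str, max_chars: int = 2000) -> list[str]:
    """Split text into chunks of at most max_chars, breaking on whitespace.

    A single word longer than max_chars is kept whole in its own chunk.
    """
    chunks, current = [], ""
    for word in text.split():
        candidate = f"{current} {word}".strip()
        if len(candidate) <= max_chars or not current:
            current = candidate
        else:
            chunks.append(current)
            current = word
    if current:
        chunks.append(current)
    return chunks


async def synthesize(text: str, out_prefix: str,
                     voice: str = "en-US-AriaNeural") -> list[str]:
    """Write one MP3 per chunk with edge-tts and return the file paths."""
    import edge_tts  # imported lazily so split_text works without the package

    paths = []
    for i, chunk in enumerate(split_text(text)):
        path = f"{out_prefix}_{i:03d}.mp3"
        await edge_tts.Communicate(chunk, voice).save(path)
        paths.append(path)
    return paths


# Example call (needs network access and the edge-tts package):
# asyncio.run(synthesize("Hello from edge-tts.", "hello"))
```

Since the service behind Edge Read Aloud is not a contracted API, treat this as a prototyping path rather than a guaranteed production dependency.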
3. Self-Host Open-Source Models
Best for: Developers with GPU access, high-volume generation, privacy-sensitive workloads, and anyone who wants zero marginal cost per character.
Open-source TTS models have improved dramatically. Several now match cloud quality on standard benchmarks. After the initial hardware investment, each generated character costs nothing.
| Model | Params | VRAM | Voice Cloning | Languages | RTF (RTX 4090) | License |
|---|---|---|---|---|---|---|
| Kokoro | 82M | <1 GB | No | 8 | ~30x | Apache 2.0 |
| Chatterbox-Turbo | 350M | ~4 GB | Yes (10s clip) | 1 (23 in the multilingual model) | ~6-8x | MIT |
| Qwen3-TTS 0.6B | 600M | ~3 GB | Yes (3s clip) | 10 | ~2-3x | Apache 2.0 |
| Qwen3-TTS 1.7B | 1.7B | ~5.4 GB | Yes (3s clip) | 10 | ~0.65-0.85x | Apache 2.0 |
| Coqui XTTS-v2 | 470M | ~2-4 GB | Yes (6s clip) | 17 | ~5x | MPL-2.0 |
| Fish Audio S2-Pro | Open | ~4-8 GB | Yes | 80+ | Fast (100ms TTFA) | Apache 2.0 |
| Orpheus | 3B | ~7+ GB | No | English | Moderate | Llama 3.2 Community |
Kokoro is the standout for low-resource self-hosting. At 82M parameters, it runs on a CPU, a Raspberry Pi, or any GPU with under 1 GB of VRAM. The ~30x figure in the table is for an RTX 4090 and means roughly 30 seconds of audio per second of computation; expect lower multiples on mid-range GPUs and CPUs. There is no voice cloning, but for straightforward narration it is essentially free after hardware.
Chatterbox-Turbo is the engine powering Spokio. At 350M parameters and under the MIT license, it fits on a 4 GB GPU and supports voice cloning from 10-second audio clips. The 1.7B Qwen3-TTS model is slower (below real time on the same GPU, per the table) but offers 3-second voice cloning across 10 languages.
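To turn the speed multiples in the table into wall-clock time, divide the audio duration by the multiple. The sketch below uses the low end of each range from the table. The function name is hypothetical, and the numbers are only as reliable as the table's benchmarks.

```python
def compute_seconds(audio_seconds: float, speed_multiple: float) -> float:
    """Wall-clock seconds needed to generate audio at a given real-time multiple."""
    if speed_multiple <= 0:
        raise ValueError("speed_multiple must be positive")
    return audio_seconds / speed_multiple


ONE_HOUR = 3600  # seconds of audio to generate

# Low end of each range from the model table (RTX 4090 figures).
for name, multiple in [
    ("Kokoro", 30.0),
    ("Chatterbox-Turbo", 6.0),
    ("Qwen3-TTS 1.7B", 0.65),
]:
    minutes = compute_seconds(ONE_HOUR, multiple) / 60
    print(f"{name}: {minutes:.1f} min of compute per hour of audio")
```

A multiple below 1x means the model is slower than real time, so it suits batch jobs better than live playback.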
Break-even analysis vs cloud APIs:
| Model | Hardware Cost | Monthly GPU Rental | Break-even vs ElevenLabs Flash | Break-even vs OpenAI TTS-1 |
|---|---|---|---|---|
| Kokoro (CPU) | $0 (existing hardware) | $0 | Immediate | Immediate |
| Kokoro (RTX 4060) | $300 | ~$60/mo | ~2,500 chars/day | ~4,000 chars/day |
| Qwen3-TTS (RTX 3090) | $1,600 (one-time) | ~$110/mo | ~29,000 utterances/mo | ~37,000 utterances/mo |
| Qwen3-TTS (L40S dedicated) | $6,000 | ~$619/mo | ~29,000 utterances/mo | Never (GPU costs more) |
Self-hosting breaks even fastest when you already own the hardware. If you have a desktop with a decent GPU, Kokoro or Chatterbox-Turbo can serve unlimited TTS at zero extra cost.
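One way to sanity-check a break-even point for your own setup is to compare a fixed monthly GPU cost against a per-character API bill. This is a deliberately simplified sketch: it ignores hardware purchase, power, and engineering time, and it is not expected to reproduce the table's figures, which rest on assumptions the article does not state. The function name and example inputs are illustrative.

```python
def break_even_chars_per_month(monthly_fixed_cost: float,
                               api_price_per_million: float) -> float:
    """Monthly character volume at which a fixed cost equals the API bill."""
    if api_price_per_million <= 0:
        raise ValueError("api_price_per_million must be positive")
    return monthly_fixed_cost / api_price_per_million * 1_000_000


# Illustrative inputs: a $60/mo GPU rental against a $15 per 1M chars API.
threshold = break_even_chars_per_month(60.0, 15.0)
print(f"Self-hosting is cheaper above {threshold:,.0f} chars/month")
```

Below the threshold the API is cheaper; above it, every extra character widens the gap in favor of the fixed-cost GPU.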
4. Cheap Per-Character APIs
Best for: Low-to-medium volume, multi-language needs, or when you need cloud-quality voices without per-minute billing.
Not every cloud API charges by the minute. Several major providers bill per character at rates that are dramatically cheaper than per-minute services.
| Provider | Model | $ per 1M chars | Free Tier |
|---|---|---|---|
| Cloudflare MeloTTS | Workers AI | $0.29 | Yes (limited) |
| Google Gemini Flash TTS | Flash-Lite Preview | $0.50 | $300 GCP credits |
| OpenAI GPT-4o Mini TTS | TTS | $0.60 | 3 RPM / 200 RPD |
| Amazon Polly Standard | Standard | $4.00 | 5M chars/mo (first 12 mo) |
| Google Cloud Standard | Standard | $4.00 | 4M chars/mo (forever) |
| xAI Grok TTS | Standard | $4.00 | — |
| ElevenLabs Flash v2 | Fast | $8.00 | 10K chars/mo |
| Azure Neural TTS | Neural | $15.00 | 5M chars/mo (forever) |
| OpenAI TTS-1 | Standard | $15.00 | Rate-limited free tier |
| Fish Audio S2 Pro | Neural | $15.00 | ~7 min/mo free |
| Deepgram Aura-1 | TTS | $15.00 | $200 free credit |
| Inworld TTS 1.5 Mini | Standard | $25.00 | Up to 40 min TTS free |
The cheapest options:
- Google Gemini Flash TTS at $0.50/1M chars is shockingly cheap. For 100 hours of audio (~5.4M chars), that is $2.70/month.
- Amazon Polly Standard at $4/1M chars is a workhorse. 5M chars free per month for the first year.
- Google Cloud Standard gives you 4M chars free forever, then $4/1M.
The trap to avoid: Per-minute APIs like ElevenLabs Agents ($0.09/min) and Deepgram Voice Agent ($0.075–0.163/min) sound cheap at low volume but scale linearly. At 10,000 minutes/month, ElevenLabs costs $900. The same volume via Google Standard ($4/1M chars at ~900 chars/min) costs ~$36.
| Provider | Cost for 1K min | Cost for 10K min | Cost for 100K min |
|---|---|---|---|
| ElevenLabs Agents ($0.09/min) | $90 | $900 | $9,000 |
| Deepgram Voice Agent ($0.075/min) | $75 | $750 | $7,500 |
| Google Standard ($4/1M chars) | ~$3.60 | ~$36 | ~$360 |
| OpenAI TTS-1 ($15/1M chars) | ~$13.50 | ~$135 | ~$1,350 |
| Azure Neural ($15/1M chars) | ~$13.50 | ~$135 | ~$1,350 |
| Azure Commitment ($7.50/1M at 2B) | ~$6.75 | ~$67.50 | ~$675 |
Per-character APIs are not a trap. Per-minute APIs are. Both scale linearly with usage, but the unit rate is roughly 5 to 25 times lower per character in the table above. The other difference is transparency: per-character billing is predictable, while per-minute billing often hides behind “included minutes” and “overage rates” that spike when you exceed your plan.
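The comparison table can be reproduced with a few lines of arithmetic. This sketch assumes the article's rough conversion of 900 characters per minute of speech and uses the ElevenLabs Agents and Google Standard rates quoted above; the function names are illustrative.

```python
CHARS_PER_MINUTE = 900  # the article's rough conversion; real speech rates vary


def per_minute_cost(minutes: float, rate_per_minute: float) -> float:
    """Bill for a service metered by the minute."""
    return minutes * rate_per_minute


def per_char_cost(minutes: float, price_per_million_chars: float,
                  chars_per_minute: int = CHARS_PER_MINUTE) -> float:
    """Bill for the same audio on a service metered by the character."""
    chars = minutes * chars_per_minute
    return chars / 1_000_000 * price_per_million_chars


for minutes in (1_000, 10_000, 100_000):
    agents = per_minute_cost(minutes, 0.09)   # $0.09/min
    standard = per_char_cost(minutes, 4.00)   # $4 per 1M chars
    print(f"{minutes:>7,} min: per-minute ${agents:,.2f} "
          f"vs per-character ${standard:,.2f}")
```

Changing `chars_per_minute` shifts the per-character column, so it is worth measuring your own text-to-audio ratio before relying on these numbers.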
5. Flat-Rate Subscriptions
Best for: When you need cloud features (voice variety, team collaboration, API access) and want predictable monthly costs.
Flat-rate subscriptions are not the cheapest option, but they reduce meter anxiety. You pay a fixed amount and can generate up to the plan's included allowance without watching a per-request bill.
| Service | Price | Limits | Effective Per-Minute (at cap, ~900 chars/min) |
|---|---|---|---|
| ElevenLabs Creator | $22/mo | 100K chars | ~$0.20/min |
| ElevenLabs Pro | $99/mo | 500K chars | ~$0.18/min |
| ElevenLabs Scale | $330/mo | 2M chars | ~$0.15/min |
| PlayHT Premium | $29/mo | 1M chars | ~$0.026/min |
| Murf Creator | $19/mo | 24 hrs/yr | ~$0.16/min |
Flat-rate plans work best when your usage is consistent and fits within the included allowance. The risk is overage — ElevenLabs charges $0.12–0.30 per 1K chars above your plan limit, which can quickly erase the cost advantage.
The subscription trap: Many usage-metered services present themselves as subscriptions (ElevenLabs Creator is $22/mo “for 100K chars”). This looks like a flat rate, but exceed the allowance by 10% and you pay overage on top. The meter is still running inside the subscription wrapper.
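To see what a plan costs in practice, it helps to compute both the effective per-minute rate at the cap and the bill once you go over. The sketch below assumes about 900 characters per minute and uses $0.30 per 1K chars, the upper end of the overage range quoted above. Both helper names are hypothetical.

```python
def effective_per_minute(monthly_price: float, included_chars: int,
                         chars_per_minute: int = 900) -> float:
    """Price per minute of audio when the plan is used exactly to its cap."""
    included_minutes = included_chars / chars_per_minute
    return monthly_price / included_minutes


def monthly_bill(monthly_price: float, included_chars: int,
                 used_chars: int, overage_per_1k: float) -> float:
    """Plan price plus overage for any characters beyond the allowance."""
    extra_chars = max(0, used_chars - included_chars)
    return monthly_price + extra_chars / 1000 * overage_per_1k


# ElevenLabs Pro from the table: $99/mo for 500K chars.
print(f"Pro at cap: ${effective_per_minute(99, 500_000):.3f}/min")

# Creator plan used 10% over its 100K allowance, at an assumed $0.30 per 1K.
print(f"Creator at 110K chars: ${monthly_bill(22, 100_000, 110_000, 0.30):.2f}")
```

Usage under the cap makes the effective rate worse, not better, because the plan price stays fixed while the minutes generated fall.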
Decision Matrix: Which Approach at What Volume
| Monthly Volume | Best Approach | Monthly Cost | Why |
|---|---|---|---|
| < 10K chars | Built-in / free | $0 | No need to pay for anything at this volume. |
| 10K–100K chars | Cheap per-character API | $0.01–$0.40 | Google Gemini Flash ($0.50/1M) or Polly Standard ($4/1M) cost pennies. |
| 100K–1M chars | Per-character API or self-host | $0.40–$4.00 | Google/Amazon APIs stay cheap. Or self-host Kokoro if you have a machine. |
| 1M–10M chars | Self-host or one-time app | $0–$60 | Self-host Kokoro ($0 marginal) or Chatterbox on a mid GPU ($60/mo rental). |
| 10M–100M chars | Self-host dedicated | $110–$619 | Qwen3-TTS or Chatterbox on a dedicated GPU. Cost is flat regardless of volume. |
| 100M+ chars | Self-host enterprise | $619+ | Azure commitment ($7.50/1M) can also compete at this scale. |
Key insight: An offline app (Spokio on macOS, or the free Balabolka on Windows) wins at every volume bracket for its platform because the marginal cost is $0. The only reason to use anything else is if you need cross-platform support, specific cloud voices, or multi-language coverage the app does not provide.
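If you want the decision matrix inside a planning script, it can be encoded as a simple bracket lookup. This is a sketch of the table above, not a recommendation engine. A volume that falls exactly on a boundary is assigned to the higher bracket, which is an assumption, since the table's ranges overlap at their edges.

```python
import bisect

# Bracket boundaries in characters per month, taken from the matrix above.
_BOUNDS = [10_000, 100_000, 1_000_000, 10_000_000, 100_000_000]
_APPROACHES = [
    "Built-in / free",
    "Cheap per-character API",
    "Per-character API or self-host",
    "Self-host or one-time app",
    "Self-host dedicated",
    "Self-host enterprise",
]


def recommend(chars_per_month: int) -> str:
    """Return the matrix's suggested approach for a monthly character volume."""
    if chars_per_month < 0:
        raise ValueError("chars_per_month must be non-negative")
    # bisect_right counts how many boundaries are <= the volume,
    # which is exactly the index of the matching bracket.
    return _APPROACHES[bisect.bisect_right(_BOUNDS, chars_per_month)]


for volume in (5_000, 250_000, 50_000_000):
    print(f"{volume:>11,} chars/mo -> {recommend(volume)}")
```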
Summary
The per-minute TTS pricing trap is easy to fall into because the first 15 minutes are free and the first thousand characters are cheap. But TTS scales. An app that generates 10,000 minutes of audio monthly (about 167 hours) will pay $900–1,000/month on per-minute APIs. The same volume costs ~$36 via Google Standard, ~$60 via a self-hosted GPU, or $0 after a one-time app purchase.
Five ways to avoid the meter:
- One-time-purchase offline apps: Spokio ($49.99 lifetime, fully offline, voice cloning included, macOS) or Balabolka (free, Windows).
- Built-in voices: macOS Spoken Content, Edge Read Aloud, Web Speech API. Free but limited.
- Self-host open-source: Kokoro (CPU), Chatterbox (4 GB GPU), Qwen3-TTS. Zero marginal cost.
- Cheap per-character APIs: Google Gemini $0.50/1M, Polly $4/1M, Azure $7.50/1M (commitment).
- Flat-rate subscriptions: Predictable monthly cost, but watch for overage.
Pick the approach that matches your volume and platform. If you are on Mac and want to never think about TTS pricing again, the one-time purchase route is hard to beat.
For more comparisons, read Local TTS vs Cloud TTS: Which Is Better? and AI Voiceover Cost Comparison: Cloud Subscriptions vs Local TTS.
