How to Avoid Per-Minute TTS Pricing: A Developer's Guide to 2026

If you are building an app, service, or pipeline that generates speech at scale, per-minute TTS pricing is a trap you do not want to fall into.

ElevenLabs charges $0.09–0.10 per minute for its conversational AI agents. At 10,000 minutes per month (about 5.5 hours per day), that is $900–1,000. Deepgram’s voice agent API runs $0.075–0.163 per minute. PlayHT and Murf meter by characters with sharp overage rates. The costs add up fast when your app actually gets used.

The good news: you have better options. Here are five approaches to TTS that avoid the per-minute meter, ranked by effort-to-savings ratio.

1. One-Time Purchase Offline Apps

Best for: Individual developers, small teams, Mac or Windows users who want zero marginal cost with minimal setup.

A one-time-purchase offline app is the simplest escape from usage-based pricing. Buy once, generate forever. No meter, no characters counted, no minutes tracked.

App	Platform	Price	Voice Cloning	Export Formats
Spokio	macOS	Free + $49.99 lifetime Pro	Yes (free tier)	MP3, WAV, AIFF, M4A
Balabolka	Windows	Free	No	MP3, WAV, OGG, WMA
macOS Spoken Content	macOS	Free (built-in)	No	No built-in export

Spokio is a native macOS app that runs entirely offline. It is powered by Chatterbox Turbo and supports local voice cloning from short samples on the free plan. The lifetime Pro ($49.99) unlocks batch export, unlimited custom voices, and priority support — with no subscription and no per-character billing.

Balabolka is a free Windows application that uses your system’s SAPI voices. It reads PDFs, DOCX, EPUB, and more, and exports to audio files with no limits, ads, or subscriptions. The app is freeware; voice quality depends on what you have installed.

macOS Spoken Content is built into every Mac. Enable it in System Settings → Accessibility → Spoken Content. You get decent neural voices (downloadable for free) and system-wide text selection reading. No export built in, but for quick listening it is zero-cost and zero-setup.

Cost at volume: $0 after initial purchase. Year one amortized cost is under $0.001/min at 100 hours per month. After year one, it costs literally zero.
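
The amortized figure is simple division. A minimal sketch, assuming the $49.99 lifetime price from the table and a steady 100 hours of generated audio per month (the function name is illustrative and not part of any app):

```python
def amortized_cost_per_minute(price: float, hours_per_month: float, months: int) -> float:
    """Spread a one-time purchase price over every minute of audio generated."""
    total_minutes = hours_per_month * 60 * months
    return price / total_minutes


# $49.99 spread over one year at 100 hours of audio per month
year_one = amortized_cost_per_minute(49.99, hours_per_month=100, months=12)
print(f"${year_one:.5f} per minute")
```

Lower usage raises the per-minute figure, so it is worth plugging in your own monthly hours before comparing against an API quote.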

2. Free Built-In and Browser Voices

Best for: Zero-budget projects, prototyping, personal use, and apps where voice quality is not the priority.

Before committing to any paid TTS solution, check what you already have.

Source	Platform	Voice Quality	Offline	Export
Microsoft Edge Read Aloud	Windows, Mac, Linux, Android	Good (Microsoft Neural)	Partial	No (use `edge-tts` Python lib)
Web Speech API	Chrome, Edge, Firefox, Safari, Opera	OS-dependent	OS-dependent	No
macOS Spoken Content	macOS	Decent (Apple Neural)	Yes	No

Microsoft Edge Read Aloud uses Microsoft’s neural TTS voices, which are genuinely good and comparable to many paid services. It works on any platform Edge runs on. For developers, the edge-tts Python library (LGPL-3.0 licensed) accesses the same voices programmatically without requiring the browser or an API key.
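
As a rough sketch of what programmatic use might look like, the snippet below wraps the library's `Communicate` class in a small coroutine. It assumes `edge-tts` is installed (`pip install edge-tts`) and that a network connection is available, since the voices are served by Microsoft's online service. The helper name `synthesize` and the default voice are illustrative choices.

```python
import asyncio


async def synthesize(text: str, out_path: str, voice: str = "en-US-AriaNeural") -> None:
    """Write `text` as speech to `out_path` (MP3) using an Edge neural voice."""
    # Imported lazily so the module loads even where edge-tts is not installed.
    import edge_tts  # third-party package, assumed installed

    communicate = edge_tts.Communicate(text, voice)
    await communicate.save(out_path)


# Example call (needs network access):
# asyncio.run(synthesize("Hello from edge-tts.", "hello.mp3"))
```

Because the audio comes from a remote service, this route is free but not offline, and availability is outside your control.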

Web Speech API is the simplest path to TTS in a browser app. A single line of JavaScript, `window.speechSynthesis.speak(new SpeechSynthesisUtterance(text))`, gives you speech on any modern browser. Quality depends entirely on the OS: Windows 11 neural voices are good, macOS voices are decent, Linux voices are robotic. There is no export and no voice cloning, and SSML is not supported consistently across implementations.

macOS Spoken Content is system-wide. Select text, press a keyboard shortcut, and the Mac reads it aloud. Voices are Apple’s neural TTS, which are free to download and work offline.

Cost at volume: $0 forever. But you get what you pay for — no voice cloning, limited language support, no emotion control, and no batch export without workarounds.

3. Self-Host Open-Source Models

Best for: Developers with GPU access, high-volume generation, privacy-sensitive workloads, and anyone who wants zero marginal cost per character.

Open-source TTS models have improved dramatically. Several now match cloud quality on standard benchmarks. After the initial hardware investment, each generated character costs nothing.

Model	Params	VRAM	Voice Cloning	Languages	RTF (RTX 4090)	License
Kokoro	82M	<1 GB	No	8	~30x	Apache 2.0
Chatterbox-Turbo	350M	~4 GB	Yes (10s clip)	1 (23 in the multilingual model)	~6-8x	MIT
Qwen3-TTS 0.6B	600M	~3 GB	Yes (3s clip)	10	~2-3x	Apache 2.0
Qwen3-TTS 1.7B	1.7B	~5.4 GB	Yes (3s clip)	10	~0.65-0.85x	Apache 2.0
Coqui XTTS-v2	470M	~2-4 GB	Yes (6s clip)	17	~5x	MPL-2.0 (code); Coqui Public Model License (weights)
Fish Audio S2-Pro	Open	~4-8 GB	Yes	80+	Fast (100ms TTFA)	Apache 2.0
Orpheus	3B	~7+ GB	No	English	Moderate	Llama 3.2 Community

Kokoro is the standout for low-resource self-hosting. At 82M parameters, it runs on CPU, a Raspberry Pi, or any GPU with <1 GB VRAM. The ~30x figure in the table (measured on an RTX 4090) is a speed-up over real time: roughly 30 seconds of audio per second of computation. There is no voice cloning, but for straightforward narration it is essentially free after hardware.

Chatterbox-Turbo is the engine powering Spokio. At 350M parameters with MIT license, it fits on a 4 GB GPU and supports voice cloning from 10-second audio clips. The 1.7B Qwen3-TTS model is slower but offers 3-second voice cloning across 10 languages.
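
To turn the speed-up column into planning numbers, divide the audio duration by the speed-up. A minimal sketch, assuming the table's ~30x and ~0.65x figures hold on your hardware (they may not) and using an illustrative helper name:

```python
def generation_time_seconds(audio_seconds: float, speedup: float) -> float:
    """Compute time needed to synthesize audio at a given real-time speed-up."""
    if speedup <= 0:
        raise ValueError("speedup must be positive")
    return audio_seconds / speedup


HOUR = 3600
for name, speedup in [("Kokoro", 30.0), ("Qwen3-TTS 1.7B", 0.65)]:
    hours = generation_time_seconds(100 * HOUR, speedup) / HOUR
    print(f"{name}: about {hours:.1f} compute hours for 100 hours of audio")
```

A speed-up below 1x means the model is slower than real time, which matters for interactive use even when batch throughput is acceptable.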

Break-even analysis vs cloud APIs:

Model	Hardware Cost	Monthly GPU Rental	Break-even vs ElevenLabs Flash	Break-even vs OpenAI TTS-1
Kokoro (CPU)	$0 (existing hardware)	$0	Immediate	Immediate
Kokoro (RTX 4060)	$300	~$60/mo	~2,500 chars/day	~4,000 chars/day
Qwen3-TTS (RTX 3090)	$1,600 (one-time)	~$110/mo	~29,000 utterances/mo	~37,000 utterances/mo
Qwen3-TTS (L40S dedicated)	$6,000	~$619/mo	~29,000 utterances/mo	Never (GPU costs more)

Self-hosting breaks even fastest when you already own the hardware. If you have a desktop with a decent GPU, Kokoro or Chatterbox-Turbo can serve unlimited TTS at zero extra cost.
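
The break-even idea reduces to dividing a fixed monthly cost by the API's per-character rate. The sketch below shows only that arithmetic. It ignores electricity, utilisation, and whatever utterance-length assumptions sit behind the table, so its output will not necessarily match the table's rows. The function name is hypothetical.

```python
def break_even_chars_per_month(monthly_fixed_cost: float, api_price_per_million: float) -> float:
    """Monthly character volume at which a fixed cost equals the API bill."""
    if api_price_per_million <= 0:
        raise ValueError("API price must be positive")
    return monthly_fixed_cost / api_price_per_million * 1_000_000


# Illustrative inputs: a $110/mo GPU rental against a $15 per 1M chars API
volume = break_even_chars_per_month(110, 15.0)
print(f"Break-even at about {volume / 1e6:.1f}M chars per month")
```

Below that volume the API is cheaper; above it, the rented GPU wins as long as it can keep up with the load.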

4. Cheap Per-Character APIs

Best for: Low-to-medium volume, multi-language needs, or when you need cloud-quality voices without per-minute billing.

Not every cloud API charges by the minute. Several major providers bill per character at rates that are dramatically cheaper than per-minute services.

Provider	Model	$ per 1M chars	Free Tier
Google Gemini Flash TTS	Flash-Lite Preview	$0.50	$300 GCP credits
OpenAI GPT-4o Mini TTS	TTS	$0.60	3 RPM / 200 RPD
Cloudflare MeloTTS	Workers AI	$0.29	Yes (limited)
Amazon Polly Standard	Standard	$4.00	5M chars/mo (first 12 mo)
Google Cloud Standard	Standard	$4.00	4M chars/mo (forever)
xAI Grok TTS	Standard	$4.00	—
Azure Neural TTS	Neural	$15.00	5M chars/mo (forever)
OpenAI TTS-1	Standard	$15.00	Rate-limited free tier
Fish Audio S2 Pro	Neural	$15.00	~7 min/mo free
Deepgram Aura-1	TTS	$15.00	$200 free credit
ElevenLabs Flash v2	Fast	$8.00	10K chars/mo
Inworld TTS 1.5 Mini	Standard	$25.00	Up to 40 min TTS free

The cheapest options:

Google Gemini Flash TTS at $0.50/1M chars is shockingly cheap. For 100 hours of audio (~5.4M chars), that is $2.70/month.
Amazon Polly Standard at $4/1M chars is a workhorse. 5M chars free per month for the first year.
Google Cloud Standard gives you 4M chars free forever, then $4/1M.

The trap to avoid: Per-minute APIs like ElevenLabs Agents ($0.09/min) and Deepgram Voice Agent ($0.075–0.163/min) sound cheap at low volume, but the rate is steep once converted to characters: at ~900 chars/min, $0.09/min works out to about $100 per 1M chars. At 10,000 minutes/month, ElevenLabs costs $900. The same volume via Google Standard ($4/1M chars) costs ~$36.

Provider	Cost for 1K min	Cost for 10K min	Cost for 100K min
ElevenLabs Agents ($0.09/min)	$90	$900	$9,000
Deepgram Voice Agent ($0.075/min)	$75	$750	$7,500
Google Standard ($4/1M chars)	~$3.60	~$36	~$360
OpenAI TTS-1 ($15/1M chars)	~$13.50	~$135	~$1,350
Azure Neural ($15/1M chars)	~$13.50	~$135	~$1,350
Azure Commitment ($7.50/1M at 2B)	~$6.75	~$67.50	~$675

Per-character APIs are not a trap. Per-minute APIs are. Both scale linearly with usage; the difference is the rate and the transparency. Per-character billing is predictable, while per-minute billing often hides behind “included minutes” and “overage rates” that spike when you exceed your plan.
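
The comparison can be reproduced with a few lines of arithmetic. This sketch assumes the ~900 chars/min speaking rate used in this section and the list prices quoted above; the helper names are illustrative.

```python
CHARS_PER_MINUTE = 900  # assumed speaking rate, taken from the text


def per_minute_bill(minutes: float, rate_per_minute: float) -> float:
    """Bill for an API metered by the minute."""
    return minutes * rate_per_minute


def per_char_bill(minutes: float, price_per_million: float) -> float:
    """Bill for the same audio on an API metered per 1M characters."""
    chars = minutes * CHARS_PER_MINUTE
    return chars * price_per_million / 1_000_000


def as_price_per_million(rate_per_minute: float) -> float:
    """Express a per-minute rate as an equivalent price per 1M characters."""
    return rate_per_minute / CHARS_PER_MINUTE * 1_000_000


for minutes in (1_000, 10_000, 100_000):
    metered = per_minute_bill(minutes, 0.09)
    per_char = per_char_bill(minutes, 4.0)
    print(f"{minutes:>7,} min: ${metered:,.2f} per-minute vs ${per_char:,.2f} per-character")

print(f"$0.09/min is about ${as_price_per_million(0.09):,.0f} per 1M chars")
```

Changing `CHARS_PER_MINUTE` shifts the per-character column proportionally, so the gap narrows or widens with the speaking rate of your chosen voice.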

5. Flat-Rate Subscriptions

Best for: When you need cloud features (voice variety, team collaboration, API access) and want predictable monthly costs.

Flat-rate subscriptions are not the cheapest option, but they reduce meter anxiety. You pay a fixed amount each month and generate up to the plan's included allowance.

Service	Price	Limits	Effective Per-Minute (at cap)
ElevenLabs Creator	$22/mo	100K chars	~$0.20/min
ElevenLabs Pro	$99/mo	500K chars	~$0.18/min
ElevenLabs Scale	$330/mo	2M chars	~$0.15/min
PlayHT Premium	$29/mo	1M chars	~$0.026/min
Murf Creator	$19/mo	24 hrs/yr	~$0.16/min

Flat-rate plans work best when your usage is consistent and fits within the included allowance. The risk is overage — ElevenLabs charges $0.12–0.30 per 1K chars above your plan limit, which can quickly erase the cost advantage.

The subscription trap: Many metered APIs present themselves as subscriptions (ElevenLabs Creator is $22/mo “for 100K chars”). This looks like a flat rate, but go over the allowance by 10% and you pay overage. The metered pricing is concealed inside a subscription wrapper.
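
To see how an allowance and an overage rate interact, here is a small sketch. It assumes ~900 chars per minute of speech and uses the high end of the quoted overage range ($0.30 per 1K chars); both function names are hypothetical.

```python
CHARS_PER_MINUTE = 900  # speaking-rate assumption; real speech varies


def effective_per_minute(plan_price: float, included_chars: int) -> float:
    """Per-minute cost if the whole allowance is used."""
    included_minutes = included_chars / CHARS_PER_MINUTE
    return plan_price / included_minutes


def monthly_bill(plan_price: float, included_chars: int, used_chars: int,
                 overage_per_1k: float) -> float:
    """Plan price plus overage for characters beyond the allowance."""
    extra_chars = max(0, used_chars - included_chars)
    return plan_price + extra_chars / 1000 * overage_per_1k


# $22/mo plan with 100K chars, exceeded by 10%, at the high overage rate
print(f"{effective_per_minute(22, 100_000):.3f} $/min at cap")
print(f"${monthly_bill(22, 100_000, 110_000, 0.30):.2f} with 10% overage")
```

Running the same function across a few months of real usage data is a quick way to check whether a plan tier actually fits.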

Decision Matrix: Which Approach at What Volume

Monthly Volume	Best Approach	Monthly Cost	Why
< 10K chars	Built-in / free	$0	No need to pay for anything at this volume.
10K–100K chars	Cheap per-character API	$0.01–$0.40	Google Gemini Flash ($0.50/1M) or Polly Standard ($4/1M) cost pennies.
100K–1M chars	Per-character API or self-host	$0.40–$4.00	Google/Amazon APIs stay cheap. Or self-host Kokoro if you have a machine.
1M–10M chars	Self-host or one-time app	$0–$60	Self-host Kokoro ($0 marginal) or Chatterbox on a mid GPU ($60/mo rental).
10M–100M chars	Self-host dedicated	$110–$619	Qwen3-TTS or Chatterbox on a dedicated GPU. Cost is flat regardless of volume.
100M+ chars	Self-host enterprise	$619+	Azure commitment ($7.50/1M) can also compete at this scale.

Key insight: The one-time-purchase offline app (Spokio, Balabolka) wins at every volume bracket for its platform because the cost is $0 after purchase. The only reason to use anything else is if you need cross-platform support, specific cloud voices, or multi-language coverage the app does not provide.
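
The decision matrix above can be expressed as a simple lookup. This is only a sketch of the table's volume brackets; it deliberately ignores platform, language, and voice requirements, and the names are illustrative.

```python
from bisect import bisect_right

# Upper bound of each bracket (monthly characters) and the matrix's suggestion.
BRACKETS = [
    (10_000, "Built-in / free"),
    (100_000, "Cheap per-character API"),
    (1_000_000, "Per-character API or self-host"),
    (10_000_000, "Self-host or one-time app"),
    (100_000_000, "Self-host dedicated"),
]
FALLBACK = "Self-host enterprise"  # 100M+ chars


def suggest_approach(monthly_chars: int) -> str:
    """Return the matrix's suggested approach for a monthly character volume."""
    bounds = [upper for upper, _ in BRACKETS]
    index = bisect_right(bounds, monthly_chars)
    return BRACKETS[index][1] if index < len(BRACKETS) else FALLBACK


print(suggest_approach(5_400_000))  # roughly 100 hours of audio per month
```

A volume that lands exactly on a boundary is assigned to the higher bracket, matching how the ranges in the table are written.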

Summary

The per-minute TTS pricing trap is easy to fall into because the first 15 minutes are free and the first thousand characters are cheap. But TTS scales. An app that generates 10,000 minutes of audio monthly (about 167 hours) will pay $900–1,000/month on per-minute APIs. The same volume costs ~$36 via Google Standard, ~$60 via a self-hosted GPU, or $0 after a one-time app purchase.

Five ways to avoid the meter:

Spokio — $49.99 lifetime, fully offline, voice cloning included, macOS.
Built-in voices — macOS Spoken Content, Edge Read Aloud, Web Speech API. Free but limited.
Self-host open-source — Kokoro (CPU), Chatterbox (4GB GPU), Qwen3-TTS. Zero marginal cost.
Cheap per-character APIs — Google Gemini $0.50/1M, Polly $4/1M, Azure $7.50/1M (commitment).
Flat-rate subscriptions — Predictable monthly cost, but watch for overage.

Pick the approach that matches your volume and platform. If you are on Mac and want to never think about TTS pricing again, the one-time purchase route is hard to beat.

For more comparisons, read Local TTS vs Cloud TTS: Which Is Better? and AI Voiceover Cost Comparison: Cloud Subscriptions vs Local TTS.
