Tags: tts pricing, cloud tts, local tts, developers, cost comparison, self-hosted tts

How to Avoid Per-Minute TTS Pricing: A Developer's Guide to 2026

Per-minute and per-character TTS APIs can cost thousands of dollars at scale. Here are five ways developers can avoid the meter — from one-time-purchase apps and self-hosted models to cheaper APIs and built-in browser voices.

Published on Jun 08, 2026 · 12 min read

If you are building an app, service, or pipeline that generates speech at scale, per-minute TTS pricing is a trap you do not want to fall into.

ElevenLabs charges $0.09–0.10 per minute for its conversational AI agents. At 10,000 minutes per month (about 5.5 hours per day), that is $900–1,000. Deepgram’s voice agent API runs $0.075–0.163 per minute. PlayHT and Murf meter by characters with sharp overage rates. The costs add up fast when your app actually gets used.

The good news: you have better options. Here are five approaches to TTS that avoid the per-minute meter, ranked by effort-to-savings ratio.


1. One-Time Purchase Offline Apps

Best for: Individual developers, small teams, Mac or Windows users who want zero marginal cost with minimal setup.

A one-time-purchase offline app is the simplest escape from usage-based pricing. Buy once, generate forever. No meter, no characters counted, no minutes tracked.

| App | Platform | Price | Voice Cloning | Export Formats |
|---|---|---|---|---|
| Spokio | macOS | Free + $49.99 lifetime Pro | Yes (free tier) | MP3, WAV, AIFF, M4A |
| Balabolka | Windows | Free | No | MP3, WAV, OGG, WMA |
| macOS Spoken Content | macOS | Free (built-in) | No | No built-in export |

Spokio is a native macOS app that runs entirely offline. It is powered by Chatterbox Turbo and supports local voice cloning from short samples on the free plan. The lifetime Pro ($49.99) unlocks batch export, unlimited custom voices, and priority support — with no subscription and no per-character billing.

Balabolka is a free Windows application that uses your system’s SAPI voices. It reads PDFs, DOCX, EPUB, and more, and exports to audio files with no limits, ads, or subscriptions. The app is freeware; voice quality depends on what you have installed.

macOS Spoken Content is built into every Mac. Enable it in System Settings → Accessibility → Spoken Content. You get decent neural voices (downloadable for free) and system-wide text selection reading. No export built in, but for quick listening it is zero-cost and zero-setup.

Cost at volume: $0 after initial purchase. Year one amortized cost is under $0.001/min at 100 hours per month. After year one, it costs literally zero.
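
The amortized figure is simple division. A minimal sketch, assuming the $49.99 lifetime price and 100 hours of audio per month used in this section (the function name is illustrative, not part of any app):

```python
def amortized_cost_per_minute(purchase_usd: float, hours_per_month: float,
                              months: int = 12) -> float:
    """Spread a one-time purchase over the minutes of audio generated in `months`."""
    total_minutes = hours_per_month * 60 * months
    return purchase_usd / total_minutes


# $49.99 spread over one year at 100 hours per month (72,000 minutes of audio).
year_one = amortized_cost_per_minute(49.99, hours_per_month=100)
print(f"${year_one:.5f} per minute in year one")
```

Lower monthly usage raises the per-minute figure proportionally, so rerun the calculation with your own volume before comparing against an API.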


2. Free Built-In and Browser Voices

Best for: Zero-budget projects, prototyping, personal use, and apps where voice quality is not the priority.

Before committing to any paid TTS solution, check what you already have.

| Source | Platform | Voice Quality | Offline | Export |
|---|---|---|---|---|
| Microsoft Edge Read Aloud | Windows, Mac, Linux, Android | Good (Microsoft Neural) | Partial | No (use edge-tts Python lib) |
| Web Speech API | Chrome, Edge, Firefox, Safari, Opera | OS-dependent | OS-dependent | No |
| macOS Spoken Content | macOS | Decent (Apple Neural) | Yes | No |

Microsoft Edge Read Aloud uses Microsoft’s neural TTS voices, which are genuinely good — comparable to many paid services. It works on any platform Edge runs on. For developers, the edge-tts Python library (MIT license) accesses the same voices programmatically without requiring the browser or an API key.
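
A minimal sketch of that programmatic route, assuming the edge-tts package is installed (`pip install edge-tts`) and that the voice name `en-US-AriaNeural` is available. The function name and output path are illustrative. The library calls Microsoft's online service, so it needs a network connection:

```python
import asyncio

# Assumed voice name; list the available voices with `edge-tts --list-voices`.
DEFAULT_VOICE = "en-US-AriaNeural"


async def synthesize(text: str, out_path: str, voice: str = DEFAULT_VOICE) -> None:
    """Write `text` as spoken audio to `out_path` (MP3) using edge-tts."""
    import edge_tts  # imported lazily so this module loads without the package

    communicate = edge_tts.Communicate(text, voice)
    await communicate.save(out_path)


# Usage (requires network access):
# asyncio.run(synthesize("Hello from edge-tts.", "hello.mp3"))
```

Because this relies on the same service the browser uses, treat it as a prototyping tool and check the service terms before building a product on it.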

Web Speech API is the simplest path to TTS in a browser app. A single line of JavaScript (window.speechSynthesis.speak(new SpeechSynthesisUtterance(text))) gives you speech on any modern browser. Quality depends entirely on the OS: Windows 11 neural voices are good, macOS voices are decent, Linux voices are robotic. There is no export and no voice cloning, and SSML is not supported in all implementations.

macOS Spoken Content is system-wide. Select text, press a keyboard shortcut, and the Mac reads it aloud. Voices are Apple’s neural TTS, which are free to download and work offline.

Cost at volume: $0 forever. But you get what you pay for — no voice cloning, limited language support, no emotion control, and no batch export without workarounds.


3. Self-Host Open-Source Models

Best for: Developers with GPU access, high-volume generation, privacy-sensitive workloads, and anyone who wants zero marginal cost per character.

Open-source TTS models have improved dramatically. Several now match cloud quality on standard benchmarks. After the initial hardware investment, each generated character costs nothing.

| Model | Params | VRAM | Voice Cloning | Languages | RTF (RTX 4090) | License |
|---|---|---|---|---|---|---|
| Kokoro | 82M | <1 GB | No | 8 | ~30x | Apache 2.0 |
| Chatterbox-Turbo | 350M | ~4 GB | Yes (10s clip) | 1 (23 multi) | ~6-8x | MIT |
| Qwen3-TTS 0.6B | 600M | ~3 GB | Yes (3s clip) | 10 | ~2-3x | Apache 2.0 |
| Qwen3-TTS 1.7B | 1.7B | ~5.4 GB | Yes (3s clip) | 10 | ~0.65-0.85x | Apache 2.0 |
| Coqui XTTS-v2 | 470M | ~2-4 GB | Yes (6s clip) | 17 | ~5x | MPL-2.0 |
| Fish Audio S2-Pro | Open | ~4-8 GB | Yes | 80+ | Fast (100ms TTFA) | Apache 2.0 |
| Orpheus | 3B | ~7+ GB | No | English | Moderate | Llama 3.2 Community |

Kokoro is the standout for low-resource self-hosting. At 82M parameters, it runs on CPU, a Raspberry Pi, or any GPU with <1 GB VRAM. A real-time factor of ~30x (measured on an RTX 4090, per the table above) means it generates about 30 seconds of audio per second of computation; expect lower throughput on mid-range GPUs and CPUs. No voice cloning, but for straightforward narration it is essentially free after hardware.
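
To turn a real-time factor into capacity planning, divide the audio duration by the speed multiple. A rough sketch, treating the table's "~30x" style figures as seconds of audio produced per second of compute. The function names are illustrative, and real throughput also depends on batching, text length, and hardware:

```python
def compute_seconds(audio_seconds: float, speed_factor: float) -> float:
    """Seconds of compute needed to synthesize `audio_seconds` of speech."""
    if speed_factor <= 0:
        raise ValueError("speed_factor must be positive")
    return audio_seconds / speed_factor


def audio_hours_per_day(speed_factor: float, busy_hours: float = 24.0) -> float:
    """Hours of audio one device can produce if kept busy for `busy_hours`."""
    return speed_factor * busy_hours


# 100 hours of narration at ~30x versus ~0.75x (midpoint of the 0.65-0.85x range).
for label, factor in [("Kokoro ~30x", 30.0), ("Qwen3-TTS 1.7B ~0.75x", 0.75)]:
    hours = compute_seconds(100 * 3600, factor) / 3600
    print(f"{label}: {hours:.1f} compute hours for 100 audio hours")
```

A factor below 1x means the model is slower than real time, which matters for interactive use even when batch generation overnight is acceptable.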

Chatterbox-Turbo is the engine powering Spokio. At 350M parameters with MIT license, it fits on a 4 GB GPU and supports voice cloning from 10-second audio clips. The 1.7B Qwen3-TTS model is slower but offers 3-second voice cloning across 10 languages.

Break-even analysis vs cloud APIs:

| Model | Hardware Cost | Monthly GPU Rental | Break-even vs ElevenLabs Flash | Break-even vs OpenAI TTS-1 |
|---|---|---|---|---|
| Kokoro (CPU) | $0 (existing hardware) | $0 | Immediate | Immediate |
| Kokoro (RTX 4060) | $300 | ~$60/mo | ~2,500 chars/day | ~4,000 chars/day |
| Qwen3-TTS (RTX 3090) | $1,600 (one-time) | ~$110/mo | ~29,000 utterances/mo | ~37,000 utterances/mo |
| Qwen3-TTS (L40S dedicated) | $6,000 | ~$619/mo | ~29,000 utterances/mo | Never (GPU costs more) |

Self-hosting breaks even fastest when you already own the hardware. If you have a desktop with a decent GPU, Kokoro or Chatterbox-Turbo can serve unlimited TTS at zero extra cost.


4. Cheap Per-Character APIs

Best for: Low-to-medium volume, multi-language needs, or when you need cloud-quality voices without per-minute billing.

Not every cloud API charges by the minute. Several major providers bill per character at rates that are dramatically cheaper than per-minute services.

| Provider | Model | $ per 1M chars | Free Tier |
|---|---|---|---|
| Google Gemini Flash TTS | Flash-Lite Preview | $0.50 | $300 GCP credits |
| OpenAI GPT-4o Mini TTS | TTS | $0.60 | 3 RPM / 200 RPD |
| Cloudflare MeloTTS | Workers AI | $0.29 | Yes (limited) |
| Amazon Polly Standard | Standard | $4.00 | 5M chars/mo (first 12 mo) |
| Google Cloud Standard | Standard | $4.00 | 4M chars/mo (forever) |
| xAI Grok TTS | Standard | $4.00 | |
| Azure Neural TTS | Neural | $15.00 | 5M chars/mo (forever) |
| OpenAI TTS-1 | Standard | $15.00 | Rate-limited free tier |
| Fish Audio S2 Pro | Neural | $15.00 | ~7 min/mo free |
| Deepgram Aura-1 | TTS | $15.00 | $200 free credit |
| ElevenLabs Flash v2 | Fast | $8.00 | 10K chars/mo |
| Inworld TTS 1.5 Mini | Standard | $25.00 | Up to 40 min TTS free |

The cheapest options:

  • Google Gemini Flash TTS at $0.50/1M chars is shockingly cheap. For 100 hours of audio (~5.4M chars), that is $2.70/month.
  • Amazon Polly Standard at $4/1M chars is a workhorse. 5M chars free per month for the first year.
  • Google Cloud Standard gives you 4M chars free forever, then $4/1M.

The trap to avoid: Per-minute APIs like ElevenLabs Agents ($0.09/min) and Deepgram Voice Agent ($0.075–0.163/min) sound cheap at low volume but scale linearly. At 10,000 minutes/month, ElevenLabs costs $900. The same volume via Google Standard ($4/1M chars at ~900 chars/min) costs ~$36.

| Provider | Cost for 1K min | Cost for 10K min | Cost for 100K min |
|---|---|---|---|
| ElevenLabs Agents ($0.09/min) | $90 | $900 | $9,000 |
| Deepgram Voice Agent ($0.075/min) | $75 | $750 | $7,500 |
| Google Standard ($4/1M chars) | ~$3.60 | ~$36 | ~$360 |
| OpenAI TTS-1 ($15/1M chars) | ~$13.50 | ~$135 | ~$1,350 |
| Azure Neural ($15/1M chars) | ~$13.50 | ~$135 | ~$1,350 |
| Azure Commitment ($7.50/1M at 2B) | ~$6.75 | ~$67.50 | ~$675 |
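
The rows above follow from one conversion. A sketch assuming the article's figure of about 900 characters per minute of speech (the function names are illustrative):

```python
CHARS_PER_MINUTE = 900  # the article's assumed speaking rate


def per_minute_cost(minutes: float, usd_per_minute: float) -> float:
    """Monthly bill for an API that meters by audio minute."""
    return minutes * usd_per_minute


def per_char_cost(minutes: float, usd_per_million_chars: float,
                  chars_per_minute: int = CHARS_PER_MINUTE) -> float:
    """Monthly bill for an API that meters by input character."""
    chars = minutes * chars_per_minute
    return chars / 1_000_000 * usd_per_million_chars


for minutes in (1_000, 10_000, 100_000):
    agents = per_minute_cost(minutes, 0.09)  # ElevenLabs Agents rate from the table
    google = per_char_cost(minutes, 4.00)    # Google Standard rate from the table
    print(f"{minutes:>7} min: per-minute ${agents:,.2f} vs per-character ${google:,.2f}")
```

The characters-per-minute value drives every per-character estimate, so adjust it if your content is read faster or slower than that.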

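Per-character APIs are not a trap. Per-minute APIs are. Both scale linearly with usage, but the unit price differs sharply: at about 900 characters per minute, $4 per 1M characters works out to roughly $0.0036 per minute, about 25 times less than $0.09 per minute. Per-character billing is also more transparent, while per-minute billing often hides behind “included minutes” and “overage rates” that spike when you exceed your plan.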


5. Flat-Rate Subscriptions

Best for: When you need cloud features (voice variety, team collaboration, API access) and want predictable monthly costs.

Flat-rate subscriptions are not the cheapest option, but they reduce meter anxiety. You pay a fixed amount and generate as much as you need, up to the plan's included allowance.

| Service | Price | Limits | Effective Per-Minute (at cap) |
|---|---|---|---|
| ElevenLabs Creator | $22/mo | 100K chars | ~$0.20/min |
| ElevenLabs Pro | $99/mo | 500K chars | ~$0.18/min |
| ElevenLabs Scale | $330/mo | 2M chars | ~$0.15/min |
| PlayHT Premium | $29/mo | 1M chars | ~$0.026/min |
| Murf Creator | $19/mo | 24 hrs/yr | ~$0.16/min |

Flat-rate plans work best when your usage is consistent and fits within the included allowance. The risk is overage — ElevenLabs charges $0.12–0.30 per 1K chars above your plan limit, which can quickly erase the cost advantage.

The subscription trap: Many metered APIs present themselves as subscriptions (ElevenLabs Creator is $22/mo “for 100K chars”). This looks like a flat rate, but go over the allowance by 10% and you pay overage. The metered pricing is concealed inside a subscription wrapper.
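
To see how a capped plan behaves, compute its effective per-minute rate and what an overage does to the bill. A sketch assuming about 900 characters per minute and the $0.30 per 1K upper overage rate quoted above. The names are illustrative, and actual plan terms may differ:

```python
def effective_per_minute(plan_usd: float, included_chars: int,
                         chars_per_minute: int = 900) -> float:
    """Per-minute rate if the full character allowance is used."""
    included_minutes = included_chars / chars_per_minute
    return plan_usd / included_minutes


def bill_with_overage(plan_usd: float, included_chars: int, used_chars: int,
                      overage_usd_per_1k: float) -> float:
    """Plan price plus overage on characters beyond the allowance."""
    extra_chars = max(0, used_chars - included_chars)
    return plan_usd + extra_chars / 1_000 * overage_usd_per_1k


# A $22 plan with 100K characters, used 10% over the cap at $0.30 per 1K.
print(round(effective_per_minute(22, 100_000), 3))
print(round(bill_with_overage(22, 100_000, 110_000, 0.30), 2))
```

Using less than the allowance raises the effective rate, since the plan price stays fixed while the minutes generated fall.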


Decision Matrix: Which Approach at What Volume

| Monthly Volume | Best Approach | Monthly Cost | Why |
|---|---|---|---|
| < 10K chars | Built-in / free | $0 | No need to pay for anything at this volume. |
| 10K–100K chars | Cheap per-character API | $0.01–$0.40 | Google Gemini Flash ($0.50/1M) or Polly Standard ($4/1M) cost pennies. |
| 100K–1M chars | Per-character API or self-host | $0.40–$4.00 | Google/Amazon APIs stay cheap. Or self-host Kokoro if you have a machine. |
| 1M–10M chars | Self-host or one-time app | $0–$60 | Self-host Kokoro ($0 marginal) or Chatterbox on a mid GPU ($60/mo rental). |
| 10M–100M chars | Self-host dedicated | $110–$619 | Qwen3-TTS or Chatterbox on a dedicated GPU. Cost is flat regardless of volume. |
| 100M+ chars | Self-host enterprise | $619+ | Azure commitment ($7.50/1M) can also compete at this scale. |
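
One way to sanity-check the brackets in the matrix is to compare a flat monthly cost with a per-character rate. A simple sketch, assuming a fixed GPU rental and ignoring electricity, engineering time, and free tiers (the function name is illustrative):

```python
def break_even_chars_per_month(flat_usd_per_month: float,
                               usd_per_million_chars: float) -> float:
    """Monthly character volume at which a flat cost equals metered API spend."""
    if usd_per_million_chars <= 0:
        raise ValueError("API price must be positive")
    return flat_usd_per_month / usd_per_million_chars * 1_000_000


# A ~$110/mo GPU rental versus a $4 per 1M chars API.
volume = break_even_chars_per_month(110, 4.00)
print(f"Self-hosting wins above ~{volume / 1_000_000:.1f}M chars/month")
```

Below the break-even volume the metered API is cheaper; above it the flat cost wins, provided one GPU can actually keep up with the load.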

Key insight: The one-time-purchase offline app (Spokio, Balabolka) wins at every volume bracket for its platform because the cost is $0 after purchase. The only reason to use anything else is if you need cross-platform support, specific cloud voices, or multi-language coverage the app does not provide.


Summary

The per-minute TTS pricing trap is easy to fall into because the first 15 minutes are free and the first thousand characters are cheap. But TTS scales. An app that generates 10,000 minutes (about 167 hours) of audio monthly will pay $900–1,000/month on per-minute APIs. The same volume costs ~$36 via Google Standard, ~$60 via a self-hosted GPU, or $0 after a one-time app purchase.

Five ways to avoid the meter:

  1. Spokio — $49.99 lifetime, fully offline, voice cloning included, macOS.
  2. Built-in voices — macOS Spoken Content, Edge Read Aloud, Web Speech API. Free but limited.
  3. Self-host open-source — Kokoro (CPU), Chatterbox (4GB GPU), Qwen3-TTS. Zero marginal cost.
  4. Cheap per-character APIs — Google Gemini $0.50/1M, Polly $4/1M, Azure $7.50/1M (commitment).
  5. Flat-rate subscriptions — Predictable monthly cost, but watch for overage.

Pick the approach that matches your volume and platform. If you are on Mac and want to never think about TTS pricing again, the one-time purchase route is hard to beat.


For more comparisons, read Local TTS vs Cloud TTS: Which Is Better? and AI Voiceover Cost Comparison: Cloud Subscriptions vs Local TTS.
