Most developers reach for a cloud API when they need text-to-speech. It is the obvious path: one API key, one HTTP call, and audio comes back. But once you start building a real product around TTS — where users generate hours of speech daily, need offline support, or care about data privacy — the web-app-only approach breaks down.
Electron changes the calculation. With Electron, you ship a full Node.js runtime alongside your app, which means you can bundle ONNX-based TTS models directly into the installer, run inference on the user’s GPU, and never touch a network request for speech generation. The app works offline by default. There is no per-character API billing. Audio data never leaves the machine.
This guide covers the architecture decisions, code patterns, and performance tradeoffs involved in building a production-quality TTS desktop app with Electron.
Why Desktop Instead of Web
A web app can call a TTS API and serve audio back to the browser. This works for low-volume use cases. But the architectural limitations become apparent at scale:
Latency. Every API call adds network round-trip time on top of model inference. Even with streaming, the first audio byte is delayed by at least one TLS handshake and HTTP round trip. A bundled local model eliminates the network entirely, so first audio can arrive quickly on the same machine.
Cost. Cloud TTS APIs charge per character. At 100,000 characters per day (roughly 90 minutes of speech), the cost ranges from $1.50 to $16.00 daily depending on the provider. A bundled model runs entirely on the user’s hardware with zero marginal cost.
Privacy. Every text sent to a TTS API is processed on someone else’s server. For enterprise users handling legal documents, medical records, or internal communications, this is a hard blocker. Local inference means the text never leaves the process boundary.
Offline availability. Web apps stop working when the network drops. Desktop apps with local models work on airplanes, in remote locations, and in air-gapped environments.
Audio quality. Local models like Kokoro-82M now score 4.5 MOS on the TTS Arena — competitive with cloud providers. The gap between on-device and cloud TTS quality has nearly closed.
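The daily cost range quoted above is plain per-character arithmetic. A minimal sketch, assuming hypothetical list prices of $15 and $160 per million characters (the endpoints implied by that range, not any specific provider's published rates):

```typescript
// Estimate daily cloud TTS spend from character volume.
// pricePerMillionUsd is an assumed list price, not a quoted provider rate.
function dailyTtsCostUsd(charsPerDay: number, pricePerMillionUsd: number): number {
  return (charsPerDay / 1_000_000) * pricePerMillionUsd;
}

const lowEstimate = dailyTtsCostUsd(100_000, 15);
const highEstimate = dailyTtsCostUsd(100_000, 160);
console.log(`$${lowEstimate.toFixed(2)} to $${highEstimate.toFixed(2)} per day`);
```

Multiplying by the number of active users gives a rough monthly bill to compare against the one-time cost of shipping a model.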
Electron bridges the gap between web development convenience and desktop-native performance. If you already have a React-based web TTS app, porting it to Electron and adding local inference is a surprisingly incremental engineering effort.
Electron Architecture for TTS
An Electron TTS app has two distinct runtime environments: the main process (Node.js) and the renderer process (Chromium). The TTS inference engine belongs in the main process for three reasons:
- File system access. Model weights are large files (500MB-2GB). The main process can read them with Node.js fs without sandbox restrictions.
- Native bindings. ONNX Runtime is a native Node.js addon. It runs in the main process, not the renderer.
- Process isolation. Heavy inference stays out of the renderer, so a slow generation or a memory spike does not stall UI rendering. Note that a crash in the main process still takes the whole app down; if the UI must survive a model crash, run inference in a utility process that the main process can restart.
The renderer handles everything UI-related: text input, voice selection, playback controls, and waveform visualization. It communicates with the main process through Electron’s IPC bridge.
┌─────────────────────────────┐     ┌──────────────────────────────┐
│ Renderer Process            │     │ Main Process                 │
│                             │     │                              │
│ Text Editor                 │     │ Model Manager                │
│ Voice Selector              │     │  └─ ONNX Runtime Session     │
│ Playback Controls           │     │  └─ GPU Backend Detection    │
│ Waveform Visualization      │     │  └─ Audio Buffer Pool        │
│ Progress Bar                │     │                              │
│ Text Preprocessor (SSML)    │     │ Audio Pipeline               │
│                             │     │  └─ WAV Encoder              │
│        │                    │     │  └─ Audio Device Output      │
│        │ IPC (contextBridge)│     │                              │
│        ▼                    │     │ File System                  │
│ ipcRenderer.invoke() ───────┼─────┼──► ipcMain.handle()          │
│ ipcRenderer.on() ◄──────────┼─────┼──► event.sender.send()       │
└─────────────────────────────┘     └──────────────────────────────┘
IPC Channel Design
Define your IPC channels in a shared constants file:
// src/shared/ipc-channels.ts
export const IPC = {
  TTS_GENERATE: 'tts:generate',
  TTS_GENERATE_CHUNKED: 'tts:generate-chunked',
  TTS_CANCEL: 'tts:cancel',
  TTS_PROGRESS: 'tts:progress',
  TTS_MODEL_STATUS: 'tts:model-status',
  TTS_GPU_INFO: 'tts:gpu-info',
  TTS_LOAD_MODEL: 'tts:load-model',
} as const;
The renderer sends text through contextBridge-exposed APIs:
// src/preload/index.ts
import { contextBridge, ipcRenderer } from 'electron';
// A sandboxed preload can only import local files if the preload script is bundled
import { IPC } from '../shared/ipc-channels';
contextBridge.exposeInMainWorld('tts', {
  generate: (text: string, voice: string) =>
    ipcRenderer.invoke(IPC.TTS_GENERATE, { text, voice }),
  generateChunked: (text: string, voice: string) => {
    // Keep the promise so callers can await completion and observe errors
    const done = ipcRenderer.invoke(IPC.TTS_GENERATE_CHUNKED, { text, voice });
    // Subscribe to progress; the returned function removes the listener
    const onProgress = (callback: (progress: number) => void) => {
      const handler = (_event: unknown, progress: number) => callback(progress);
      ipcRenderer.on(IPC.TTS_PROGRESS, handler);
      return () => ipcRenderer.removeListener(IPC.TTS_PROGRESS, handler);
    };
    return { onProgress, done };
  },
  cancel: () => ipcRenderer.invoke(IPC.TTS_CANCEL),
  getGpuInfo: () => ipcRenderer.invoke(IPC.TTS_GPU_INFO),
  loadModel: (modelId: string) =>
    ipcRenderer.invoke(IPC.TTS_LOAD_MODEL, { modelId }),
});
On the main process side:
// src/main/ipc-handlers.ts
import { ipcMain } from 'electron';
import { TtsEngine } from './tts-engine';
import { IPC } from '../shared/ipc-channels';
export function registerIpcHandlers(engine: TtsEngine) {
  ipcMain.handle(IPC.TTS_GENERATE, async (_event, { text, voice }) => {
    return engine.generate(text, voice);
  });
  ipcMain.handle(IPC.TTS_GENERATE_CHUNKED, async (event, { text, voice }) => {
    // generateChunked is an async generator that yields one item per finished chunk
    const chunks: Float32Array[] = [];
    for await (const { audio, progress } of engine.generateChunked(text, voice)) {
      chunks.push(audio);
      // Report progress to the window that made the request
      if (!event.sender.isDestroyed()) {
        event.sender.send(IPC.TTS_PROGRESS, progress);
      }
    }
    return chunks;
  });
  ipcMain.handle(IPC.TTS_CANCEL, () => {
    engine.cancel();
  });
  ipcMain.handle(IPC.TTS_GPU_INFO, () => {
    return engine.detectGpuBackend();
  });
  ipcMain.handle(IPC.TTS_LOAD_MODEL, async (_event, { modelId }) => {
    await engine.loadModel(modelId);
  });
}
The renderer never touches the model directly. It sends strings, receives audio buffers, and updates the UI. This clean separation makes it straightforward to swap the inference backend without changing a single line of UI code.
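Because the main process acts on whatever arrives over IPC, it can help to validate payloads before they reach the engine. A minimal sketch, assuming a hypothetical `parseGenerateRequest` helper and an arbitrary 100,000-character cap (neither is part of the design described above):

```typescript
interface GenerateRequest {
  text: string;
  voice: string;
}

// Illustrative limit only; it is not a constraint of any particular model.
const MAX_TEXT_LENGTH = 100_000;

// Hypothetical guard for the payload sent on the 'tts:generate' channel.
function parseGenerateRequest(payload: unknown): GenerateRequest {
  if (typeof payload !== 'object' || payload === null) {
    throw new TypeError('payload must be an object');
  }
  const { text, voice } = payload as Record<string, unknown>;
  if (typeof text !== 'string' || text.trim().length === 0) {
    throw new TypeError('text must be a non-empty string');
  }
  if (text.length > MAX_TEXT_LENGTH) {
    throw new RangeError(`text exceeds ${MAX_TEXT_LENGTH} characters`);
  }
  if (typeof voice !== 'string' || voice.length === 0) {
    throw new TypeError('voice must be a non-empty string');
  }
  return { text, voice };
}
```

Calling a guard like this at the top of each handler turns malformed requests into rejected promises in the renderer instead of failures deep inside inference.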
Model Bundling Strategies
When shipping a TTS model with an Electron app, you have two strategies with very different tradeoffs.
Strategy 1: Bundle in the Installer
Ship the ONNX model inside extraResources in electron-builder:
# electron-builder.yml
extraResources:
  - from: models/
    to: models/
    filter:
      - "**/*.onnx"
      - "**/*.json"
Access the model path at runtime:
const modelPath = path.join(process.resourcesPath, 'models', 'kokoro-82m.onnx');
Pros:
- Zero network dependency. The app is fully functional offline on first launch.
- Predictable install size. Users know what they are downloading.
- No post-install download failures.
Cons:
- Large installer. Kokoro-82M in ONNX format is approximately 320MB (float32) or 160MB (float16). A multi-model app with 3-4 voices can add 1GB+ to the installer.
- Hard to update models without shipping a new app version.
- Users pay the download cost even if they never use certain voices.
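The size figures above follow from parameter count times bytes per weight. A rough sketch, assuming 82 million parameters and ignoring graph metadata and any tensors stored at a different precision:

```typescript
type Precision = 'float32' | 'float16' | 'int8';

const BYTES_PER_WEIGHT: Record<Precision, number> = {
  float32: 4,
  float16: 2,
  int8: 1,
};

// Approximate on-disk size in megabytes (1 MB = 1,000,000 bytes).
function estimateModelSizeMb(paramCount: number, precision: Precision): number {
  return (paramCount * BYTES_PER_WEIGHT[precision]) / 1_000_000;
}

console.log(estimateModelSizeMb(82_000_000, 'float32')); // 328
console.log(estimateModelSizeMb(82_000_000, 'float16')); // 164
```

The same estimate is useful when budgeting installer size for several voices at once.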
Strategy 2: Download on First Run
Ship a minimal installer and download model weights on first launch:
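import { app } from 'electron';
import { createWriteStream, existsSync } from 'fs';
import { mkdir, rename } from 'fs/promises';
import path from 'path';
import { Readable, Transform } from 'stream';
import { pipeline } from 'stream/promises';
import { IPC } from '../shared/ipc-channels'; // adjust the relative path to your layout
// Assumes mainWindow (the app's BrowserWindow) is in scope in this module
async function ensureModel(modelId: string): Promise<string> {
  const modelDir = path.join(app.getPath('userData'), 'models', modelId);
  const modelFile = path.join(modelDir, 'model.onnx');
  if (existsSync(modelFile)) {
    return modelFile;
  }
  // Notify renderer of download start
  mainWindow?.webContents.send(IPC.TTS_PROGRESS, { phase: 'downloading', progress: 0 });
  const response = await fetch(`https://models.example.com/${modelId}/model.onnx`);
  if (!response.ok || !response.body) {
    throw new Error(`Model download failed: HTTP ${response.status}`);
  }
  const totalSize = parseInt(response.headers.get('content-length') || '0', 10);
  let downloadedSize = 0;
  await mkdir(modelDir, { recursive: true });
  // Write to a temporary name so an interrupted download is never mistaken for a complete model
  const partFile = `${modelFile}.part`;
  await pipeline(
    // fetch() returns a web ReadableStream; convert it to a Node.js stream
    Readable.fromWeb(response.body as any),
    new Transform({
      transform(chunk, _encoding, callback) {
        downloadedSize += chunk.length;
        const progress = totalSize ? downloadedSize / totalSize : 0;
        mainWindow?.webContents.send(IPC.TTS_PROGRESS, { phase: 'downloading', progress });
        callback(null, chunk);
      },
    }),
    createWriteStream(partFile)
  );
  await rename(partFile, modelFile);
  return modelFile;
}
Pros: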
- Small initial download (app shell is ~100MB).
- Users only download the models they need.
- Models can be updated independently of the app binary.
Cons:
- Requires internet on first launch (or first voice selection).
- Download failures or slow connections create a poor first impression.
- UserData directory can grow large without explicit management.
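One way to keep the userData directory in check is to evict the least recently used downloaded models once a size budget is exceeded. A minimal sketch of the selection logic only, assuming a hypothetical `CachedModel` record that the app maintains itself (deleting the files is left to the caller):

```typescript
interface CachedModel {
  id: string;
  sizeMb: number;
  lastUsedMs: number; // epoch milliseconds of the last successful load
}

// Returns the ids to delete, oldest first, until the remaining total fits the budget.
function selectModelsToEvict(models: CachedModel[], budgetMb: number): string[] {
  let total = models.reduce((sum, m) => sum + m.sizeMb, 0);
  const oldestFirst = [...models].sort((a, b) => a.lastUsedMs - b.lastUsedMs);
  const evict: string[] = [];
  for (const m of oldestFirst) {
    if (total <= budgetMb) break;
    evict.push(m.id);
    total -= m.sizeMb;
  }
  return evict;
}
```

Bundled models live under the app's resources rather than userData, so they would not appear in this list.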
Recommended Approach: Hybrid
Ship one small, fast model (Kokoro-82M at ~160MB in float16 ONNX) in the installer as the default voice. Allow downloading additional voices on demand. Show available voice sizes in the UI so users know what to expect:
interface VoiceModel {
  id: string;
  name: string;
  sizeMb: number;
  bundled: boolean;
  language: string;
}
const voices: VoiceModel[] = [
  { id: 'kokoro-default', name: 'Default (English)', sizeMb: 160, bundled: true, language: 'en' },
  { id: 'kokoro-female-1', name: 'Voice Pack 1 (English)', sizeMb: 160, bundled: false, language: 'en' },
  { id: 'qwen3-tts-0.6b', name: 'Qwen3 TTS Multilingual', sizeMb: 700, bundled: false, language: 'multi' },
];
This pattern is what you see in apps like OBS, Figma Desktop, and VS Code: ship a functional core, download optional extras.
GPU Detection and Acceleration
ONNX Runtime supports multiple execution providers (EPs) depending on the user’s hardware. The challenge is detecting which provider is available at runtime and selecting the best one.
Available Execution Providers
| Provider | Platform | GPU | Notes |
|---|---|---|---|
| cpu | All | No | Always available, baseline fallback |
| cuda | Windows/Linux | NVIDIA GPU | Requires CUDA 11.x+ and cuDNN |
| tensorrt | Windows/Linux | NVIDIA GPU | Faster than plain CUDA for some models |
| directml | Windows | Any DirectX 12 GPU | Works on AMD, Intel, NVIDIA |
| coreml | macOS | Apple GPU | M-series optimized |
| xnnpack | All | No | CPU, but much faster than default CPU EP |
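The table suggests a natural preference order per platform. A minimal sketch, assuming the short provider names `'cuda'`, `'dml'`, `'coreml'`, and `'cpu'` and a hypothetical `preferredProviders` helper; whether each provider is actually present depends on how the installed runtime was built:

```typescript
type Platform = 'win32' | 'darwin' | 'linux';

// Ordered from most to least preferred; 'cpu' is always the last resort.
function preferredProviders(platform: Platform): string[] {
  switch (platform) {
    case 'win32':
      return ['cuda', 'dml', 'cpu'];
    case 'darwin':
      return ['coreml', 'cpu'];
    case 'linux':
      return ['cuda', 'cpu'];
  }
}
```

In the main process the argument would typically come from `process.platform`, and the resulting list still has to be probed at runtime, since a preference order says nothing about what the user's machine supports.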
Runtime GPU Detection
// src/main/gpu-detector.ts
import * as ort from 'onnxruntime-node';
export interface GpuInfo {
  available: boolean;
  backend: 'cuda' | 'directml' | 'coreml' | 'cpu';
  // Not reported by ONNX Runtime's Node.js API; fill these from another source if needed
  deviceName?: string;
  vramMb?: number;
}
// probeModel: bytes of a tiny valid ONNX model, used only to test session creation
export async function detectBestBackend(probeModel: Uint8Array): Promise<GpuInfo> {
  // Try providers in priority order (onnxruntime-node uses short names: 'cuda', 'coreml', 'dml')
  const backends = [
    { name: 'cuda' as const, provider: 'cuda' },
    { name: 'coreml' as const, provider: 'coreml' },
    { name: 'directml' as const, provider: 'dml' },
  ];
  for (const backend of backends) {
    try {
      // Session creation is expected to fail when the provider is missing from this build
      const session = await ort.InferenceSession.create(probeModel, {
        executionProviders: [backend.provider],
      });
      await session.release();
      return { available: true, backend: backend.name };
    } catch {
      // Provider not available, try next
      continue;
    }
  }
  return { available: false, backend: 'cpu' };
}
Provider-Specific Optimizations
CUDA (NVIDIA): Set graphOptimizationLevel to 'all'; the JavaScript API takes the string values 'disabled', 'basic', 'extended', and 'all'. Tuning knobs such as cuDNN convolution algorithm search exist in the C/C++ and Python APIs, but the typed onnxruntime-node options for CUDA may expose only deviceId, so check the type definitions for your version before relying on extra keys:
const session = await ort.InferenceSession.create(modelPath, {
  executionProviders: [{ name: 'cuda', deviceId: 0 }, 'cpu'],
  graphOptimizationLevel: 'all',
  executionMode: 'parallel',
});
CoreML (Apple Silicon): The CoreML EP on M-series Macs uses the ANE (Apple Neural Engine) where possible. Set enableOnSubgraph: true to allow the EP to accelerate compatible subgraphs while falling back to CPU for unsupported ops. Option names have changed between releases, so verify them against your installed version:
const session = await ort.InferenceSession.create(modelPath, {
  executionProviders: [{ name: 'coreml', enableOnSubgraph: true }, 'cpu'],
});
DirectML (Windows): DirectML works with any DirectX 12 GPU. Disabling metacommands can improve stability on older AMD drivers, but that switch belongs to the native provider options and may not be reachable from the Node.js binding; deviceId is the option to rely on:
const session = await ort.InferenceSession.create(modelPath, {
  executionProviders: [{ name: 'dml', deviceId: 0 }, 'cpu'],
});
Resource Cleanup
GPU memory leaks are a real concern in long-running desktop apps. Expose cleanup through the IPC bridge:
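// Force release GPU resources
let currentSession: ort.InferenceSession | null = null;
async function releaseGpuSession(): Promise<void> {
  if (currentSession) {
    await currentSession.release();
    currentSession = null;
  }
  // Hint the garbage collector; global.gc exists only when V8 runs with --expose-gc
  if (global.gc) {
    global.gc();
  }
}
global.gc is only defined when V8 is started with --expose-gc. Because it is a V8 flag, pass it through js-flags (for example, electron --js-flags="--expose-gc" .). The Electron documentation notes that js-flags must be supplied at startup to take effect in the main process, so calling app.commandLine.appendSwitch later may not be enough there.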
Audio Playback Pipeline
Once the model generates raw float32 audio samples, the pipeline needs to encode them into a playable format and route them to the audio device.
WAV Encoding
The simplest portable format is WAV — PCM 16-bit, mono, 24kHz (matching Kokoro’s native rate):
// src/main/wav-encoder.ts
export function encodeWav(samples: Float32Array, sampleRate: number = 24000): Buffer {
  const numChannels = 1;
  const bitsPerSample = 16;
  const byteRate = sampleRate * numChannels * (bitsPerSample / 8);
  const blockAlign = numChannels * (bitsPerSample / 8);
  const dataSize = samples.length * (bitsPerSample / 8);
  const buffer = Buffer.alloc(44 + dataSize);
  // RIFF header
  buffer.write('RIFF', 0);
  buffer.writeUInt32LE(36 + dataSize, 4);
  buffer.write('WAVE', 8);
  // fmt chunk
  buffer.write('fmt ', 12);
  buffer.writeUInt32LE(16, 16); // chunk size
  buffer.writeUInt16LE(1, 20); // PCM format
  buffer.writeUInt16LE(numChannels, 22);
  buffer.writeUInt32LE(sampleRate, 24);
  buffer.writeUInt32LE(byteRate, 28);
  buffer.writeUInt16LE(blockAlign, 32);
  buffer.writeUInt16LE(bitsPerSample, 34);
  // data chunk
  buffer.write('data', 36);
  buffer.writeUInt32LE(dataSize, 40);
  // Convert float32 [-1, 1] to int16
  for (let i = 0; i < samples.length; i++) {
    const sample = Math.max(-1, Math.min(1, samples[i]));
    // Round so writeInt16LE always receives an integer
    const intSample = Math.round(sample < 0 ? sample * 0x8000 : sample * 0x7fff);
    buffer.writeInt16LE(intSample, 44 + i * 2);
  }
  return buffer;
}
Audio Output
For audio output in the main process, use the speaker package (published from the node-speaker repository, a native module that writes PCM to the system audio output) or write to a WAV buffer and play it from the renderer via the Web Audio API:
// src/main/audio-player.ts
import Speaker from 'speaker';
import { encodeWav } from './wav-encoder';
export class AudioPlayer {
  private speaker: Speaker | null = null;
  play(samples: Float32Array, sampleRate: number = 24000): Promise<void> {
    return new Promise((resolve, reject) => {
      const wavBuffer = encodeWav(samples, sampleRate);
      this.speaker = new Speaker({
        channels: 1,
        bitDepth: 16,
        sampleRate,
        signed: true,
      });
      this.speaker.on('close', resolve);
      this.speaker.on('error', reject);
      // Skip the 44-byte WAV header; the speaker expects raw PCM
      this.speaker.end(wavBuffer.subarray(44));
    });
  }
  stop(): void {
    if (this.speaker) {
      this.speaker.close(false); // false: do not flush queued audio
      this.speaker = null;
    }
  }
}
Alternatively, send the WAV buffer to the renderer and use the Web Audio API for playback. This gives the UI full control over seeking, looping, and visualization:
// Renderer side (React)
import { useEffect } from 'react';
function AudioPlayer({ audioBuffer }: { audioBuffer: ArrayBuffer }) {
  useEffect(() => {
    const ctx = new AudioContext();
    let cancelled = false;
    // decodeAudioData detaches the buffer it is given, so decode a copy
    ctx.decodeAudioData(audioBuffer.slice(0))
      .then((decoded) => {
        if (cancelled) return;
        const source = ctx.createBufferSource();
        source.buffer = decoded;
        source.connect(ctx.destination);
        source.start(0);
      })
      .catch((err) => console.error('Failed to decode audio', err));
    // Closing the context stops playback when the buffer changes or the component unmounts
    return () => {
      cancelled = true;
      void ctx.close();
    };
  }, [audioBuffer]);
  return null;
}
The Web Audio API approach is better for most apps because it enables visualization (waveform, spectrogram), volume control, and seeking without additional IPC calls.
Chunked Generation for Long Text
TTS models have a maximum input token length. For Kokoro-82M, the limit is approximately 512 tokens (~400 words). Exceeding this produces truncated or degraded audio.
Chunked generation splits long text into segments, generates audio for each, and concatenates the results. The renderer shows a progress bar while the main process works through chunks.
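Concatenation itself is plain sample copying. A minimal sketch, assuming mono Float32Array chunks at a shared sample rate and an arbitrary 150 ms of silence between chunks so sentence boundaries do not run together (the `concatAudioChunks` name and the gap length are illustrative choices):

```typescript
function concatAudioChunks(
  chunks: Float32Array[],
  sampleRate = 24000,
  gapMs = 150,
): Float32Array {
  const gapSamples = Math.round((sampleRate * gapMs) / 1000);
  const totalGap = Math.max(0, chunks.length - 1) * gapSamples;
  const totalLength = chunks.reduce((n, c) => n + c.length, 0) + totalGap;
  // Float32Array is zero-filled on creation, so the gaps are already silence.
  const out = new Float32Array(totalLength);
  let offset = 0;
  chunks.forEach((chunk, i) => {
    out.set(chunk, offset);
    offset += chunk.length + (i < chunks.length - 1 ? gapSamples : 0);
  });
  return out;
}
```

The combined array can then be encoded once as a single WAV file rather than encoding each chunk separately.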
Sentence-Aware Splitting
// src/main/text-chunker.ts
export function splitIntoSentences(text: string): string[] {
  // Keep the delimiter attached to each sentence. The "|$" branch also captures
  // trailing text that has no closing punctuation, which would otherwise be dropped.
  const sentenceRegex = /[^.!?]+(?:[.!?]+|$)/g;
  const matches = text.match(sentenceRegex);
  if (!matches) {
    // Fallback: split by character count
    return splitByCharLimit(text, 400);
  }
  // Merge short sentences to avoid too-small chunks (limit is in words)
  const merged: string[] = [];
  let current = '';
  for (const sentence of matches) {
    if ((current + sentence).split(/\s+/).length > 400) {
      if (current) merged.push(current.trim());
      current = sentence;
    } else {
      current += sentence;
    }
  }
  if (current) merged.push(current.trim());
  return merged.length > 0 ? merged : splitByCharLimit(text, 400);
}
function splitByCharLimit(text: string, limit: number): string[] {
  const chunks: string[] = [];
  for (let i = 0; i < text.length; i += limit) {
    chunks.push(text.slice(i, i + limit));
  }
  return chunks;
}
Chunked Generation with Progress
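// src/main/tts-engine.ts
import * as ort from 'onnxruntime-node';
// splitIntoSentences comes from text-chunker.ts and must be exported there
import { splitIntoSentences } from './text-chunker';
export class TtsEngine {
  private currentSession: ort.InferenceSession | null = null;
  private cancelled = false;
  // Voice name to model-specific id; populated when a model is loaded
  private voiceIds: Record<string, number> = {};
  async *generateChunked(
    text: string,
    voice: string
  ): AsyncGenerator<{ audio: Float32Array; progress: number; chunkIndex: number }> {
    const chunks = splitIntoSentences(text);
    this.cancelled = false;
    for (let i = 0; i < chunks.length; i++) {
      if (this.cancelled) break;
      const audio = await this.generateChunk(chunks[i], voice);
      const progress = (i + 1) / chunks.length;
      yield { audio, progress, chunkIndex: i };
    }
  }
  private async generateChunk(text: string, voice: string): Promise<Float32Array> {
    if (!this.currentSession) throw new Error('No model loaded');
    // Tokenize and run ONNX inference. The input and output names used here
    // depend on how the model was exported; check session.inputNames for yours.
    const inputIds = this.tokenize(text);
    const feeds = {
      // int64 tensors require BigInt64Array data
      input_ids: new ort.Tensor('int64', BigInt64Array.from(inputIds, (id) => BigInt(id)), [1, inputIds.length]),
      voice_id: new ort.Tensor('int64', BigInt64Array.from([BigInt(this.voiceIds[voice])]), [1]),
    };
    const results = await this.currentSession.run(feeds);
    return results['audio'].data as Float32Array;
  }
  // Model-specific tokenizer, omitted here
  private tokenize(_text: string): number[] {
    throw new Error('tokenize() must be implemented for the chosen model');
  }
  cancel(): void {
    this.cancelled = true;
  }
}
Renderer Progress Handling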
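// Renderer (React)
import { useEffect, useRef, useState } from 'react';
function TtsGenerator() {
  const [progress, setProgress] = useState(0);
  const [status, setStatus] = useState<'idle' | 'generating' | 'done'>('idle');
  const unsubscribeRef = useRef<(() => void) | null>(null);
  // Remove the progress listener if the component unmounts mid-generation
  useEffect(() => () => unsubscribeRef.current?.(), []);
  function handleGenerate(text: string) {
    setStatus('generating');
    setProgress(0);
    const { onProgress } = window.tts.generateChunked(text, 'af_heart');
    unsubscribeRef.current?.(); // drop any listener left over from a previous run
    unsubscribeRef.current = onProgress((p: number) => {
      setProgress(Math.round(p * 100));
      // Mark completion only once the last chunk has been reported
      if (p >= 1) setStatus('done');
    });
  }
  return (
    <div>
      {status === 'generating' && (
        <progress value={progress} max={100} />
      )}
    </div>
  );
}
The key point: chunked generation keeps the renderer responsive because ONNX inference runs outside it, and the main process can handle IPC messages such as cancel between chunks. Depending on the onnxruntime-node version, a single session.run() call may occupy the main process's JavaScript thread for its duration; if that becomes noticeable, move inference to a worker thread or utility process.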
Text Preprocessing in the Renderer
Raw user input is rarely clean enough for direct TTS inference. The renderer handles preprocessing so the main process receives normalized text.
SSML Stripping
If you accept SSML input, strip it in the renderer before sending to the model:
// src/renderer/text-processor.ts
export function stripSsml(input: string): string {
  return input
    .replace(/<[^>]*>/g, '') // Remove all tags
    .replace(/&lt;/g, '<')
    .replace(/&gt;/g, '>')
    .replace(/&quot;/g, '"')
    .replace(/&apos;/g, "'")
    .replace(/&amp;/g, '&') // Decode &amp; last so "&amp;lt;" is not decoded twice
    .replace(/\s+/g, ' ') // Collapse whitespace
    .trim();
}
Number and Abbreviation Expansion
TTS models handle numbers and abbreviations inconsistently. Normalizing them in the renderer gives consistent results:
const numberMap: Record<string, string> = {
  '0': 'zero', '1': 'one', '2': 'two', '3': 'three', '4': 'four',
  '5': 'five', '6': 'six', '7': 'seven', '8': 'eight', '9': 'nine',
  '10': 'ten',
  // ... extend as needed
};
export function normalizeNumbers(text: string): string {
  return text.replace(/\b\d+\b/g, (match) => {
    // Keep large numbers as digits (they sound better)
    if (match.length > 3) return match;
    // Spell out small numbers
    return numberMap[match] || match;
  });
}
Contraction Expansion
const contractions: Record<string, string> = {
  "don't": 'do not',
  "can't": 'cannot',
  "won't": 'will not',
  "it's": 'it is',
  "I'm": 'I am',
  "you're": 'you are',
  "they're": 'they are',
  "we're": 'we are',
  "that's": 'that is',
  "there's": 'there is',
  "what's": 'what is',
  "let's": 'let us',
};
export function expandContractions(text: string): string {
  let result = text;
  for (const [contracted, expanded] of Object.entries(contractions)) {
    const regex = new RegExp(`\\b${contracted}\\b`, 'gi');
    // Preserve a leading capital: "Don't" becomes "Do not"
    result = result.replace(regex, (match) =>
      match[0] === match[0].toUpperCase()
        ? expanded[0].toUpperCase() + expanded.slice(1)
        : expanded
    );
  }
  return result;
}
Full Preprocessing Pipeline
export function preprocessText(input: string): string {
  let text = input;
  // Order matters
  text = stripSsml(text);
  text = normalizeUnicode(text); // Smart quotes → straight quotes
  text = expandContractions(text);
  text = normalizeNumbers(text);
  text = text.replace(/[–—]/g, '—'); // Em dash normalization
  text = text.replace(/\n{3,}/g, '\n\n'); // Collapse excessive newlines
  return text.trim();
}
Keep preprocessing lightweight since it runs synchronously in the renderer. Heavy NLP pipelines belong in the main process or a worker thread.
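The pipeline calls normalizeUnicode without showing it. One plausible shape for it, offered as an assumption rather than the original implementation, maps typographic quotes and a few other common characters to ASCII equivalents:

```typescript
// Hypothetical implementation of the normalizeUnicode step.
function normalizeUnicode(text: string): string {
  return text
    .normalize('NFC') // compose accented characters into single code points
    .replace(/[\u2018\u2019\u201A\u201B]/g, "'") // curly single quotes to '
    .replace(/[\u201C\u201D\u201E\u201F]/g, '"') // curly double quotes to "
    .replace(/\u2026/g, '...') // horizontal ellipsis to three dots
    .replace(/\u00A0/g, ' '); // no-break space to a regular space
}
```

Running this before contraction expansion matters because the contraction table is keyed on straight apostrophes, so a curly apostrophe would otherwise slip through unexpanded.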
Performance: Bundled Model vs API Call
Example measurements from an M4 Mac mini (24GB) vs a leading cloud TTS API:
| Scenario | Local Model (Kokoro-82M, CoreML) | Cloud API (us-east-1) | Cloud API (ap-southeast-1) |
|---|---|---|---|
| First audio byte (100 chars) | ~85ms | ~320ms | ~480ms |
| Generate 500 words | ~0.9s | ~1.2s | ~1.8s |
| Generate 5000 words (chunked) | ~9s | ~12s | ~17s |
| Cost per 1M chars | $0 | $15-$160 | $15-$160 |
| Offline capable | Yes | No | No |
| Privacy | Complete | None | None |
| Install size impact | +160MB | 0MB | 0MB |
In this setup, local inference came out ahead of the cloud API on every row except install size. The gap can widen for users far from cloud provider regions, where network latency adds directly to the wait for first audio.
The one area where cloud APIs still excel: voice variety. Cloud providers have dozens of voices trained on proprietary datasets. Local models typically ship 2-8 voice presets. If your app requires 50+ distinct voices, cloud may be easier — though voice cloning models are closing this gap quickly.
Common Pitfalls
ASAR and Native Modules
ONNX Runtime is a native Node.js addon. It must be in the node_modules of the unpacked app, not inside the ASAR archive. Configure electron-builder to unpack it:
# electron-builder.yml
asar: true
asarUnpack:
  - "node_modules/onnxruntime-node/**"
  - "models/**"
Model Quantization
Float32 ONNX models use 4 bytes per weight. Float16 uses 2 bytes at minimal quality loss. INT8 quantization can reduce size by 4x; static quantization requires calibration data, while dynamic quantization does not. For Electron apps, float16 is the best tradeoff: half the size of float32, negligible quality difference, and broad hardware support.
# Convert to float16 ONNX (Python; requires the onnx and onnxconverter-common packages)
import onnx
from onnxconverter_common import float16
model = onnx.load("model_fp32.onnx")
model_fp16 = float16.convert_float_to_float16(model)
onnx.save(model_fp16, "model_fp16.onnx")
Process Architecture
Do not run the model in the renderer. The renderer is a Chromium process with limited memory and no native addon support for ONNX Runtime. If you try, you get obscure V8 errors about missing native bindings.
Audio Device Selection
On macOS, CoreAudio handles default device routing automatically. On Windows, you may need to let users select their output device. Use win-audio or enumerate devices through navigator.mediaDevices.enumerateDevices() in the renderer.
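For the renderer-side route, the device list needs filtering down to outputs before it reaches a picker. A minimal sketch, assuming a hypothetical `listAudioOutputs` helper that works on a structural subset of the browser's MediaDeviceInfo so it can be tested without a browser:

```typescript
// Structural subset of MediaDeviceInfo; real objects from enumerateDevices() satisfy it.
interface DeviceLike {
  deviceId: string;
  kind: string;
  label: string;
}

// Keeps only audio outputs and supplies a placeholder when the label is empty.
function listAudioOutputs(devices: DeviceLike[]): { id: string; label: string }[] {
  return devices
    .filter((d) => d.kind === 'audiooutput')
    .map((d, i) => ({ id: d.deviceId, label: d.label || `Output ${i + 1}` }));
}
```

In the renderer this could be fed with the result of `navigator.mediaDevices.enumerateDevices()`, and the chosen id passed to `setSinkId()` on the media element used for playback. Labels may come back empty until the page has been granted media permission, which is why the placeholder is there.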
Comparison: Spokio (Native) vs Electron
If you are building a Mac-only TTS app, Electron adds unnecessary overhead. The Electron shell itself consumes approximately 150-250MB of RAM before any model is loaded. Native Swift apps on Apple Silicon use 10-20MB for the same UI.
Spokio is a native macOS TTS application powered by Chatterbox Turbo and built without Electron. It runs offline on Apple Silicon and Intel Macs, supports local voice cloning from short samples, and avoids cloud uploads for text, audio, or voice samples. The native approach eliminates Chromium runtime overhead while integrating more directly with macOS audio and file export workflows.
The Electron approach makes sense when you need cross-platform deployment (Windows, Linux, macOS) from a single codebase. If your target is exclusively macOS, native Swift is lighter, faster, and integrates more deeply with system audio routing and accessibility APIs.
