This guide walks through a complete Kokoro TTS voiceover workflow on a Mac — from installation to a finished set of audio files ready to drop into your video editor. After setup, generation runs locally without API keys, cloud uploads, or a subscription service.
What you will have by the end
A Python script that takes a text file and outputs named WAV files per section, plus MP3 copies ready for YouTube. Run it once per video, or wire it into a batch folder workflow.
Before you start
- A Mac (Intel or Apple Silicon — both work)
- Internet for the one-time install
- About 10 minutes
Step 1: Install Kokoro
Open Terminal and create a project folder:
```bash
mkdir ~/youtube-voiceover && cd ~/youtube-voiceover
python3 -m venv venv
source venv/bin/activate
```
Install Kokoro and its dependencies:
```bash
pip install kokoro soundfile
```
Kokoro downloads the model weights automatically on first use. The model files live in your Hugging Face cache (~/.cache/huggingface/) and weigh roughly 300 MB.
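If you want to confirm how much disk space the download actually took, a short standard-library sketch can total the files under the cache folder. The helper name `dir_size_mb` is made up for this example, and the path in the usage comment assumes the default cache location mentioned above; yours may differ if you have configured Hugging Face differently.

```python
from pathlib import Path

def dir_size_mb(folder):
    """Return the total size of regular files under folder, in megabytes."""
    root = Path(folder).expanduser()
    if not root.exists():
        return 0.0
    # Skip symlinks so that linked copies of the same file are not counted twice.
    total_bytes = sum(
        p.stat().st_size
        for p in root.rglob("*")
        if p.is_file() and not p.is_symlink()
    )
    return total_bytes / 1_000_000

# Example (assumes the default cache location):
# print(f"{dir_size_mb('~/.cache/huggingface'):.0f} MB")
```

Run it after the first generation, since the weights are only fetched on first use.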
Step 2: Test with a single line
Create test.py:
```python
from kokoro import KPipeline
import soundfile as sf

pipeline = KPipeline(lang_code="a")
generator = pipeline("Welcome to today's video. Let me show you how this works.", voice="af_heart")

# Each item is a tuple whose third element is the audio for one chunk
for i, (_, _, audio) in enumerate(generator):
    sf.write(f"test_{i}.wav", audio, 24000)
print("Done - check test_0.wav")
```
Run it:
```bash
python test.py
```
Play test_0.wav. If you hear speech, the pipeline works.
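Besides listening, you can sanity-check the file with Python's built-in `wave` module. This is a minimal sketch, assuming the file is plain PCM WAV (which is what soundfile writes for .wav by default, as far as I know); `wav_info` is a hypothetical helper name, not part of Kokoro.

```python
import wave

def wav_info(path):
    """Return (sample_rate_hz, duration_seconds) for a PCM WAV file."""
    with wave.open(path, "rb") as wav:
        rate = wav.getframerate()
        frames = wav.getnframes()
    return rate, frames / rate

# Example:
# rate, seconds = wav_info("test_0.wav")
# print(f"{rate} Hz, {seconds:.1f} s")
```

A rate of 24000 matches the sample rate passed to `sf.write`, and a duration of a few seconds is about right for the two test sentences.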
Step 3: Load a script from a file
YouTubers write scripts in text files, not Python strings. Save your script as script.txt:
```text
Welcome back to the channel. Today we are breaking down the three tools every creator needs.
First up is screen recording software. You want something lightweight that captures at 60 frames per second.
Second is an audio tool. Bad audio kills a video faster than bad footage.
Third is a local text-to-speech engine so you can iterate on voiceovers without re-recording.
That wraps up this video. Drop a comment with your favorite tool and I will see you in the next one.
```
Now load it from Python in load_script.py:
```python
from kokoro import KPipeline
import soundfile as sf

with open("script.txt", encoding="utf-8") as f:
    text = f.read()

pipeline = KPipeline(lang_code="a")
generator = pipeline(text, voice="af_heart")

count = 0
for i, (_, _, audio) in enumerate(generator):
    sf.write(f"full_{i}.wav", audio, 24000)
    count += 1
print(f"Generated {count} file(s)")
```
Kokoro's KPipeline splits longer text into chunks, and this loop writes one file per chunk. How many files a short script like this produces depends on how your installed version splits the text. Line breaks may also start a new chunk, so expect one or more files.
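Hard-wrapped lines in a script file can affect how the text is chunked, assuming your Kokoro version treats line breaks as split points (worth verifying against your install). The sketch below joins the lines of each paragraph into a single line and keeps blank lines as paragraph breaks, so chunk boundaries follow your paragraphs. `normalize_script` and `load_script` are hypothetical helper names for this example.

```python
import re
from pathlib import Path

def normalize_script(text):
    """Join hard-wrapped lines into paragraphs separated by one blank line."""
    text = text.replace("\r\n", "\n").strip()
    paragraphs = re.split(r"\n\s*\n", text)
    # Collapse every run of whitespace inside a paragraph to a single space.
    cleaned = [" ".join(p.split()) for p in paragraphs if p.strip()]
    return "\n\n".join(cleaned)

def load_script(path):
    """Read a UTF-8 script file and normalize its whitespace."""
    return normalize_script(Path(path).read_text(encoding="utf-8"))

# Example:
# text = load_script("script.txt")
```

Reading with an explicit UTF-8 encoding also avoids surprises with curly quotes pasted from a word processor.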
Step 4: Pick a voice
Kokoro ships with voices bundled in voice packs. Change the voice parameter to switch.
| Voice ID | Style |
|---|---|
| af_heart | Warm female, friendly |
| af_bella | Bright female, energetic |
| af_nicole | Calm female, measured |
| af_sarah | Neutral female, clear |
| am_adam | Deep male, authoritative |
| am_michael | Warm male, conversational |
| am_liam | Young male, casual |
| am_onyx | Rich male, documentary tone |
| bf_emma | British female |
| bm_george | British male |
| bm_lewis | British male, younger |
Test a few:
```python
from kokoro import KPipeline
import soundfile as sf

voices = ["af_heart", "af_bella", "am_adam", "am_liam"]
pipeline = KPipeline(lang_code="a")
for v in voices:
    gen = pipeline("This is a voice preview for YouTube narration.", voice=v)
    # A single short sentence yields one chunk, so each voice gets one file
    for i, (_, _, audio) in enumerate(gen):
        sf.write(f"preview_{v}.wav", audio, 24000)
```
Pick the one that matches your channel tone and use it consistently.
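The preview loop uses `lang_code="a"` for every voice, which suits the American voices (`af_`, `am_`). For the British voices in the table (`bf_`, `bm_`), the matching language code is `"b"`. The sketch below rests on the assumption that the first letter of a voice ID is its language code, which holds for the voices listed here; `lang_code_for` and `group_by_lang` are hypothetical helpers, not Kokoro functions.

```python
def lang_code_for(voice_id):
    """Return the first letter of a voice ID, e.g. 'bf_emma' gives 'b'."""
    prefix, _, name = voice_id.partition("_")
    if len(prefix) != 2 or not name:
        raise ValueError(f"Unexpected voice ID format: {voice_id!r}")
    return prefix[0]

def group_by_lang(voice_ids):
    """Group voice IDs by language code so each group can share one pipeline."""
    groups = {}
    for voice_id in voice_ids:
        groups.setdefault(lang_code_for(voice_id), []).append(voice_id)
    return groups

# Example: build one KPipeline(lang_code=code) per key, then loop over its voices
# for code, ids in group_by_lang(["af_heart", "bf_emma"]).items():
#     print(code, ids)
```

Grouping first means you create one pipeline per language instead of one per voice.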
Step 5: Section-based generation
A YouTube video has a structure — hook, body, outro. Generating separate files per section makes editing easier. When you revise the hook, you regenerate a single file.
Write your sections with ## markers — script_sections.txt:
```text
##hook
Ever feel like your videos take too long to produce? Here is the fix.
##body
Start with a template. Every video I make follows the same four-part structure: hook, setup, demonstration, and call to action.
The template saves about two hours per video because I am not staring at a blank timeline.
Next batch your voiceover. Generate all your narration at once, then edit. Do not generate line by line.
##outro
That is the workflow. Try it on your next video and let me know if it saves you time.
```
Now split and generate per section:
```python
from kokoro import KPipeline
import soundfile as sf

with open("script_sections.txt") as f:
    raw = f.read()

# Collect the lines under each ## marker
sections = {}
current = None
for line in raw.split("\n"):
    if line.startswith("##"):
        current = line[2:].strip()
        sections[current] = []
    elif current and line.strip():
        sections[current].append(line.strip())

pipeline = KPipeline(lang_code="a")
voice = "af_heart"
for name, lines in sections.items():
    text = " ".join(lines)
    gen = pipeline(text, voice=voice)
    for i, (_, _, audio) in enumerate(gen):
        sf.write(f"{name}_{i}.wav", audio, 24000)
        print(f"Wrote {name}_{i}.wav")
print("All sections generated.")
```
You now have hook_0.wav, body_0.wav, and outro_0.wav, ready to drop into your timeline.
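The splitting loop above silently drops any text that appears before the first `##` marker, and a repeated section name overwrites the earlier one. If you would rather catch those cases, here is one possible stricter parser. The name `parse_sections`, the fallback section name "intro", and the error behavior are all choices made for this sketch.

```python
def parse_sections(raw):
    """Split a ##-marked script into {section_name: text}.

    Lines before the first marker are collected under "intro", and a
    repeated section name raises an error instead of overwriting.
    """
    sections = {}
    current = "intro"
    for line in raw.splitlines():
        stripped = line.strip()
        if stripped.startswith("##"):
            current = stripped[2:].strip()
            if not current:
                raise ValueError("Section marker without a name")
            if current in sections:
                raise ValueError(f"Duplicate section name: {current}")
            sections[current] = []
        elif stripped:
            sections.setdefault(current, []).append(stripped)
    # Join each section's lines into one string, as the loop above does.
    return {name: " ".join(lines) for name, lines in sections.items()}
```

The returned dictionary keeps the section order of the file, so iterating over it generates audio in script order.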
Step 6: Convert to MP3 for YouTube
YouTube handles MP3 well, and MP3 files are smaller than WAV. Use pydub with ffmpeg:
```bash
brew install ffmpeg
pip install pydub
```
Then convert all WAV files in your folder to MP3 with convert_to_mp3.py:
```python
import os
from pydub import AudioSegment

# Convert every WAV file in the current folder
for f in os.listdir("."):
    if f.endswith(".wav"):
        audio = AudioSegment.from_wav(f)
        mp3_name = f.replace(".wav", ".mp3")
        audio.export(mp3_name, format="mp3", bitrate="192k")
        print(f"Converted {f} → {mp3_name}")
```
Import the MP3s into Final Cut Pro, DaVinci Resolve, or Premiere Pro. Section-based files let you swap audio without re-syncing the whole timeline.
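If you prefer to skip pydub, the same conversion can be sketched by calling ffmpeg directly. This is an alternative under stated assumptions, not the method used elsewhere in this guide: it assumes `ffmpeg` is on your PATH and that your build can encode MP3. The function names are hypothetical; the flags used (`-y` to overwrite, `-i` for input, `-b:a` for audio bitrate) are standard ffmpeg options.

```python
import subprocess
from pathlib import Path

def ffmpeg_mp3_command(wav_path, bitrate="192k"):
    """Build the ffmpeg argument list that converts one WAV file to MP3."""
    wav = Path(wav_path)
    mp3 = wav.with_suffix(".mp3")  # swaps only the final extension
    return ["ffmpeg", "-y", "-i", str(wav), "-b:a", bitrate, str(mp3)]

def convert_folder(folder="."):
    """Convert every .wav file in folder; raises if ffmpeg reports an error."""
    for wav in sorted(Path(folder).glob("*.wav")):
        subprocess.run(ffmpeg_mp3_command(wav), check=True)

# Example:
# convert_folder(".")
```

Building the command as a list avoids shell quoting problems with file names that contain spaces.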
Step 7: Handle long scripts (beyond token limit)
Kokoro has a context of roughly 512 tokens. Scripts longer than about 400 words may be split into chunks by KPipeline, but you can control chunking explicitly for better pacing.
Use paragraph breaks as chunk boundaries:
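```python
from kokoro import KPipeline
import soundfile as sf

with open("script.txt", encoding="utf-8") as f:
    text = f.read()

# One chunk per paragraph; blank lines mark the boundaries
chunks = [p.strip() for p in text.split("\n\n") if p.strip()]
pipeline = KPipeline(lang_code="a")
for idx, chunk in enumerate(chunks):
    gen = pipeline(chunk, voice="af_heart")
    for i, (_, _, audio) in enumerate(gen):
        sf.write(f"chunk_{idx:03d}_{i}.wav", audio, 24000)
```
Numbering the files chunk_000_0.wav through chunk_NNN_0.wav keeps the ordering predictable in your editor.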
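Paragraph breaks only help if each paragraph is itself short enough. One way to handle an oversized paragraph is to split it further at sentence ends, packing sentences until a word budget is reached. The sketch below uses the "about 400 words" figure from this step as its default; the function name, the budget, and the simple sentence regex are all assumptions, and a single sentence longer than the budget is left as one chunk.

```python
import re

def chunk_paragraphs(text, max_words=400):
    """Split text at blank lines, then split oversized paragraphs at sentence ends."""
    chunks = []
    for paragraph in re.split(r"\n\s*\n", text):
        paragraph = " ".join(paragraph.split())
        if not paragraph:
            continue
        if len(paragraph.split()) <= max_words:
            chunks.append(paragraph)
            continue
        # Greedily pack whole sentences until the word budget would be exceeded.
        current, count = [], 0
        for sentence in re.split(r"(?<=[.!?])\s+", paragraph):
            words = len(sentence.split())
            if current and count + words > max_words:
                chunks.append(" ".join(current))
                current, count = [], 0
            current.append(sentence)
            count += words
        if current:
            chunks.append(" ".join(current))
    return chunks
```

You could pass each returned chunk to the pipeline in place of the paragraph list, keeping the same numbered file names.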
Step 8: Full automation script
Save this as render.py and use it for repeat video workflows:
```python
import sys
from kokoro import KPipeline
from pydub import AudioSegment
import soundfile as sf

if len(sys.argv) < 2:
    print("Usage: python render.py script.txt [voice]")
    sys.exit(1)

script_path = sys.argv[1]
voice = sys.argv[2] if len(sys.argv) > 2 else "af_heart"

with open(script_path) as f:
    text = f.read()

# Section split on ## markers
sections = {}
current = None
for line in text.split("\n"):
    if line.startswith("##"):
        current = line[2:].strip()
        sections[current] = []
    elif current and line.strip():
        sections[current].append(line.strip())

pipeline = KPipeline(lang_code="a")
wav_files = []
for name, lines in sections.items():
    content = " ".join(lines)
    gen = pipeline(content, voice=voice)
    for i, (_, _, audio) in enumerate(gen):
        fname = f"{name}_{i}.wav"
        sf.write(fname, audio, 24000)
        wav_files.append(fname)
        print(f" Created {fname}")

# Convert to MP3
print("\nConverting to MP3 ...")
for wav in wav_files:
    segment = AudioSegment.from_wav(wav)
    mp3 = wav.replace(".wav", ".mp3")
    segment.export(mp3, format="mp3", bitrate="192k")
    print(f" {wav} → {mp3}")

print(f"\nDone. Created {len(wav_files)} audio file(s).")
```
Usage:
```bash
python render.py script_sections.txt af_heart
```
It reads a ##-marked script, generates WAV per section, converts each to MP3, and prints what it created. After installation, the generation and conversion steps run locally.
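If render.py grows more options, positional `sys.argv` handling gets awkward. A possible next step is `argparse` from the standard library. The flag names below (`--voice`, `--out-dir`, `--no-mp3`) are illustrative and not part of the script above; you would still need to wire them into the generation and conversion loops yourself.

```python
import argparse

def build_parser():
    """Command-line options for a render script; flag names are illustrative."""
    parser = argparse.ArgumentParser(description="Render a ##-marked script to audio.")
    parser.add_argument("script", help="path to the script text file")
    parser.add_argument("--voice", default="af_heart", help="Kokoro voice ID")
    parser.add_argument("--out-dir", default=".", help="folder for generated files")
    parser.add_argument("--no-mp3", action="store_true", help="skip MP3 conversion")
    return parser

# Example:
# args = build_parser().parse_args()
# print(args.script, args.voice, args.out_dir, args.no_mp3)
```

A side benefit is that `python render.py --help` then prints usage text without any extra code.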
Step 9: Pro tips for YouTube voiceover quality
Punctuation controls pacing
Kokoro respects punctuation. Periods create longer pauses than commas. Use sentence fragments and periods to add space:
```text
Good. Now watch this.
```
Compare with:
```text
Good, now watch this.
```
The period version has more dramatic pacing.
Pronunciation tweaks
Kokoro mispronounces product names and acronyms sometimes. Work around it with phonetic spelling:
```python
# Instead of "Kapwing"
text = text.replace("Kapwing", "Cap-wing")
# Instead of "RTX 4090"
text = text.replace("RTX 4090", "are tee ex 4090")
# Instead of "Canva" (illustrative respelling; adjust by ear)
text = text.replace("Canva", "Can-vah")
```
Build a replacements dictionary at the top of your script:
```python
fixes = {
    "Kokoro": "Ko-ko-ro",
    "KPipeline": "K-Pipeline",
    "RTX": "are tee ex",
    "UI": "U I",
    "README": "read me",
}
# Apply every replacement to the script text before generating audio
for wrong, right in fixes.items():
    text = text.replace(wrong, right)
```
Silence trimming
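Remove silence from each clip so it sits cleanly on the timeline. Note that pydub's strip_silence also shortens pauses inside the clip that last at least silence_len milliseconds, so keep that value above your longest intentional pause:
```python
from pydub import AudioSegment

audio = AudioSegment.from_wav("hook_0.wav")
# silence_thresh is in dBFS; padding (ms) keeps a little room around speech
trimmed = audio.strip_silence(silence_len=1000, silence_thresh=-50, padding=50)
trimmed.export("hook_0_trimmed.wav", format="wav")
```
Consistent voice across a series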
Pick one voice and stay with it. Viewers subconsciously associate a voice with a channel. Kokoro's output is generally stable for the same text and voice ID, so if you accidentally delete a clip you can regenerate it and expect a close match.
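One more note on the pronunciation tweaks from earlier in this step: chained `str.replace` calls also rewrite matches inside longer words, so a rule for "UI" would turn "GUI" into "GU I". A whole-word variant can be sketched with the `re` module. `apply_fixes` is a hypothetical helper, and the `\b` word boundaries assume your keys start and end with letters or digits.

```python
import re

def apply_fixes(text, fixes):
    """Replace whole words only, longest keys first, matching case exactly."""
    if not fixes:
        return text
    # Longest keys first so "RTX 4090" wins over a shorter "RTX" rule.
    keys = sorted(fixes, key=len, reverse=True)
    pattern = re.compile(r"\b(?:" + "|".join(re.escape(k) for k in keys) + r")\b")
    # A single pass means replacement text is never itself rewritten.
    return pattern.sub(lambda match: fixes[match.group(0)], text)

# Example:
# text = apply_fixes(text, {"UI": "U I", "RTX": "are tee ex"})
```

Because everything happens in one pass, the order of entries in the dictionary no longer matters.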
Troubleshooting
| Problem | Fix |
|---|---|
| No module named kokoro | Run source venv/bin/activate first |
| Model downloads slowly | The first download is about 300 MB; later runs load the weights from the local cache |
| Audio sounds robotic | Try a different voice; af_heart is generally the smoothest |
| Long script skips sentences | Split at paragraph breaks, not sentences |
| MP3 conversion fails | Run brew install ffmpeg |
| Python 3 not found | Install from python.org or brew install python |
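Several rows in the table come down to a missing piece of the environment, so a quick preflight check can save a round of guessing. The sketch below uses only the standard library. `preflight` is a hypothetical helper, and it only checks that things are present, not that they work.

```python
import importlib.util
import shutil
import sys

def preflight():
    """Return a dict of environment checks for the problems in the table above."""
    return {
        # Inside a virtual environment, sys.prefix differs from sys.base_prefix.
        "venv_active": sys.prefix != sys.base_prefix,
        "kokoro_installed": importlib.util.find_spec("kokoro") is not None,
        "soundfile_installed": importlib.util.find_spec("soundfile") is not None,
        "pydub_installed": importlib.util.find_spec("pydub") is not None,
        "ffmpeg_on_path": shutil.which("ffmpeg") is not None,
    }

# Example: print each check with a pass/fail label
# for name, ok in preflight().items():
#     print(f"{'OK  ' if ok else 'FAIL'} {name}")
```

Run it with the same `python` you use for render.py, since the results depend on which interpreter is active.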
This workflow is self-contained on your Mac after setup. There is no cloud API step in audio generation, which removes the privacy concern of uploading scripts to a remote TTS service. For creators who want a packaged local voiceover workflow without touching Python, Spokio is a Mac app powered by Chatterbox Turbo, with local voice cloning, batch export, and no cloud uploads for text, audio, or voice samples.
