How to convert PDF to MP3 — text-to-speech for audiobook conversion
By ScoutMyTool Editorial Team · Last updated: 2026-05-20
Introduction
I converted a 400-page non-fiction book PDF into an 11-hour MP3 audiobook last month using ElevenLabs and listened to it during my commute over two weeks. Total cost: about $4 in API fees. The narration was indistinguishable from professional audiobook production for casual listening. Two years ago this workflow was either impossible or produced robotic output; in 2026, AI TTS is genuinely good enough to turn any PDF you own into a portable audiobook. This article maps the seven viable tools, the audio and metadata settings that produce a polished MP3, and the legal and practical realities of doing this for personal listening.
PDF-to-MP3 tools compared
| Tool | Cost | Voice quality | Audio output |
|---|---|---|---|
| Google Cloud Text-to-Speech | From $4 per 1M characters for Standard voices (free tier first 4M/month); premium voices such as Studio cost more | Excellent (Studio voices) | MP3 / WAV / OGG |
| Azure Cognitive Services Speech | $1–$16 per 1M characters (tier-dependent) | Excellent (Neural / HD voices) | MP3 / WAV |
| ElevenLabs | Free 10k chars/month; $5+ for paid tiers | Best-in-class for naturalness | MP3 |
| macOS `say` command + ffmpeg | Free | Good (Siri voices) | AIFF → MP3 via ffmpeg |
| NaturalReader | Free tier; $9.99+/mo premium | Free OK; premium excellent | MP3 |
| Coqui TTS (open source) | Free | Good with quality models | WAV → MP3 via ffmpeg |
| Speechify | $11.99–$24/month | Excellent | MP3 |
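To turn the per-character prices in the table into a per-book figure, a rough estimate can be sketched as follows. The function name `estimate_tts_cost` and the default of 2,000 characters per page are assumptions for illustration, not measured values; substitute the character count of your own extracted text and the current rate for your chosen service.

```python
def estimate_tts_cost(pages, chars_per_page=2000, price_per_million=4.0, free_chars=0):
    """Return (total_chars, cost_in_dollars) for narrating a book.

    chars_per_page is an assumed average; count the characters in your
    extracted text for a real figure.
    """
    total_chars = pages * chars_per_page
    # Only characters beyond the free allowance are billed.
    billable = max(0, total_chars - free_chars)
    cost = billable * price_per_million / 1_000_000
    return total_chars, round(cost, 2)


# A 400-page book at $4 per 1M characters, with no free allowance.
chars, cost = estimate_tts_cost(400)
print(chars, cost)  # 800000 3.2
```

Passing `free_chars=4_000_000` models a 4M-character monthly allowance, in which case the same book comes out at zero cost.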
Step by step — turn a PDF book into a chapter-split MP3 audiobook
- Extract clean text from the PDF. Use the PDF to text tool to extract the text; for scanned PDFs, run Make PDF Searchable first. Open the output text file and remove header/footer junk, page numbers, equation residue, and any other text the TTS should not narrate.
- Split by chapter. Identify chapter boundaries (heading patterns: "Chapter 1", "Part I", large-font lines). Save one text file per chapter with sortable numeric prefix: `01-introduction.txt`, `02-chapter-one.txt`, etc.
- Pick a voice and generate audio per chapter. In ElevenLabs: upload the chapter text, select a voice (audition three or four; commit to one for the whole book), and generate. Repeat for each chapter. In Google Cloud TTS: call the `text:synthesize` REST method with MP3 as the audio encoding; because each request accepts only a limited amount of text (about 5,000 bytes at the time of writing), split long chapters into chunks and join the resulting audio. Local options: macOS `say -o output.aiff -v Samantha < chapter.txt`, then `ffmpeg -i output.aiff output.mp3`.
- Add ID3 metadata. Use Mp3tag (free on Windows; also available for Mac) or `mid3v2 --artist=Author --album=Book --song=Chapter1 --track=1 file.mp3` (a command-line tool from the Python mutagen package; note that the title option is `--song`). Set Genre to "Audiobook" so it sorts correctly in media libraries. Add a cover image extracted from the PDF cover page.
- Sideload to your phone or podcast app. Plug the phone in and copy via USB, or use a cloud-storage app (Dropbox, Drive) to sync. For podcast-app-style chapter navigation, point a personal RSS feed at the MP3s or use an audiobook player (Bound, Smart AudioBook Player, Apple Books with imported file). Open and listen; chapters play in numbered order.
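The clean-up and chapter-splitting steps above can be sketched in Python. This is a minimal heuristic under stated assumptions, not a general-purpose parser: it assumes headings look like "Chapter 1" or "Part I" on a short line of their own, and that a line containing only digits is a page number. The names `split_chapters`, `write_chapters`, and `slugify` are illustrative.

```python
import re
from pathlib import Path

# Assumed heading shapes: "Chapter 1", "Chapter One", "Part I", "Part 2".
HEADING = re.compile(
    r"^(chapter\s+(\d+|[a-z]+)|part\s+([ivxlcdm]+|\d+))\b[^.]*$",
    re.IGNORECASE,
)
# A line that contains only digits is treated as a page number.
PAGE_NUMBER = re.compile(r"^\s*\d+\s*$")


def slugify(title):
    """Lowercase a heading and join its words with hyphens for a filename."""
    return re.sub(r"[^a-z0-9]+", "-", title.lower()).strip("-")


def split_chapters(text):
    """Return (title, body) pairs; text before the first heading is 'front-matter'."""
    chapters = []
    title, lines = "front-matter", []
    for line in text.splitlines():
        if PAGE_NUMBER.match(line):
            continue  # drop bare page numbers
        stripped = line.strip()
        # Only short lines are considered headings, to avoid matching body text.
        if len(stripped) <= 60 and HEADING.match(stripped):
            if any(l.strip() for l in lines):
                chapters.append((title, "\n".join(lines).strip()))
            title, lines = stripped, []
        else:
            lines.append(line)
    if any(l.strip() for l in lines):
        chapters.append((title, "\n".join(lines).strip()))
    return chapters


def write_chapters(text, out_dir):
    """Write one numbered .txt file per chapter and return the paths."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    paths = []
    for i, (title, body) in enumerate(split_chapters(text), start=1):
        path = out / f"{i:02d}-{slugify(title)}.txt"
        path.write_text(body + "\n", encoding="utf-8")
        paths.append(path)
    return paths
```

Calling `write_chapters(Path("book.txt").read_text(encoding="utf-8"), "chapters")` would produce files such as `01-front-matter.txt` and `02-chapter-1.txt`. Check the result by hand, since a body line that happens to start with "Chapter" can be mistaken for a heading.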
Audio quality and listening considerations
Speech audio is forgiving: 64 kbps mono MP3 is fully listenable, 96 kbps is indistinguishable from higher rates for spoken content, and 128 kbps is the highest rate worth using. Anything above that wastes file size without audible benefit; some users mistakenly use 320 kbps stereo for audiobook output and end up with a file of about 1.5 GB for a ten-hour book when roughly 300 MB at 64 kbps would sound equivalent. Stereo adds nothing to narration, because TTS output is a single voice and both channels carry identical audio. Always generate mono for TTS audiobooks.
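The file-size argument is simple arithmetic: for a constant-bitrate MP3, size is bitrate multiplied by duration. A small sketch follows, using the 11-hour book from the introduction as the example; the function name `mp3_size_mb` is illustrative and the result ignores the small overhead of tags and frame headers.

```python
def mp3_size_mb(hours, kbps):
    """Approximate size of a constant-bitrate MP3 in megabytes (1 MB = 1,000,000 bytes)."""
    seconds = hours * 3600
    total_bits = kbps * 1000 * seconds
    return total_bits / 8 / 1_000_000


for kbps in (64, 96, 128, 320):
    print(f"{kbps:>3} kbps: {mp3_size_mb(11, kbps):,.0f} MB")
```

For 11 hours this prints about 317 MB at 64 kbps, 475 MB at 96 kbps, 634 MB at 128 kbps, and 1,584 MB at 320 kbps.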
Pronunciation correction is the other quality lever worth knowing about. TTS engines occasionally mispronounce technical terms, proper nouns, foreign words, or acronyms. Most engines support a pronunciation dictionary or inline SSML tags (Speech Synthesis Markup Language) to override default pronunciation. For high-fidelity output on technical books, build a short pronunciation list during a sample read and apply across the full conversion. A 30-minute calibration on the first chapter saves dozens of jarring mispronunciations across a long book.
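One way to apply a pronunciation list is to wrap each listed term in the SSML `<sub alias="...">` element, which asks the engine to speak the alias in place of the written text. The sketch below is a minimal illustration: the `PRONUNCIATIONS` entries and the function name `to_ssml` are hypothetical, and SSML support varies between engines, so check which tags your service accepts before relying on it.

```python
import re
from xml.sax.saxutils import escape, quoteattr

# Hypothetical pronunciation list built while reviewing the first chapter.
PRONUNCIATIONS = {
    "SQL": "sequel",
    "Nginx": "engine x",
}


def to_ssml(text, pronunciations):
    """Wrap plain text in <speak> and replace listed terms with <sub alias="..."> elements."""
    if not pronunciations:
        return f"<speak>{escape(text)}</speak>"
    # Longest terms first so a longer term wins over a shorter one it contains.
    terms = sorted(pronunciations, key=len, reverse=True)
    pattern = re.compile(r"\b(" + "|".join(re.escape(t) for t in terms) + r")\b")
    parts, last = [], 0
    for m in pattern.finditer(text):
        # Escape the plain text between matches so &, < and > stay valid XML.
        parts.append(escape(text[last:m.start()]))
        term = m.group(1)
        parts.append(f"<sub alias={quoteattr(pronunciations[term])}>{escape(term)}</sub>")
        last = m.end()
    parts.append(escape(text[last:]))
    return "<speak>" + "".join(parts) + "</speak>"


print(to_ssml("Install Nginx & run the SQL script.", PRONUNCIATIONS))
```

Matching is case-sensitive and whole-word only, which keeps "SQL" from being rewritten inside a longer identifier. For engines without SSML support, the same dictionary can drive a plain text substitution before synthesis.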
Legal and ethical considerations
Converting legally owned PDFs to MP3 for personal listening is generally within fair use in the US and analogous frameworks elsewhere. Two areas need care. First, copyright on the underlying book: owning a PDF does not give you the right to distribute audio derivations to others; the personal-use exception is yours alone. Second, voice rights: using ElevenLabs or similar to clone a specific person's voice without permission may violate likeness rights even when the input text is your own. Use the platform's built-in generic voices for safe-by-default audiobooks; reserve voice cloning for content where you own the voice or have permission to use it.
Related reading
- PDF to audio: live text-to-speech and accessibility focus.
- PDF to text: extraction step before TTS.
- Searchable PDF: OCR scanned PDFs before converting.
- PDF for podcasters: podcast-side audio workflows.
- PDF accessibility: TTS as accessibility supplement.
- PDF tools for students: listening to course readings.
- eBook to PDF: when the source is EPUB rather than PDF.
FAQ
- How good does AI-generated audiobook narration sound in 2026?
- Top-tier neural TTS (ElevenLabs, Google Studio voices, Azure HD voices) is close enough to professional human narration that casual listeners rarely identify it as synthetic during normal listening. Pause patterns, intonation, and emphasis are convincingly natural. The remaining gap is in performance quality — a great voice actor brings interpretation and character that even the best TTS cannot match. For technical documents, research papers, and reference content, AI narration is fully adequate. For literary fiction where voice acting matters, professional human narration still wins; for everything else, modern TTS produces a genuinely listenable audiobook.
- Why convert PDF to MP3 instead of using a TTS reader live?
- MP3 is portable and offline. Once generated, the MP3 plays in any podcast app, music app, car stereo, or smart speaker, with no live internet connection and no specific app required. You can listen during a flight, in a tunnel, or while jogging without phone signal. Live TTS readers (Acrobat Read Out Loud, Apple Speak Screen) tie you to the device that has the PDF open and the reader configured. For commuting, exercise, and travel listening, the MP3 conversion creates a reusable artifact. The trade-off is generation time upfront (from minutes for a short document to several hours for a long book) versus zero-setup live listening.
- How do I split a long PDF into chapter-sized MP3 files?
- Two approaches. First, manual: extract PDF text, split by chapter headings (`# Chapter 1`, `# Chapter 2`, etc.), generate one MP3 per chapter, name files in chapter order (`01-intro.mp3`, `02-chapter-one.mp3`). Most podcast apps then play the chapters in order. Second, automated: use a Python or Node script that reads the PDF, detects chapter boundaries via heuristics (large-font lines, "Chapter N" patterns, or PDF bookmarks), and generates one MP3 per chapter with appropriate filename prefix. The chapter-per-file approach is much more usable than a single 8-hour MP3 — listeners can stop and resume per chapter without scrubbing.
- What audio settings should I use for an audiobook MP3?
- Bitrate: 64–96 kbps is plenty for spoken word; higher bitrates waste file size without audible improvement. Sample rate: 22.05 kHz or 44.1 kHz; the former is sufficient for speech, and at a fixed MP3 bitrate the sample rate does not change the file size, which depends only on bitrate and duration. Channels: mono for narration (speech does not benefit from stereo). Format: MP3 (universal compatibility) or AAC / M4B (better for audiobook chapter metadata). A typical 8-hour audiobook at these settings produces a file of roughly 230–350 MB, which fits on any phone and streams cleanly over slow connections.
- Can I add chapter markers and metadata to the MP3 so it behaves like an audiobook?
- Yes, via ID3v2 tags (MP3) or chapter atoms (M4B / AAC). For one-file-per-chapter MP3s, set Title (chapter title), Artist (author name), Album (book title), TrackNumber (chapter sequence), and Genre "Audiobook". For chapter markers within a single file, ID3v2 defines CHAP frames that record each chapter's start and end time in milliseconds (byte offsets are optional); write them with a tagging library such as mutagen (the Python library that provides mid3v2) or a tag editor with chapter support. For best audiobook-app behaviour (Apple Books, Bound, Smart AudioBook Player), use M4B format, which natively supports named chapter markers. Most TTS services let you set basic ID3 during generation; chapter markers usually need a post-processing step.
- Is it legal to convert a copyrighted PDF book to MP3 for personal listening?
- Generally yes for personal use under fair use (US) and similar frameworks elsewhere, provided you legally own the PDF and the conversion is for your own listening, not distribution. Distributing converted audio of copyrighted material to others (uploading to file shares, sharing with a group) is copyright infringement and may also violate licence terms. For books you legally own as PDF (academic publisher PDFs, books purchased from non-DRM bookstores, public-domain works), personal MP3 conversion is on solid legal footing in most jurisdictions. For DRM-protected commercial audiobooks, do not attempt to bypass DRM — the audio version already exists and is the licensed product.
- How long does conversion take for a typical book?
- It depends on the service. Cloud TTS (Google, Azure, ElevenLabs) generates audio roughly 1–5× faster than real time on the service side, so a 10-hour book takes 2–10 hours of API time. Local TTS (macOS `say`, Coqui) also runs faster than real time on a modern laptop; a 10-hour book takes 2–4 hours of local compute. Generate in the background and come back later. Cloud TTS cost works out to roughly $1–$5 per book at moderate volumes; the ElevenLabs free tier (10k characters per month) covers only a few pages, which is enough to audition voices but not to narrate a book.
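The audio settings and basic tags discussed in the answers above can be applied in one pass with ffmpeg. The sketch below only builds the argument list, so it can be inspected before anything is run; the function names and the example file names are illustrative, and it assumes an ffmpeg build that includes the `libmp3lame` encoder.

```python
import subprocess


def ffmpeg_audiobook_args(src, dst, *, title, author, book, track,
                          bitrate="64k", sample_rate=22050):
    """Build an ffmpeg argument list for a mono, speech-oriented MP3 with basic tags."""
    return [
        "ffmpeg", "-y",             # -y overwrites dst if it already exists
        "-i", src,
        "-ac", "1",                 # one audio channel (mono)
        "-ar", str(sample_rate),    # sample rate in Hz
        "-codec:a", "libmp3lame",   # MP3 encoder; must be present in your ffmpeg build
        "-b:a", bitrate,            # audio bitrate
        "-metadata", f"title={title}",
        "-metadata", f"artist={author}",
        "-metadata", f"album={book}",
        "-metadata", f"track={track}",
        "-metadata", "genre=Audiobook",
        dst,
    ]


def convert(src, dst, **tags):
    """Run the conversion; requires ffmpeg on PATH and raises if it fails."""
    subprocess.run(ffmpeg_audiobook_args(src, dst, **tags), check=True)


args = ffmpeg_audiobook_args(
    "01-introduction.aiff", "01-introduction.mp3",
    title="Introduction", author="Author", book="Book", track=1,
)
print(" ".join(args))
```

Passing the arguments as a list instead of a shell string avoids quoting problems when titles contain spaces or apostrophes. Chapter markers inside a single file are not covered here and still need a separate tagging step.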
Citations
- Google Cloud Text-to-Speech — official API documentation and Studio voice catalogue.
- Microsoft Azure Cognitive Services Speech — Neural and HD voice documentation.
- ElevenLabs — TTS API and voice library documentation.
- W3C — Speech Synthesis Markup Language (SSML) Version 1.1 specification.
- ID3v2 specification — MP3 metadata tag format reference.
- 17 U.S.C. § 107 — US fair-use copyright provisions.
Extract clean text for TTS in your browser
ScoutMyTool PDF to text runs client-side. Extract clean book text — cover to cover — then send only the text (not the original PDF) to your TTS service. The original PDF stays on your machine.