How to convert PDF to audio — text-to-speech for accessibility
By ScoutMyTool Editorial Team · Last updated: 2026-05-20
I listen to roughly half my research-paper backlog rather than read it. TTS quality has improved enough in the last few years that a paper read by a neural voice at 1.5× speed during a walk is genuinely useful intake. The setup is simple, but the tools vary widely in quality and cost. This article compares seven TTS approaches for PDFs, walks through converting a PDF to a portable MP3, and covers the accessibility considerations that separate a casual listening setup from one that meets WCAG requirements.
PDF-to-audio tools compared
| Tool | Cost | Voice quality | Best for |
|---|---|---|---|
| macOS Speech (built-in) | Free with Mac | Good (Siri voices) | Listening on Mac with no setup |
| Windows Narrator / Read Aloud | Free with Windows | Good (modern voices) | Listening on Windows |
| Adobe Acrobat Read Out Loud | Free Reader | OK (older TTS) | Quick built-in playback |
| iOS Speak Screen | Free with iOS | Good (Siri voices) | Listening on iPhone/iPad |
| NaturalReader (free + paid) | Free tier; paid for premium voices | Free OK; paid excellent | Higher-quality voice for long content |
| Google / Azure Cloud TTS | Pay-per-character | Excellent (neural voices) | Generating MP3 of long PDFs |
| Speechify / Voice Dream Reader | $9–$15/month | Excellent | Dedicated TTS reading workflow |
Step by step — convert PDF to MP3 with cloud TTS
- Extract the PDF text with ScoutMyTool PDF to text or any other extractor. For scanned PDFs, run OCR first.
- Clean the text. Remove page-header/footer noise, hyphenation artefacts, equation residue.
- Submit to a TTS service. Google Cloud TTS, Azure Cognitive Services Speech, or ElevenLabs accept text and return audio. Free tiers cover dozens of minutes per month.
- Download the MP3. File size: roughly 1 MB per minute of audio at 128 kbps.
- Sideload to your phone or upload to a podcast app supporting personal feeds. Listen at 1.2–1.5× speed for time efficiency.
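The cleaning and submission steps above can be sketched in Python. This is a minimal illustration, not a definitive pipeline: `clean_text`, `chunk_text`, and `MAX_CHARS` are hypothetical names, the regular expressions are heuristics that will need tuning per document, and the 4,500-character chunk size is an assumed safety margin below a per-request input limit (check your provider's documentation). The `synthesize_mp3` function assumes the third-party `google-cloud-texttospeech` package and configured credentials; other services use different calls.

```python
import re

# Assumed chunk size: cloud TTS APIs cap the input per request, so stay below it.
MAX_CHARS = 4500


def clean_text(raw: str) -> str:
    """Remove common PDF-extraction noise before sending text to TTS."""
    # Rejoin words hyphenated across a line break: "lis-\ntenable" -> "listenable".
    text = re.sub(r"(\w)-\n(\w)", r"\1\2", raw)
    # Drop lines that hold only a page number, e.g. "12" or "Page 12" (heuristic).
    text = re.sub(r"(?im)^[ \t]*(?:page[ \t]+)?\d+[ \t]*$", "", text)
    # Unwrap single line breaks but keep blank-line paragraph breaks.
    text = re.sub(r"(?<!\n)\n(?!\n)", " ", text)
    # Collapse runs of blank lines and of spaces.
    text = re.sub(r"\n{2,}", "\n\n", text)
    text = re.sub(r"[ \t]+", " ", text)
    return text.strip()


def chunk_text(text: str, max_chars: int = MAX_CHARS) -> list[str]:
    """Split text at sentence ends into chunks of up to max_chars.

    A single sentence longer than max_chars is kept whole, so it can
    still exceed the limit and would need further splitting.
    """
    sentences = re.split(r"(?<=[.!?])\s+", text)
    chunks, current = [], ""
    for sentence in sentences:
        candidate = f"{current} {sentence}".strip()
        if len(candidate) <= max_chars or not current:
            current = candidate
        else:
            chunks.append(current)
            current = sentence
    if current:
        chunks.append(current)
    return chunks


def synthesize_mp3(text: str) -> bytes:
    """Send one chunk to Google Cloud TTS and return MP3 bytes (needs credentials)."""
    # Imported here so the helpers above work without the package installed.
    from google.cloud import texttospeech

    client = texttospeech.TextToSpeechClient()
    response = client.synthesize_speech(
        input=texttospeech.SynthesisInput(text=text),
        voice=texttospeech.VoiceSelectionParams(language_code="en-US"),
        audio_config=texttospeech.AudioConfig(
            audio_encoding=texttospeech.AudioEncoding.MP3
        ),
    )
    return response.audio_content
```

A typical run would call `clean_text` on the extracted text, pass the result through `chunk_text`, and synthesize each chunk in order. Joining the resulting MP3 segments into one file is left out here; an audio tool is usually a safer way to merge them than raw byte concatenation.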
When TTS is and is not an accessibility solution
TTS audio is supplementary accessibility — useful for users who prefer audio consumption or who have temporary vision constraints (post-eye-surgery, light sensitivity). For users with permanent vision impairment relying on screen readers, the proper accessibility work is at the PDF-source level: tagged PDF with reading order, alt text on images, defined language tags, document structure metadata. The screen reader (JAWS, NVDA, VoiceOver, TalkBack) then produces high-quality on-demand TTS based on the structure.
Generating an MP3 from a PDF and distributing the audio is appropriate for general-audience convenience and as one form of accessibility offering, but it does not substitute for accessible-PDF authoring. Organisations producing PDFs subject to ADA Title III (US) or EU accessibility law (such as the Web Accessibility Directive or the European Accessibility Act) should invest in tagged-PDF production processes; the TTS-MP3 path is an addition, not a replacement.
Voice selection and listening fatigue
Voice choice affects how listenable long audio is. Neural voices vary in timbre, pace, and prosody, so the same content sounds different across voices. For long-form listening (research papers, books, multi-hour reports), spend five minutes auditioning two or three voices before committing; the one that sounds best in a 30-second sample may grate after 30 minutes. A common pattern is a calmer voice for evening or wind-down listening and a more energetic voice for morning or commute listening. The major TTS services let you choose a voice per generation, and voices within the same pricing tier cost the same.
Listening speed is the other major lever. Most listeners default to 1.0× (real-time) but can comfortably handle 1.25–1.5× within a few days of practice; experienced TTS listeners often run at 1.75–2.5× for familiar content. The faster you listen, the more material you can cover per commute — a 30-minute commute at 1.5× covers 45 minutes of content; at 2× it covers an hour. Build up speed gradually rather than jumping in at maximum, and reduce when content becomes unfamiliar or you notice comprehension dropping.
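The speed arithmetic above is simple enough to check in a few lines of Python. The function names are illustrative only.

```python
def content_covered(listening_minutes: float, speed: float) -> float:
    """Minutes of original-pace content heard in a given listening window."""
    if speed <= 0:
        raise ValueError("speed must be positive")
    return listening_minutes * speed


def listening_time(content_minutes: float, speed: float) -> float:
    """Minutes needed to hear content of a given original-pace length."""
    if speed <= 0:
        raise ValueError("speed must be positive")
    return content_minutes / speed


print(content_covered(30, 1.5))  # 45.0
print(content_covered(30, 2))    # 60
print(listening_time(60, 1.5))   # 40.0
```

The second function answers the reverse question: an hour-long recording takes 40 minutes at 1.5×.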
Related reading
- PDF accessibility: tagged PDFs and screen-reader support.
- PDF to text: extraction step before TTS.
- Searchable PDF: OCR scanned PDFs before TTS.
- Read PDFs faster: listening at 1.5× as a speed-reading technique.
- PDF tools for students: listening to course readings.
FAQ
- Will modern text-to-speech sound like a real person?
- Neural TTS (Google, Azure, ElevenLabs, Apple Siri voices) sounds nearly indistinguishable from human narration for English and a growing set of other languages. The pause patterns, intonation, and emphasis are close to natural; most listeners would not notice it is synthetic in a casual setting. The older "robotic" TTS (Windows SAPI, older Acrobat voices) sounds dated and tires the ear quickly. For long content (whole books, multi-hour reports), neural TTS is genuinely listenable; legacy TTS works for short content but loses listenability after 20–30 minutes.
- Can I convert a PDF to MP3 and listen on my phone?
- Yes. There are three workflows. First, use a service that exports audio: NaturalReader, Speechify, and cloud TTS services (Google, Azure, ElevenLabs) export MP3 files you can download. Second, run a screen recorder (Mac: QuickTime; Windows: Xbox Game Bar) while a TTS reader speaks the PDF; this captures the audio in a video file from which you can extract an MP3. Third, use a command-line tool: extract the PDF text with `pdftotext`, feed it to a TTS engine such as Coqui TTS (open source) or a cloud API, and save the resulting audio. Long PDFs produce long files: a document that yields a 1-hour audio file at normal speech rate takes about 40 minutes to hear at 1.5× speed.
- How accessible is built-in OS speech vs paid apps?
- For listening, built-in OS speech (macOS Speech, Windows Narrator, iOS Speak Screen) is quite good in 2026 — modern Siri voices, neural voices on Windows 11+. They cover the basic use case (listen to a PDF on the go) for free. Paid apps (Speechify, Voice Dream Reader, NaturalReader Premium) add features: faster playback (up to 4×), better navigation through long documents, library management of multiple PDFs to listen through in sequence, OCR for scanned PDFs. For occasional listening, OS built-in is sufficient; for daily heavy listening, the paid app pays back in workflow polish.
- How do I get TTS for a scanned PDF that has no text layer?
- OCR first. Run ScoutMyTool Make PDF Searchable or any OCR tool to add a text layer to the scanned PDF. After OCR, TTS readers can extract the text and speak it. Without OCR, TTS gets nothing — the PDF is just an image of text. Quality of TTS output depends on OCR quality: clean scans with 99% OCR accuracy produce smooth audio; lower-accuracy OCR introduces mispronunciations that disrupt listening. For high-stakes audio (audiobook production, accessibility for visually impaired readers), prefer high-DPI scans and quality OCR upstream.
- Is TTS conversion appropriate for accessibility — meeting WCAG requirements?
- TTS is one accessibility mode among several. WCAG 2.1 recommends documents be navigable by screen readers (which include their own TTS); ensuring your PDFs are screen-reader compatible (tagged PDFs with proper heading structure, alt text on images, defined reading order) is the foundational accessibility work. TTS-generated audio (MP3) is supplementary — useful for distribution to listeners who prefer audio, but not a substitute for proper screen-reader support. For organisations producing public-facing PDFs subject to ADA or EU accessibility regulations, both are typically required.
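The command-line workflow and the scanned-PDF check from the answers above can be sketched as follows. This is a rough illustration under stated assumptions: it expects Poppler's `pdftotext` on the PATH, assumes the extractor separates pages with form-feed characters, and uses an assumed narration rate of 150 words per minute, which varies by voice. All function names and the `min_chars` threshold are hypothetical, and the TTS step itself is omitted.

```python
import subprocess

WORDS_PER_MINUTE = 150  # assumed narration rate at 1.0x; varies by voice


def pdftotext_command(pdf_path: str) -> list[str]:
    """Build the pdftotext call; "-" as the output file means stdout."""
    return ["pdftotext", "-enc", "UTF-8", pdf_path, "-"]


def extract_text(pdf_path: str) -> str:
    """Run pdftotext and return the extracted text (requires Poppler)."""
    result = subprocess.run(
        pdftotext_command(pdf_path), capture_output=True, check=True
    )
    return result.stdout.decode("utf-8", errors="replace")


def pages_needing_ocr(extracted: str, min_chars: int = 20) -> list[int]:
    """Return 1-based numbers of pages with almost no extractable text."""
    pages = extracted.split("\f")
    # Assumes a form feed follows the last page, leaving one empty trailing piece.
    if pages and pages[-1].strip() == "":
        pages = pages[:-1]
    return [
        number
        for number, page in enumerate(pages, start=1)
        if len(page.strip()) < min_chars
    ]


def estimate_audio(text: str, speed: float = 1.0, kbps: int = 128) -> tuple[float, float]:
    """Return (listening minutes at the given speed, MP3 size in MB at 1.0x)."""
    minutes = len(text.split()) / WORDS_PER_MINUTE
    # kilobits per second * seconds / 8 = kilobytes; / 1000 = megabytes.
    size_mb = minutes * 60 * kbps / 8 / 1000
    return minutes / speed, size_mb
```

If `pages_needing_ocr(extract_text("paper.pdf"))` returns page numbers, those pages likely have no text layer and need OCR before TTS. Under the assumed rate, 9,000 words come to about 60 minutes of audio, roughly 58 MB at 128 kbps, or 40 minutes of listening at 1.5×.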
Citations
- WCAG 2.1 — Web Content Accessibility Guidelines, including audio-version recommendations.
- ISO 14289 — PDF/UA accessibility standard.
- Apple — VoiceOver and Speak Screen documentation.
- Microsoft — Narrator and Read Aloud documentation.
- Google Cloud Text-to-Speech and Azure Cognitive Services Speech — neural TTS service documentation.
Extract clean text before TTS
ScoutMyTool PDF to text runs in the browser. Clean extraction produces better TTS output than raw PDF feeding into a TTS engine.