How to make a PDF searchable with OCR, free

How OCR adds an invisible text layer to a scanned PDF, which engine to pick, and how to maximise accuracy.

10 min read

By ScoutMyTool Editorial Team · Last updated: 2026-05-20

Introduction

I inherited a filing cabinet of old paper records last year, scanned the whole thing to PDF over a weekend, and confidently opened the first file looking for the word "tenancy". Cmd+F returned zero hits. The PDF was a photograph of a page, not a searchable document: the scanner had given me 400 pages of pixels with no text layer underneath. Running them through OCR took twenty minutes and turned the archive from a paperweight into a searchable database. This article is the practical version of what OCR actually does, why a 300 DPI scan is recognised far more reliably than a sub-200 DPI scan, and which free tools handle the job in 2026.

What OCR actually does to a PDF

A PDF page is either text-based (every glyph stored as a character in a font) or image-based (the page is just a picture). Text-based PDFs are inherently searchable: the text layer is part of the file structure per the ISO 32000-1 specification. Image-based PDFs are not: there are no characters in the file, only pixels arranged to look like characters.[1]

OCR (Optical Character Recognition) bridges the gap. The OCR engine reads the pixels, identifies clusters that look like glyphs, decides which character each cluster represents using a trained model, and writes the recognised text into a new invisible text layer placed behind the original page image. The visible page is unchanged; every pixel sits where it did. Underneath, a positioned text stream now mirrors what is visually on the page, so Cmd+F search works, copy-paste extracts words, and screen readers can read the document aloud.

This dual-layer architecture is what makes "searchable scanned PDF" possible: the visual fidelity of the original scan plus the searchability of a born-digital document, in a single file.
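
To tell the two kinds of page apart from the command line, one option is to ask a text extractor what it can find. The sketch below assumes pdftotext from poppler-utils is installed; has_text_layer is a hypothetical helper name, not part of any tool mentioned here. An image-only PDF yields nothing but page breaks, so the count of non-whitespace characters is zero.

```shell
# Sketch: succeed (exit 0) if pdftotext can extract any non-whitespace
# character from the PDF, i.e. the file already carries a text layer.
# Assumes pdftotext (poppler-utils) is on PATH.
has_text_layer() {
  command -v pdftotext >/dev/null 2>&1 || {
    echo "pdftotext not found" >&2
    return 2
  }
  # "-" sends the extracted text to stdout; form feeds between pages
  # count as whitespace and are stripped by tr.
  chars=$(pdftotext "$1" - 2>/dev/null | tr -d '[:space:]' | wc -c | tr -d ' ')
  [ "$chars" -gt 0 ]
}
```

Typical use would be something like: has_text_layer scan.pdf && echo "already searchable" || echo "needs OCR". A file with text on only some pages still passes this check, so treat it as a quick triage rather than a per-page audit.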

Six OCR engines compared

| Engine | Licence | Runs locally? | Languages | Notes |
|---|---|---|---|---|
| Tesseract 5 (LSTM) | Free, open source (Apache 2.0) | Yes: CLI binary on Linux / macOS / Windows | 100+ via downloadable data files (tessdata) | The de-facto OCR standard. Originally developed at HP, developed by Google from 2006 to 2018, now community-maintained. LSTM-based engine since v4; accurate on clean modern scans, struggles on handwriting and degraded images. |
| ocrmypdf (CLI wrapper) | Free, open source (MPL 2.0) | Yes: wraps Tesseract for PDF input/output | Inherits Tesseract | The right command-line tool for PDF OCR on Linux/macOS. Handles deskewing, image cleanup, and text-layer insertion. Install via apt or brew. |
| ScoutMyTool PDF OCR | Free, ad-supported | Yes: browser-based (Tesseract.js in the page) | Major Latin-script languages bundled; Cyrillic, CJK, Arabic available | Runs in your browser tab; the PDF never uploads. Convenient for one-off OCR without installing tooling. |
| Adobe Acrobat Pro OCR | Subscription ($19.99/mo as of May 2026) | Yes: bundled with Acrobat | 40+ supported | Best layout preservation among the options listed; expensive if OCR is your only Acrobat need. |
| Google Document AI (cloud) | Paid (~$1.50 / 1k pages) | No: uploads to Google Cloud | 200+ | Highest accuracy on degraded/handwritten content; the trade-off is that the document leaves your machine. |
| Amazon Textract / Azure Form Recognizer | Paid (cloud per-page) | No: uploads to AWS / Azure | Major languages | Designed for forms and tables. Worth the cost when extracting structured data, not just text. |

Tesseract is the engine that underlies every free OCR option in the table: ocrmypdf wraps it for PDFs, and ScoutMyTool runs Tesseract.js (the WebAssembly port) in the browser. Tesseract was developed at HP between 1985 and 1994, open-sourced in 2005, and developed by Google from 2006 to 2018; it is now maintained by its open-source community.[2] The current LSTM-based v5 engine is the right default for most workflows; cloud OCR services win on degraded scans and handwriting, but at the cost of uploading the document.

Seven factors that decide OCR accuracy

"OCR accuracy" varies from "near-perfect" to "useless" depending almost entirely on input quality. Optimise these seven factors at scan time and you can usually skip cloud OCR; let any of them slip and even cloud OCR will struggle.

| Factor | Good target | Notes |
|---|---|---|
| Source DPI | ≥ 300 DPI for printed text; 600 DPI for fine print | Below 200 DPI accuracy drops sharply. Scan at the source if possible; up-sampling a low-resolution scan does not recover detail. |
| Contrast | Black text on white background | Light grey text on cream paper hurts accuracy. Adjust contrast in pre-processing if the original is washed out. |
| Skew (rotation) | < 1° rotation | OCR engines auto-deskew, but heavy skew degrades accuracy. ocrmypdf has --deskew built-in. |
| Page noise | Clean background | Speckle, JPEG artefacts, and hole-punches at the margins reduce accuracy. ocrmypdf --clean handles common cases. |
| Font | Standard serif or sans-serif at ≥ 8 pt | Decorative or condensed fonts and small print are harder. Mathematical typesetting is particularly hard. |
| Language | Single language per document or explicit multi-language hint | Mixed Latin/CJK/Arabic without telling the engine which to expect produces garbled output. Pass language hints to the OCR engine. |
| Document type | Printed text | Handwriting OCR is much less accurate than print OCR; cursive in particular. Cloud engines do better than Tesseract on handwriting. |

The National Archives (NARA) digitisation guidelines specifically call for ≥300 DPI for printed text and ≥400 DPI for documents with smaller print sizes; these thresholds reflect what OCR engines need to be reliable.[3] Scanning below this threshold to save disk space is a false economy if you ever plan to OCR the result.
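
To see why DPI matters so much, it helps to work out how many pixels the engine actually gets per character. The arithmetic below is a minimal sketch: a point is 1/72 inch, so the nominal height of a font in pixels is pt × dpi / 72. The function names glyph_px and mm_to_px are made up for this illustration, and the numbers say nothing about where any particular engine starts to fail.

```shell
# Nominal em height in pixels for a font size (pt) scanned at a given DPI.
# 1 pt = 1/72 inch; integer arithmetic, rounded down.
glyph_px() {
  echo $(( $1 * $2 / 72 ))
}

# Page dimension in pixels for a length in millimetres at a given DPI.
# 1 inch = 25.4 mm; the +127 rounds to the nearest pixel.
mm_to_px() {
  echo $(( ($1 * $2 * 10 + 127) / 254 ))
}

glyph_px 8 300     # prints 33
glyph_px 8 200     # prints 22
mm_to_px 210 300   # prints 2480 (A4 width)
mm_to_px 297 300   # prints 3508 (A4 height)
```

Dropping from 300 to 200 DPI leaves an 8 pt font with roughly a third fewer pixel rows, and the lowercase letters only occupy part of that nominal height.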

Adding OCR via ScoutMyTool: five steps

  1. Open the tool. Go to scoutmytool.com/pdf/pdf-ocr. The page loads as static HTML and the OCR runs entirely in your browser tab using Tesseract.js.
  2. Drop in the scanned PDF. The tool parses the file with pdf-lib and rasterises each page to a canvas at the existing scan resolution.
  3. Pick the language(s). Default is English; the picker shows the most-common languages. Pick multiple if your document is multilingual; be sparing because additional languages slow recognition.
  4. Run OCR. The tool walks every page, performs OCR, and writes the recognised text into a new invisible text layer behind the original image. A progress indicator shows per-page status; expect 2 to 10 seconds per page on modern hardware.
  5. Download and verify. The output PDF has the same visual content plus a searchable layer. Verify by opening it in any reader and pressing Cmd+F / Ctrl+F; search should now find words. For higher-stakes documents, spot-check the recognised text against the visual content on a few pages.
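
The verification in step 5 can also be done from a terminal, which is handy when there are many files to check. This is a sketch only, assuming pdftotext from poppler-utils is installed; count_hits is a hypothetical helper name. It counts case-insensitive occurrences of a word in the extracted text layer, which is roughly what Cmd+F does.

```shell
# Sketch: count case-insensitive occurrences of a word in a PDF's text layer.
# $1 = PDF path, $2 = word to look for (treated as a fixed string).
# Assumes pdftotext (poppler-utils) is on PATH.
count_hits() {
  pdftotext "$1" - 2>/dev/null | grep -o -i -F -- "$2" | wc -l | tr -d ' '
}
```

For example, count_hits output.pdf tenancy should print a number above zero once OCR has run; 0 on a document that visibly contains the word suggests the text layer is missing or badly recognised.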

The CLI path: ocrmypdf for batch jobs

For batch OCR on a directory of PDFs, the right tool is ocrmypdf, a free command-line wrapper around Tesseract that handles PDF input/output, deskewing, image cleanup, and language selection. Install via apt, brew, or pip; usage is straightforward:

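# Single file, English OCR with deskew + cleanup (--clean needs unpaper installed)
ocrmypdf --deskew --clean input.pdf output.pdf

# Multi-language (English + French)
ocrmypdf -l eng+fra input.pdf output.pdf

# Batch a directory; write results to ocr/ so a re-run
# does not pick up its own output files
mkdir -p ocr
for f in *.pdf; do
  ocrmypdf "$f" "ocr/$f"
done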

ocrmypdf can also force a re-OCR on a PDF that already has a text layer (--force-ocr, useful when the existing layer is wrong), skip pages that already have a text layer (--skip-text, useful for mixed digital-and-scanned documents), and embed colour profiles for archival workflows. For Linux / macOS users, it is the most efficient way to handle hundreds or thousands of PDFs.
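
For a mixed archive, a slightly more defensive batch loop may be worth the extra lines. The sketch below is one way to do it, not the only one: ocr_dir is a hypothetical function name, and it assumes ocrmypdf is on PATH. It writes to a separate output directory, passes --skip-text so born-digital pages are left alone, and reports which files failed instead of stopping at the first error.

```shell
# Sketch: OCR every PDF in a source directory into an output directory.
# $1 = source directory, $2 = output directory (created if missing).
# Returns non-zero if any file failed.
ocr_dir() {
  src=$1
  dst=$2
  mkdir -p "$dst" || return 1
  failed=0
  for f in "$src"/*.pdf; do
    [ -e "$f" ] || continue   # glob matched nothing
    name=$(basename "$f")
    # --skip-text: pages that already contain text are not OCRed again
    if ! ocrmypdf --skip-text --deskew "$f" "$dst/$name"; then
      echo "failed: $f" >&2
      failed=$((failed + 1))
    fi
  done
  [ "$failed" -eq 0 ]
}
```

Calling ocr_dir scans ocr would process scans/*.pdf into ocr/. Keeping input and output apart means the loop can be re-run after a failure without re-processing its own results.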

  • PDF OCR: the primary tool referenced.
  • OCR PDF: alternate entry point to the same OCR engine.
  • PDF OCR Extract Text Only: produce a .txt file from a scanned PDF, no PDF rebuild.
  • Compress PDF: shrink scanned PDFs before or after OCR.
  • PDF to Text: extract text from an already-OCRed PDF.
  • Tagged PDF Validator: OCR is the prerequisite for tagging scanned documents.
  • PDF accessibility: sister article; OCR is the foundation of accessible scanned PDFs.
  • PDF/A conversion: sister article; archive-quality PDFs typically require OCR first.

Frequently asked questions

What does OCR actually do to a PDF?
OCR (Optical Character Recognition) reads the pixels of each page, identifies which pixel clusters form characters, recognises which character each cluster represents, and writes the recognised text into a new invisible text layer that sits behind the original page image. The visible page looks identical; the underlying PDF now contains a searchable, selectable, screen-reader-accessible text layer. The file becomes searchable in any PDF reader, the text can be copied with Cmd+C / Ctrl+C, and accessibility tools can read the document aloud.
Do I need OCR if the PDF was created from a word processor?
No. PDFs exported from Word, Google Docs, LibreOffice, or any digital authoring tool already contain a text layer: every glyph on the page is stored as text, not just as a pixel shape. The PDF is already searchable. OCR is only needed for PDFs whose pages are images: scanned paper documents, photographs of pages, screenshots of text, faxes converted to PDF. To check, open the PDF and try to select a word with click-drag. If the word highlights, you already have a text layer; if nothing highlights, you need OCR.
How accurate is OCR in 2026?
For modern printed text on a clean 300 DPI scan, accuracy is typically 95 to 99 percent character-level for Latin-script languages, slightly lower for Cyrillic, lower still for CJK, and lowest for Arabic and Indic scripts. Tesseract 5's LSTM engine matches commercial OCR in most cases for clean inputs. Cloud OCR (Google Document AI, Amazon Textract) edges out Tesseract on degraded scans and handwriting. Accuracy drops sharply with low DPI, low contrast, complex multi-column layouts, decorative fonts, and handwriting. For mission-critical accuracy, OCR is a starting point that needs human review, not a final answer.
Which OCR tool should I pick: offline or cloud?
Offline (Tesseract, ocrmypdf, ScoutMyTool) for: confidential documents that should not leave your machine; bulk processing where per-page cloud cost would add up; one-off small jobs where setup overhead of cloud accounts is not worth it. Cloud (Google Document AI, AWS Textract, Azure Form Recognizer) for: degraded scans where local OCR accuracy is insufficient; structured-data extraction from forms or tables; very large batches where the cloud provider's scaling matters more than the per-page cost. For the privacy-conscious default, offline wins; for the highest accuracy on hard inputs, cloud wins.
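
The "per-page cloud cost would add up" point is easy to put numbers on. The sketch below uses the approximate rate from the comparison table (~$1.50 per 1,000 pages, i.e. 150 cents); that figure is this article's estimate, not a quoted price, and cloud_cost is a hypothetical helper name.

```shell
# Sketch: estimated cloud OCR cost in dollars.
# $1 = number of pages, $2 = rate in cents per 1,000 pages.
# Works in whole cents and rounds up to the next cent.
cloud_cost() {
  cents=$(( ($1 * $2 + 999) / 1000 ))
  printf '%d.%02d\n' $(( cents / 100 )) $(( cents % 100 ))
}

cloud_cost 400 150      # prints 0.60
cloud_cost 10000 150    # prints 15.00
```

At that rate a 400-page archive costs well under a dollar, so for small jobs the deciding factor is usually privacy and setup effort rather than price.
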
How do I OCR a multi-language document?
Tell the engine which languages to expect. Tesseract accepts a -l flag with one or more language codes: -l eng+fra runs both English and French recognition; -l chi_sim+eng runs simplified Chinese + English. ScoutMyTool exposes a language picker in the UI. Without the hint, the engine uses a default (usually English) and produces garbled output for other-language portions. Adding too many languages slows recognition and can hurt accuracy on the dominant language; pick the smallest set that covers your document.
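
A common failure is asking for a language whose data file is not installed. The sketch below builds the -l value and checks it against what Tesseract reports; join_langs and missing_langs are hypothetical helper names, and the check assumes the tesseract binary is on PATH and that tesseract --list-langs prints one language code per line after a header.

```shell
# Join language codes with "+", the form the -l flag expects.
join_langs() {
  out=""
  for code in "$@"; do
    out="${out:+$out+}$code"
  done
  printf '%s\n' "$out"
}

# Print each requested code that `tesseract --list-langs` does not report.
missing_langs() {
  installed=$(tesseract --list-langs 2>&1)
  for code in "$@"; do
    # -x matches whole lines only, so "eng" does not match "eng_old"
    printf '%s\n' "$installed" | grep -qx -- "$code" || printf '%s\n' "$code"
  done
}
```

For example, join_langs eng fra prints eng+fra, which can be passed as ocrmypdf -l "$(join_langs eng fra)" input.pdf output.pdf. If missing_langs eng fra prints fra, the French data file needs installing first; on Debian and Ubuntu that is typically the tesseract-ocr-fra package.
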
Is my PDF uploaded when I use ScoutMyTool OCR?
No. ScoutMyTool's PDF OCR tool runs Tesseract.js, a WebAssembly port of Tesseract, entirely in your browser tab. The PDF is parsed locally with pdf-lib, each page is rasterised to a canvas in the browser, Tesseract.js performs OCR on the canvas, and the recognised text is written into a new text layer in the output PDF. Nothing transits a server. Verify by watching the browser network tab during the operation: zero outbound requests carrying file bytes. For documents containing PII, medical information, or trade secrets, this is materially different from server-uploading OCR services.
Can OCR recover the original layout, or just the text?
It depends on the engine. Tesseract by default produces a flat text layer that follows reading order but loses table structure and multi-column layout. Tesseract's hOCR / ALTO output formats preserve bounding boxes for each word, which post-processing tools use to reconstruct columns and tables. Cloud engines (Google Document AI, Textract) are designed around structured layouts and return tables as structured JSON. For pure search-the-document workflows, flat text is fine; for "extract this scanned spreadsheet into Excel" workflows, a structured-layout engine is worth the cost.
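
To make the bounding-box point concrete: running tesseract page.png page -l eng hocr writes page.hocr, an HTML file in which each recognised word is a span of class ocrx_word whose title attribute carries "bbox x0 y0 x1 y1". The sketch below pulls those out with grep and sed. It is deliberately naive: hocr_words is a hypothetical helper, it assumes the class attribute comes first in the tag (as in Tesseract's output), and it ignores words wrapped in nested markup such as bold or italic tags. A real pipeline would use an HTML parser.

```shell
# Sketch: print "x0 y0 x1 y1 word" for each ocrx_word span in an hOCR file.
# $1 = path to a .hocr file produced by Tesseract.
hocr_words() {
  grep -o "<span class=['\"]ocrx_word['\"][^>]*>[^<]*" "$1" |
    sed -E 's/.*bbox ([0-9]+) ([0-9]+) ([0-9]+) ([0-9]+).*>(.*)/\1 \2 \3 \4 \5/'
}
```

Sorting that output by the x and y columns is the starting point for column and table reconstruction; the flat text layer alone does not carry enough information for it.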

OCR your scanned PDF in the browser, free

Browser-based; runs Tesseract.js client-side. Nothing is uploaded. Outputs a searchable PDF with the original visual content preserved.

Open the free PDF OCR tool →

References

  1. ISO 32000-1:2008, Document management - Portable document format - Part 1: PDF 1.7. Public reference copy: opensource.adobe.com PDF32000_2008. Text-showing operators (Tj, TJ, etc.) and font-text content streams described in §9.
  2. Tesseract OCR project (Google, formerly HP), Tesseract User Manual and source. tesseract-ocr.github.io and github.com/tesseract-ocr/tesseract (accessed May 2026). The de-facto open-source OCR engine.
  3. US National Archives and Records Administration (NARA), Technical Guidelines for Digitizing Cultural Heritage Materials. archives.gov, Digitization Guidelines (accessed May 2026). Authoritative US federal guidance on scanning DPI, contrast, and source conditions for archival OCR.