How to make a PDF searchable with OCR, free (2026)

By ScoutMyTool Editorial Team · Last updated: 2026-05-20

I inherited a filing cabinet of old paper records last year, scanned the whole thing to PDF over a weekend, and confidently opened the first file looking for the word "tenancy". Cmd+F returned zero hits. The PDF was a photograph of a page, not a searchable document: the scanner had given me 400 pages of pixels with no text layer underneath. Running them through OCR took twenty minutes and turned the archive from a paperweight into a searchable database. This article is the practical version of what OCR actually does, why scanning at 300 DPI or higher matters for accuracy, and which free tools handle the job in 2026.

What OCR actually does to a PDF

A PDF page is either text-based (every glyph stored as a character in a font) or image-based (the page is just a picture). Text-based PDFs are inherently searchable — the text layer is part of the file structure per the ISO 32000-1 specification. Image-based PDFs are not — there are no characters in the file, only pixels arranged to look like characters.¹

OCR (Optical Character Recognition) bridges the gap. The OCR engine reads the pixels, identifies clusters that look like glyphs, decides which character each cluster represents using a trained model, and writes the recognised text into a new invisible text layer placed behind the original page image. The visible page is unchanged — every pixel sits where it did. Underneath, a positioned text stream now mirrors what is visually on the page, so Cmd+F search works, copy-paste extracts words, and screen readers can read the document aloud.

This dual-layer architecture is what makes "searchable scanned PDF" possible: the visual fidelity of the original scan plus the searchability of a born-digital document, in a single file.
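A quick way to tell the two kinds of PDF apart from a terminal is to extract whatever text the file holds and see whether anything comes back. The sketch below assumes pdftotext from Poppler is installed; the function name has_text_layer and the PDFTOTEXT override are illustrative names, not part of any tool.

```shell
# has_text_layer FILE
# Exit status 0 if the PDF yields any non-whitespace text, 1 otherwise.
# Assumes pdftotext (Poppler) is on PATH. PDFTOTEXT is a hypothetical
# override for any extractor that accepts the same "FILE -" arguments.
has_text_layer() {
  # "-" sends the extracted text to stdout; image-only pages produce
  # nothing but whitespace and form feeds, which the pattern ignores.
  "${PDFTOTEXT:-pdftotext}" "$1" - 2>/dev/null | grep -q '[^[:space:]]'
}
```

For example, has_text_layer scan.pdf || echo "needs OCR" prints the message for an image-only scan. A file with a poor or partial text layer still passes, so this only answers whether a layer exists, not whether it is accurate.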

Six OCR engines compared

Engine	Licence	Runs locally?	Languages	Notes
Tesseract 5 (LSTM)	Free, open source (Apache 2.0)	Yes, as a CLI binary on Linux / macOS / Windows	100+ via downloadable data files (tessdata)	The de facto OCR standard. Originally developed at HP, sponsored by Google from 2006, and now maintained as a community open-source project. LSTM-based engine since v4; accurate on clean modern scans, struggles on handwriting and degraded images.
ocrmypdf (CLI wrapper)	Free, open source (MPL)	Yes, wraps Tesseract for PDF input/output	Inherits Tesseract	The right command-line tool for PDF OCR on Linux/macOS. Handles deskewing, image cleanup, and text-layer insertion automatically. Installs via apt or brew.
ScoutMyTool PDF OCR	Free, ad-supported	Yes, browser-based (Tesseract.js in the page)	Major Latin-script languages bundled; Cyrillic, CJK, Arabic available	Runs in your browser tab; the PDF never uploads. Convenient for one-off OCR without installing tooling.
Adobe Acrobat Pro OCR	Subscription ($19.99/mo as of May 2026)	Yes, bundled with Acrobat	40+ supported	Best layout preservation among the options listed; expensive if OCR is your only Acrobat need.
Google Document AI (cloud)	Paid (~$1.50 / 1k pages)	No, uploads to Google Cloud	200+	Highest accuracy on degraded/handwritten content; the trade-off is that the document leaves your machine.
Amazon Textract / Azure Form Recognizer	Paid (cloud per-page)	No, uploads to AWS / Azure	Major languages	Designed for forms and tables. Worth the cost when extracting structured data, not just text.

Tesseract is the engine behind every free OCR option in the table: ocrmypdf wraps it for PDFs, and ScoutMyTool runs Tesseract.js (the WebAssembly port) in the browser. Tesseract was developed at HP from 1985, open-sourced in 2005, sponsored by Google from 2006, and is now maintained as a community open-source project.² The current LSTM-based v5 engine is the right default for most workflows; cloud OCR services win on degraded scans and handwriting, but at the cost of uploading the document.
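Per-page cloud pricing is easy to sanity-check before committing to an engine. The sketch below is plain arithmetic; the function name cloud_cost is illustrative, and the 1.50 default is only the approximate Document AI figure from the table, not a live price.

```shell
# cloud_cost PAGES [USD_PER_1000_PAGES]
# Prints the estimated bill in dollars, to two decimal places.
# The 1.50 default mirrors the approximate rate quoted in the table above;
# pass the provider's current rate as the second argument.
cloud_cost() {
  awk -v pages="$1" -v rate="${2:-1.50}" \
    'BEGIN { printf "%.2f\n", pages * rate / 1000 }'
}
```

cloud_cost 400 prints 0.60 and cloud_cost 250000 prints 375.00, so for a small archive the choice between local and cloud OCR turns on privacy and accuracy rather than price.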

Seven factors that decide OCR accuracy

"OCR accuracy" varies from "near-perfect" to "useless" depending almost entirely on input quality. Optimise these seven factors at scan time and you can usually skip cloud OCR; let any of them slip and even cloud OCR will struggle.

Factor	Good target	Notes
Source DPI	≥ 300 DPI for printed text; 600 DPI for fine print	Below 200 DPI accuracy drops sharply. Scan at the source if possible; up-sampling a low-resolution scan does not recover detail.
Contrast	Black text on white background	Light grey text on cream paper hurts accuracy. Adjust contrast in pre-processing if the original is washed out.
Skew (rotation)	< 1° rotation	OCR engines auto-deskew, but heavy skew degrades accuracy. ocrmypdf has --deskew built-in.
Page noise	Clean background	Speckle, JPEG artefacts, and hole-punches at the margins reduce accuracy. ocrmypdf --clean handles common cases.
Font	Standard serif or sans-serif at ≥ 8 pt	Decorative or condensed fonts and small print are harder. Mathematical typesetting is particularly hard.
Language	Single language per document or explicit multi-language hint	Mixed Latin/CJK/Arabic without telling the engine which to expect produces garbled output. Pass language hints to the OCR engine.
Document type	Printed text	Handwriting OCR is much less accurate than print OCR; cursive in particular. Cloud engines do better than Tesseract on handwriting.

The National Archives (NARA) digitisation guidelines specifically call for ≥300 DPI for printed text and ≥400 DPI for documents with smaller print sizes — these thresholds reflect what OCR engines need to be reliable.³ Scanning below this threshold to save disk space is a false economy if you ever plan to OCR the result.
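If you only know a scan's pixel dimensions, the effective DPI is pixels divided by the physical page size in inches. The sketch below does that arithmetic and maps the result onto the thresholds in the table; the function names are illustrative, and the "marginal" band between 200 and 300 DPI is a labelling assumption of this example, not a standard.

```shell
# scan_dpi PIXELS INCHES
# Prints the effective resolution along one axis, rounded to a whole DPI.
scan_dpi() {
  awk -v px="$1" -v inches="$2" 'BEGIN { print int(px / inches + 0.5) }'
}

# dpi_verdict DPI
# Classifies a DPI value against the targets discussed above.
dpi_verdict() {
  if [ "$1" -ge 300 ]; then
    echo "ok for printed text"
  elif [ "$1" -ge 200 ]; then
    echo "marginal: rescan if possible"
  else
    echo "too low: expect poor accuracy"
  fi
}
```

An A4 page is 8.27 inches wide, so a scan 2480 pixels across gives scan_dpi 2480 8.27, which prints 300. If Poppler is installed, pdfimages -list scan.pdf reports pixel dimensions for each embedded page image.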

Adding OCR via ScoutMyTool — five steps

1. Open the tool. Go to scoutmytool.com/pdf/pdf-ocr. The page loads as static HTML and the OCR runs entirely in your browser tab using Tesseract.js.
2. Drop in the scanned PDF. The tool parses the file with pdf-lib and rasterises each page to a canvas at the existing scan resolution.
3. Pick the language(s). The default is English; the picker shows the most common languages. Pick more than one if your document is multilingual, but be sparing, because each additional language slows recognition.
4. Run OCR. The tool walks every page, performs OCR, and writes the recognised text into a new invisible text layer behind the original image. A progress indicator shows per-page status; expect 2–10 seconds per page on modern hardware.
5. Download and verify. The output PDF has the same visual content plus a searchable layer. Verify by opening it in any reader and pressing Cmd+F / Ctrl+F; search should now find words. For higher-stakes documents, spot-check the recognised text against the visual content on a few pages.
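The per-page timing in step 4 scales linearly, so it is worth estimating the total before starting a large file in a browser tab. This is a rough sketch using only the 2 to 10 seconds-per-page range quoted above; the function name ocr_time_range is illustrative, and real timings depend on hardware, page size, and the number of languages selected.

```shell
# ocr_time_range PAGES
# Prints "LOW-HIGH min" for a document of PAGES pages, using the
# 2 s (best case) and 10 s (worst case) per-page figures from the steps above.
ocr_time_range() {
  awk -v pages="$1" 'BEGIN {
    low  = pages * 2  / 60   # minutes at 2 seconds per page
    high = pages * 10 / 60   # minutes at 10 seconds per page
    printf "%.0f-%.0f min\n", low, high
  }'
}
```

ocr_time_range 400 prints 13-67 min, which is the point where a batch job on the command line starts to look more comfortable than keeping a tab open.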

The CLI path — ocrmypdf for batch jobs

For batch OCR on a directory of PDFs, the right tool is ocrmypdf — a free command-line wrapper around Tesseract that handles PDF input/output, deskewing, image cleanup, and language selection. Install via apt, brew, or pip; usage is straightforward:

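# Single file, English OCR with deskew + cleanup
# (--clean needs the unpaper program installed)
ocrmypdf --deskew --clean input.pdf output.pdf

# Multi-language (English + French); both Tesseract language packs must be installed
ocrmypdf -l eng+fra input.pdf output.pdf

# Batch a directory, writing results to ocr/ so a rerun does not re-OCR earlier outputs
mkdir -p ocr
for f in *.pdf; do
  [ -e "$f" ] || continue   # the glob matched nothing
  ocrmypdf "$f" "ocr/$f"
done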

ocrmypdf can also force a re-OCR on a PDF that already has a text layer (--force-ocr, useful when the existing layer is wrong), skip pages that already have text (--skip-text, useful for mixed digital-and-scanned documents), and write PDF/A output with an embedded colour profile for archival workflows, which is its default output type. For Linux / macOS users, it is the most efficient way to handle hundreds or thousands of PDFs.
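For a nested archive rather than a single folder, a slightly longer loop helps: walk the tree, mirror the folder layout in an output directory, and record which files failed so one bad scan does not stop the run. This is a sketch, not a hardened script. The function name ocr_tree, the failed.txt log, and the OCR_CMD override are all hypothetical, and the output directory is assumed to sit outside the source tree.

```shell
# ocr_tree SRC_DIR OUT_DIR
# OCRs every *.pdf under SRC_DIR into OUT_DIR, keeping relative paths.
# --skip-text leaves pages that already contain text untouched.
# OCR_CMD is a hypothetical override (default: ocrmypdf) for dry runs.
ocr_tree() {
  src=${1%/} out=${2%/}
  find "$src" -type f -name '*.pdf' | while IFS= read -r f; do
    rel=${f#"$src"/}                      # path relative to SRC_DIR
    mkdir -p "$out/$(dirname "$rel")"     # mirror the folder layout
    # Redirect stdin so the OCR command cannot swallow the file list.
    ${OCR_CMD:-ocrmypdf} --skip-text "$f" "$out/$rel" </dev/null \
      || printf '%s\n' "$f" >> "$out/failed.txt"
  done
}
```

Called as ocr_tree scans searchable, it leaves the originals alone and lists any files that need a second look in searchable/failed.txt. File names containing newlines are not handled.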

PDF OCR — the primary tool referenced.
OCR PDF — alternate entry point to the same OCR engine.
PDF OCR Extract Text Only — produce a .txt file from a scanned PDF, no PDF rebuild.
Compress PDF — shrink scanned PDFs before or after OCR.
PDF to Text — extract text from an already-OCRed PDF.
Tagged PDF Validator — OCR is the prerequisite for tagging scanned documents.
PDF accessibility — sister article; OCR is the foundation of accessible scanned PDFs.
PDF/A conversion — sister article; archive-quality PDFs typically require OCR first.

Frequently asked questions

What does OCR actually do to a PDF?
OCR (Optical Character Recognition) reads the pixels of each page, identifies which pixel clusters form characters, recognises which character each cluster represents, and writes the recognised text into a new invisible text layer that sits behind the original page image. The visible page looks identical; the underlying PDF now contains a searchable, selectable, screen-reader-accessible text layer. The file becomes searchable in any PDF reader, the text can be copied with Cmd+C / Ctrl+C, and accessibility tools can read the document aloud.

Do I need OCR if the PDF was created from a word processor?
No. PDFs exported from Word, Google Docs, LibreOffice, or any digital authoring tool already contain a text layer: every glyph on the page is stored as text, not just as a pixel shape. The PDF is already searchable. OCR is only needed for PDFs whose pages are images: scanned paper documents, photographs of pages, screenshots of text, faxes converted to PDF. To check, open the PDF and try to select a word with click-drag. If the word highlights, you already have a text layer; if nothing highlights, you need OCR.

How accurate is OCR in 2026?
For modern printed text on a clean 300 DPI scan, accuracy is typically 95–99 percent character-level for Latin-script languages, slightly lower for Cyrillic, lower still for CJK, and lowest for Arabic and Indic scripts. Tesseract 5's LSTM engine matches commercial OCR in most cases for clean inputs. Cloud OCR (Google Document AI, Amazon Textract) edges out Tesseract on degraded scans and handwriting. Accuracy drops sharply with low DPI, low contrast, complex multi-column layouts, decorative fonts, and handwriting. For mission-critical accuracy, OCR is a starting point that needs human review, not a final answer.

Which OCR tool should I pick: offline or cloud?
Offline (Tesseract, ocrmypdf, ScoutMyTool) for: confidential documents that should not leave your machine; bulk processing where per-page cloud cost would add up; one-off small jobs where setup overhead of cloud accounts is not worth it. Cloud (Google Document AI, AWS Textract, Azure Form Recognizer) for: degraded scans where local OCR accuracy is insufficient; structured-data extraction from forms or tables; very large batches where the cloud provider's scaling matters more than the per-page cost. For the privacy-conscious default, offline wins; for the highest accuracy on hard inputs, cloud wins.

How do I OCR a multi-language document?
Tell the engine which languages to expect. Tesseract accepts a -l flag with one or more language codes: -l eng+fra runs both English and French recognition; -l chi_sim+eng runs simplified Chinese + English. ScoutMyTool exposes a language picker in the UI. Without the hint, the engine uses a default (usually English) and produces garbled output for other-language portions. Adding too many languages slows recognition and can hurt accuracy on the dominant language; pick the smallest set that covers your document.

Is my PDF uploaded when I use ScoutMyTool OCR?
No. ScoutMyTool's PDF OCR tool runs Tesseract.js (a WebAssembly port of Tesseract) entirely in your browser tab. The PDF is parsed locally with pdf-lib, each page is rasterised to a canvas in the browser, Tesseract.js performs OCR on the canvas, and the recognised text is written into a new text layer in the output PDF. Nothing transits a server. Verify by watching the browser network tab during the operation: zero outbound requests carrying file bytes. For documents containing PII, medical information, or trade secrets, this is materially different from server-uploading OCR services.

Can OCR recover the original layout, or just the text?
It depends on the engine. Tesseract by default produces a flat text layer that follows reading order but loses table structure and multi-column layout. Tesseract's hOCR / ALTO output formats preserve bounding boxes for each word, which post-processing tools use to reconstruct columns and tables. Cloud engines (Google Document AI, Textract) are designed around structured layouts and return tables as structured JSON. For pure search-the-document workflows, flat text is fine; for "extract this scanned spreadsheet into Excel" workflows, a structured-layout engine is worth the cost.
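On the multi-language answer above: a -l value only works if every language pack it names is installed, so it can help to check before starting a long run. The sketch below assumes a Tesseract build that supports --list-langs; the function name missing_langs and the LIST_LANGS_CMD override are hypothetical.

```shell
# missing_langs "eng+fra"
# Prints each requested language code that is not installed, one per line.
# Prints nothing when every code is available.
# LIST_LANGS_CMD is a hypothetical override for the listing command.
missing_langs() {
  # Capture stderr too, since some builds print the list there.
  installed=$(${LIST_LANGS_CMD:-tesseract --list-langs} 2>&1)
  printf '%s\n' "$1" | tr '+' '\n' | while IFS= read -r code; do
    # -x matches whole lines only, so the header line is never a hit.
    printf '%s\n' "$installed" | grep -qxF -- "$code" \
      || printf '%s\n' "$code"
  done
}
```

missing_langs chi_sim+eng printing chi_sim means the Simplified Chinese data file still needs to be installed before -l chi_sim+eng will run.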
