OCR explained — how scanned PDFs become searchable text

A practical 2026 plain-English explainer on OCR and how to make a scanned PDF searchable.

7 min read

By ScoutMyTool Editorial Team · Last updated: 2026-05-20

After working with hundreds of users on scanned-document workflows, we have found that OCR is one of the most-mentioned and least-understood operations in the PDF universe. The mental model people start with is "OCR finds the text in my scan", which is approximately right but skips the details that explain why some scans turn into clean searchable PDFs and others come out as gibberish. Below is what OCR actually does, the conditions that make it work or fail, and the workflow for adding a searchable text layer to a scanned PDF without uploading the file anywhere.

What OCR actually does, step by step

  1. Page is rasterised. The scanned PDF already contains an image for each page; OCR reads that image into pixel buffers.
  2. Image is pre-processed. De-skew (rotate so text is horizontal), binarise (high-contrast black-on-white), de-noise. Modern engines do all of this automatically; the quality of pre-processing determines the quality of recognition.
  3. Layout is segmented. The image is broken into regions: paragraphs, columns, tables, headings, captions. Multi-column papers need this to be done right or the output reads left-then-right across columns instead of top-to-bottom within each.
  4. Lines and words are isolated. Within each region, the binarised image is split into lines at the blank rows between them, and each line is split into words at the wider horizontal gaps.
  5. Characters are recognised. Each word or line image is fed to the character-recognition model (an LSTM network in Tesseract 5, which reads a whole text line at a time; other deep neural networks in modern commercial engines). The model outputs a sequence of character predictions with per-character confidence scores.
  6. Language model corrects errors. The raw character predictions are passed through a language-specific spell / grammar model that fixes common ambiguities ("l" vs "I" vs "1", "0" vs "O", "rn" vs "m") using dictionary lookups and n-gram probabilities.
  7. Text is written back to the PDF. Each recognised word is placed as an invisible text object at the same coordinates as the original image of the word, so search and copy-paste behave correctly even though the visible page is still the scanned image.
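
Step 4 is the easiest part of the pipeline to show in code. The following is a minimal sketch, not how Tesseract or any particular engine does it: it assumes the page has already been binarised into a nested list where 1 means ink and 0 means background, and it uses simple projection profiles (ink counts per row and per column). The names `ink_runs`, `segment_lines` and `segment_words` are hypothetical, chosen for this illustration.

```python
from typing import List, Tuple

Bitmap = List[List[int]]  # assumed input: 1 = ink, 0 = background


def ink_runs(profile: List[int], min_gap: int = 1) -> List[Tuple[int, int]]:
    """Return half-open (start, end) spans where the profile is non-zero.

    Spans separated by fewer than min_gap empty positions are merged,
    so small gaps between letters do not split a word.
    """
    runs: List[Tuple[int, int]] = []
    start = None
    for i, value in enumerate(profile):
        if value and start is None:
            start = i
        elif not value and start is not None:
            runs.append((start, i))
            start = None
    if start is not None:
        runs.append((start, len(profile)))

    merged: List[Tuple[int, int]] = []
    for run in runs:
        if merged and run[0] - merged[-1][1] < min_gap:
            merged[-1] = (merged[-1][0], run[1])
        else:
            merged.append(run)
    return merged


def segment_lines(page: Bitmap) -> List[Tuple[int, int]]:
    """Split a page into text lines at rows that contain no ink."""
    row_profile = [sum(row) for row in page]
    return ink_runs(row_profile)


def segment_words(page: Bitmap, line: Tuple[int, int],
                  min_gap: int = 2) -> List[Tuple[int, int]]:
    """Split one line into words at column gaps of at least min_gap pixels."""
    top, bottom = line
    width = len(page[0])
    col_profile = [sum(page[y][x] for y in range(top, bottom))
                   for x in range(width)]
    return ink_runs(col_profile, min_gap)


# Tiny hand-made "page": two text lines, the first containing two words.
rows = [
    "0000000000",
    "1101100110",
    "1101100110",
    "0000000000",
    "0111000000",
]
page = [[int(c) for c in r] for r in rows]

lines = segment_lines(page)              # [(1, 3), (4, 5)]
words = segment_words(page, lines[0])    # [(0, 5), (7, 9)]
```

The one-pixel gap inside the first word is merged away while the two-pixel gap separates the words. This also shows why skew matters: on a tilted page the empty rows between lines fill with ink and the row profile no longer has clean gaps to split on.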

Step-by-step: make a scanned PDF searchable

The ScoutMyTool tool lives at scoutmytool.com/pdf/pdf-ocr. It runs client-side via Tesseract WebAssembly: no upload, no signup, no quota.

  1. Drop your scanned PDF. Loads into a sandboxed memory buffer; nothing is uploaded.
  2. Pick (or confirm) the language. Auto-detects from the first page. Override for non-Latin scripts or multi-language documents.
  3. Pick OCR quality. Default "Standard" uses Tesseract 5 LSTM — good balance of speed and accuracy. "Best" enables additional pre-processing passes; ~3× slower, ~1–3 percentage points more accurate. "Fast" trades accuracy for speed; useful for previewing or rough indexing.
  4. Pick output mode. "Searchable PDF" (default) — original image preserved, invisible text layer added. "Text-only" — produces a plain .txt file with no PDF wrapper, for downstream NLP processing. "Side-by-side" — original scan plus a parallel text page, useful for archival.
  5. Click Recognise. Page-by-page progress shown live. Expect 10–30 seconds per page on a modern laptop; large documents are slow but predictable.
  6. Review the confidence report. The tool reports per-page average confidence (90%+ is good; below 80% is suspect). Low-confidence pages are flagged for review.
  7. Download the searchable PDF. Open in any PDF reader and try Ctrl-F to confirm the text layer is working. The visible page should look identical to the input scan.
  8. If recognition is poor on specific pages. Re-scan those pages at higher resolution (300 dpi minimum, 600 dpi for fine print), or photograph with better lighting and a flatter angle. OCR cannot recover detail that was never captured.
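
The confidence review in step 6 can be sketched as follows. This is an illustration under assumptions, not the tool's actual code: it assumes the per-page figure is a plain mean of per-word confidence scores on a 0 to 100 scale, and the "borderline" label for the 80 to 90 band is introduced here because the thresholds above leave that band unnamed. The function names are hypothetical.

```python
from statistics import mean
from typing import Dict, List

GOOD = 90.0     # "90%+ is good"
SUSPECT = 80.0  # "below 80% is suspect"


def page_report(word_confidences: List[float]) -> Dict[str, object]:
    """Summarise one page from its per-word confidence scores (0 to 100)."""
    if not word_confidences:
        # A page with no recognised words deserves a look too.
        return {"average": None, "status": "no text found"}
    average = mean(word_confidences)
    if average >= GOOD:
        status = "good"
    elif average < SUSPECT:
        status = "suspect"
    else:
        status = "borderline"  # label assumed for this sketch
    return {"average": round(average, 1), "status": status}


def pages_to_review(document: List[List[float]]) -> List[int]:
    """Return 1-based page numbers that are suspect or contain no text."""
    flagged = []
    for number, confidences in enumerate(document, start=1):
        if page_report(confidences)["status"] in ("suspect", "no text found"):
            flagged.append(number)
    return flagged


document = [
    [95.0, 92.0, 98.0],   # clean page
    [70.0, 75.0, 85.0],   # poor scan
    [85.0, 85.0],         # in between
    [],                   # blank or unreadable page
]
flagged = pages_to_review(document)  # [2, 4]
```

A page average hides local damage: one unreadable paragraph on an otherwise clean page can still average above 90, so it is worth spot-checking the parts of the document that matter most.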

Why OCR sometimes recognises perfectly and sometimes fails

OCR accuracy is dominated by input quality. The cheapest way to get better recognition is to capture a better image; no amount of post-processing recovers what was not there. Specifically:

  • Resolution. 300 dpi is the floor for clean printed text. 600 dpi helps for small print or fine serifs. Below 200 dpi the per-letter pixel count is too low for any engine to recognise reliably.
  • Contrast. Black text on white background is ideal. Pastel text on white, dark text on dark background, or low-contrast camera photos all hurt.
  • Skew. <1° is fine. 1–5° handled by automatic de-skew. >5° introduces errors.
  • Compression artefacts. Heavy JPEG compression introduces ringing around text edges that the OCR engine misreads.
  • Bleed-through. Scanning a double-sided page with the back showing through hurts both sides. Place a black backing sheet behind the page when scanning.
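
The resolution figures above come down to simple arithmetic: a typographic point is 1/72 inch, so the number of pixels a letter gets is its size in points divided by 72 and multiplied by the scan dpi. The sketch below works that out for lowercase letters. It assumes an x-height of about half the font size, which is a rough typographic rule of thumb rather than a property of any specific font, and the function name is hypothetical.

```python
POINTS_PER_INCH = 72  # one typographic point is 1/72 inch


def glyph_height_px(font_size_pt: float, dpi: float,
                    x_height_ratio: float = 0.5) -> float:
    """Approximate pixel height of a lowercase letter in a scan.

    x_height_ratio is an assumed rule of thumb (x-height is roughly
    half the font size); real fonts vary.
    """
    return font_size_pt / POINTS_PER_INCH * dpi * x_height_ratio


for dpi in (96, 200, 300, 600):
    print(f"{dpi:>3} dpi: {glyph_height_px(10, dpi):.1f} px")
# prints:
#  96 dpi: 6.7 px
# 200 dpi: 13.9 px
# 300 dpi: 20.8 px
# 600 dpi: 41.7 px
```

For 10 pt body text, a 96 dpi scan leaves fewer than 7 pixels for the height of a lowercase letter, which is too few to separate shapes such as "e", "c" and "o". At 300 dpi the same letter gets about 21 pixels, and 600 dpi gives small print (say 6 pt) roughly what 10 pt text gets at 360 dpi.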

The U.S. National Archives publishes scanning guidance specifically for OCR-ready output, recommending 300+ dpi, grayscale or colour rather than bitonal for OCR input, and disabling automatic image enhancement (which often hurts OCR even though it looks better to the human eye) [1].

Frequently asked questions

What does OCR actually do?
OCR (optical character recognition) takes an image of text — a photograph, a scan, a screenshot — and produces the underlying text as machine-readable characters. The image of the letter "A" becomes the byte 0x41 in a text stream that any program can search, copy, and process. For PDFs specifically, OCR adds an invisible text layer over the visible image: you still see the scan, but Ctrl-F now finds words and copy-paste produces real text. The visual appearance of the document does not change.
How accurate is modern OCR?
For clean printed text on a flat scan: 99%+ accuracy with modern engines (Tesseract 5.x, ABBYY FineReader, Microsoft Read API). For typewriter and old print: 97–99%. For low-resolution camera photos of text: 90–97%, depending on angle and lighting. For handwritten text: highly variable — 60–95% for clean printing, lower for cursive. The trend over the last 5 years has been steady improvement driven by deep-learning models replacing the older rule-based OCR engines.
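
To put percentages like these in context, here is one common way to measure OCR accuracy: character accuracy, defined as one minus the edit distance between the OCR output and a hand-checked reference, divided by the reference length. This is a sketch of that general definition; the figures quoted above are not necessarily computed this way, and the function names are hypothetical.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: minimum insertions, deletions and substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete from a
                            curr[j - 1] + 1,      # insert into a
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def character_accuracy(reference: str, ocr_output: str) -> float:
    """Share of reference characters recovered correctly, between 0.0 and 1.0."""
    if not reference:
        raise ValueError("reference text must not be empty")
    errors = edit_distance(reference, ocr_output)
    return max(0.0, 1 - errors / len(reference))


# The classic "rn" read as "m" confusion costs two edits in a six-letter word.
accuracy = character_accuracy("modern", "modem")  # 1 - 2/6, about 0.667
```

The same arithmetic shows why 99% is less comfortable than it sounds: a page of 2,000 characters at 99% character accuracy still carries about 20 wrong characters, and at 95% about 100.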
Why does some text in my scan come out as gibberish?
Five common causes. (a) Image resolution too low: OCR needs roughly 300 dpi for printed text; scans at 96 dpi, or phone photos in which the text occupies too few pixels, often produce gibberish. (b) Skew or rotation: text rotated more than ~5° degrades recognition; the tool de-skews automatically but extreme angles fail. (c) Bad contrast: faint text on a coloured background loses the binary edges OCR relies on. (d) Wrong language: OCR engines need to know which language to recognise; English-only recognition on a French document fails. (e) Unusual fonts or stylised lettering: handwriting and display fonts that the engine was rarely trained on seldom recognise well.
How does ScoutMyTool's OCR work? Is it run locally or in the cloud?
Browser-side. The tool uses a WebAssembly build of Tesseract 5 that runs entirely in your browser tab. The scanned PDF is loaded into a sandboxed memory buffer, each page is rasterised, the Tesseract WASM module recognises text per page, and the recognised text is written back as an invisible text layer over the original image. Nothing is uploaded. Verify in DevTools Network — zero outbound requests during OCR. The cost: OCR is slow (10–30 seconds per page on a modern laptop) but the privacy guarantee is real.
Which languages does the tool support?
The full Tesseract 5 language set — over 100 languages, including all European languages, Arabic, Hebrew, CJK (Chinese, Japanese, Korean), Indic scripts, and several historical scripts (Fraktur, old Greek). The tool auto-detects the language on the first page; you can override per-document. For mixed-language documents (e.g. English with French quotations), pick "Multi-language" which loads multiple language data files and chooses per region.
What about handwriting and signatures?
Handwriting recognition (HWR / ICR) is a different problem from print OCR and typically uses different models. Clean block-letter handwriting (forms with one letter per box) recognises 80–95% with the right HWR model — the tool offers this as a separate "Form HWR" mode. Cursive handwriting is much harder and the tool does not attempt it reliably; for cursive documents, the result is best treated as a "starting transcript that needs human review". Signatures are not recognised as text at all — they remain image marks.
Will OCR change how my PDF looks?
No, only what is searchable changes. OCR adds an invisible text layer that sits at the same coordinates as the visible scan; the reader displays only the scan (the image), but Ctrl-F searches the invisible text layer and finds matches. Copy-paste from a selected region returns the recognised text rather than the image bytes. The output PDF is larger than the input by the size of the text layer (typically a 5–20% size increase for a clean printed page) but visually indistinguishable.
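
For readers curious what an "invisible text layer" looks like inside the file, the sketch below builds the PDF content-stream operators for one recognised word. PDF defines text rendering mode 3 (`3 Tr`) as "neither fill nor stroke", which is what makes text searchable but invisible. Everything else here is an assumption for illustration: the function names are hypothetical, the font resource `F1` is assumed to be declared in the page's resources, and the word box is assumed to come from the OCR engine in pixels with the origin at the top-left of the page image.

```python
def escape_pdf_string(text: str) -> str:
    """Escape the characters that are special inside a PDF literal string."""
    return text.replace("\\", "\\\\").replace("(", "\\(").replace(")", "\\)")


def invisible_word_ops(word: str, x_px: float, y_bottom_px: float,
                       height_px: float, page_height_px: float,
                       dpi: float, font_name: str = "F1") -> str:
    """Return content-stream operators placing `word` invisibly over its image.

    Image coordinates start at the top-left in pixels; PDF coordinates
    start at the bottom-left in points (1/72 inch), so y is flipped.
    """
    scale = 72 / dpi                               # pixels to points
    x_pt = x_px * scale
    y_pt = (page_height_px - y_bottom_px) * scale  # flip the vertical axis
    size_pt = height_px * scale
    return (f"BT 3 Tr /{font_name} {size_pt:.2f} Tf "
            f"{x_pt:.2f} {y_pt:.2f} Td ({escape_pdf_string(word)}) Tj ET")


# A word 50 px tall whose box starts 300 px from the left and ends
# 450 px from the top, on a US Letter page scanned at 300 dpi (3300 px tall).
ops = invisible_word_ops("Invoice", 300, 450, 50, 3300, 300)
# ops == "BT 3 Tr /F1 12.00 Tf 72.00 684.00 Td (Invoice) Tj ET"
```

The sketch leaves out two things a real writer has to handle: stretching the text horizontally so the invisible word spans the same width as its image (otherwise selection highlights drift), and encoding non-Latin text, which a simple literal string like this cannot carry.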

OCR your scanned PDF now — free, no signup, no upload

Tesseract 5 LSTM, 100+ languages, invisible text layer over the original scan. Runs entirely in your browser.

Open the PDF OCR tool at scoutmytool.com/pdf/pdf-ocr →