How to OCR a scanned PDF and make it searchable

A practical 2026 walkthrough for turning scanned PDFs into searchable PDFs, editable Word docs, or plain text.

By ScoutMyTool Editorial Team · Last updated: 2026-05-18

Introduction

Two weeks ago I needed to find one specific clause inside a 78-page commercial lease that a property manager had emailed me as a phone-camera scan. Cmd-F did nothing because the entire PDF was just pictures of pages. Then I tried three of the well-known online OCR sites: one wanted a signup after two pages, one silently capped my file at 5 MB, and one finished but spat out a Word document with every paragraph wrapped in its own text box. Below is the workflow that finally worked, an honest comparison of the free tools, and the things I now check before I waste another 20 minutes on a PDF that did not need OCR in the first place.

What OCR actually does to a PDF

A "PDF" is just a container. Inside that container, each page is either a sequence of text-drawing instructions (a native PDF, the kind your browser produces when you Print to PDF) or a flat picture of a page (a scanned PDF, the kind a flatbed scanner or phone-camera scanner produces). Cmd-F, copy-paste, and screen readers only work on the first kind. OCR (optical character recognition) is the process of looking at each pixel-picture page, identifying the shapes of letters and digits, and emitting the recognised characters as actual text.

The ScoutMyTool OCR pipeline uses OCRmyPDF orchestrating Tesseract, the open-source OCR engine originally developed at HP, open-sourced in 2005, and developed under Google's sponsorship from 2006 [1]. Pages are rendered at 300 DPI (the resolution the U.S. Library of Congress recommends as the OCR-grade minimum for printed material [2]), recognised line by line by Tesseract's LSTM neural network, and then either laid back onto the original page as an invisible text layer (searchable PDF), converted to flowed paragraphs by pdf2docx (Word), or returned as raw recognised words (plain text).
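As a rough illustration of the searchable-PDF and plain-text paths described above, the sketch below builds an OCRmyPDF command line for each. The helper names `build_ocr_command` and `run_ocr`, the file names, and the particular set of flags are assumptions for illustration; the exact invocation ScoutMyTool uses is not published here. The flags themselves (`--deskew`, `--rotate-pages`, `--language`, `--sidecar`) are standard OCRmyPDF options.

```python
import shlex
import subprocess
from pathlib import Path


def build_ocr_command(src: str, fmt: str = "pdf", lang: str = "eng") -> list[str]:
    """Return an ocrmypdf argv list for a searchable-PDF or plain-text run.

    Hypothetical helper: the flags are real OCRmyPDF options, but which
    ones ScoutMyTool actually passes is an assumption.
    """
    if fmt not in ("pdf", "txt"):
        raise ValueError("fmt must be 'pdf' or 'txt'")
    stem = Path(src).with_suffix("")
    cmd = ["ocrmypdf", "--deskew", "--rotate-pages", "--language", lang]
    if fmt == "txt":
        # --sidecar writes the recognised text to a separate .txt file
        cmd += ["--sidecar", f"{stem}-ocr.txt"]
    # ocrmypdf always takes an input PDF and an output PDF
    cmd += [src, f"{stem}-ocr.pdf"]
    return cmd


def run_ocr(src: str, fmt: str = "pdf", lang: str = "eng") -> None:
    """Run the command; needs ocrmypdf and Tesseract installed locally."""
    cmd = build_ocr_command(src, fmt, lang)
    print("running:", shlex.join(cmd))
    subprocess.run(cmd, check=True)


print(build_ocr_command("lease.pdf", "txt"))
```

The Word path is left out on purpose: it goes through pdf2docx rather than OCRmyPDF, and how the two are chained in this pipeline is not specified.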

Step-by-step: OCR a scanned PDF in your browser

The ScoutMyTool OCR tool runs server-side (the native Tesseract build it uses is a 30 MB+ binary that depends on system libraries, so it is not run inside the browser sandbox), but everything you see and configure lives in a single page with no signup screen, no quota counter, and no upload wizard.

  1. Open the tool. Go to scoutmytool.com/pdf/pdf-ocr. Static HTML, no account screen, no cookie banner gate. The dropzone is visible within a second.
  2. Drop your scanned PDF. One file at a time, up to 50 MB. As soon as the file lands, an inline analysis bar shows you the page count, file size, expected processing time, and, if the heuristic detects an image-only PDF, a "Looks scanned" warning so you know OCR will run rather than no-op. If the analysis says the document already has text, you can skip OCR entirely and use PDF to Word or PDF to Text instead.
  3. Pick the output format. Three radio options:
    • Searchable PDF: same look as the original, invisible text layer added. Best for contracts, forms, anything you will print or share visually.
    • Word (.docx): fully editable Word document. Best when you need to modify the text or paste into another document.
    • Plain text (.txt): just the recognised words, page-by-page. Best for grepping, indexing, or feeding into a script.
  4. For plain text, choose a language. English, French, Spanish, or German. The language dropdown only appears for the plain-text output; the searchable-PDF and DOCX paths default to English. Pick the dominant language of your document; mixed-language pages work but accuracy on the non-primary language drops.
  5. Click "Run OCR". The progress bar uses the page count and DPI from the pre-upload analysis to draw a realistic curve, not a fake "almost done" stall at 90%. If something looks wrong mid-run (you uploaded the wrong file, the file is larger than you thought, you just want to stop), the cancel button aborts the request cleanly without leaving server-side residue.
  6. Download starts automatically. The output filename is <your-pdf-name>-ocr.<pdf|docx|txt>. A summary panel underneath shows the source size, output size, engine version, and page count, which is useful when you are processing dozens of files and want a paper trail of what got OCR-processed when.
  7. Verify a sample before trusting the whole document. Open the output, search for a word you know is on a specific page, and check the character-level fidelity of two or three sentences. OCR errors cluster: if page 3 looks perfect, the rest probably is too; if page 3 has "rn" where you expect "m", expect the same substitution throughout.
  8. If the source is too big or password-protected. For files over 50 MB, run Compress PDF first (3–5× shrink on image-heavy scans, negligible OCR accuracy loss at default quality) or split with Split PDF into <50 MB chunks and merge the OCR-processed outputs back with Merge PDF. For encrypted PDFs, unlock first via Unlock PDF; the OCR tool refuses to silently strip encryption.
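The decisions in the steps above (skip OCR, unlock, shrink, or run) and the output naming rule can be sketched in a few lines. This is only an illustration: the function names `preflight` and `output_name` are hypothetical, the order of the checks is a guess, and treating "50 MB" as 50 × 1024 × 1024 bytes is an assumption.

```python
from pathlib import Path

# 50 MB cap from the walkthrough; binary megabytes are an assumption.
MAX_BYTES = 50 * 1024 * 1024


def output_name(src_name: str, fmt: str) -> str:
    """Build '<your-pdf-name>-ocr.<pdf|docx|txt>' from the uploaded name."""
    if fmt not in ("pdf", "docx", "txt"):
        raise ValueError(f"unsupported format: {fmt}")
    return f"{Path(src_name).stem}-ocr.{fmt}"


def preflight(size_bytes: int, encrypted: bool, has_text_layer: bool) -> str:
    """Return the next action suggested by the walkthrough's steps."""
    if encrypted:
        return "unlock first"
    if has_text_layer:
        return "skip OCR; use PDF to Word or PDF to Text"
    if size_bytes > MAX_BYTES:
        return "compress or split first"
    return "run OCR"


print(output_name("lease.pdf", "docx"))
print(preflight(78 * 1024 * 1024, encrypted=False, has_text_layer=False))
```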

How ScoutMyTool compares to Smallpdf, iLovePDF and PDF2Go

All four offer OCR on the free tier. The meaningful differences are quota, output flexibility (one tool vs. three separate tools), per-file size limit, and whether the engine is disclosed (it matters: open-source Tesseract is independently auditable; closed proprietary OCR is a black box you have to trust on faith).

| Feature | ScoutMyTool | Smallpdf | iLovePDF | PDF2Go |
|---|---|---|---|---|
| Free OCR (no quota) | Yes | 2 per day, then paywall | 1 file per task on free tier | Yes, up to 100 MB |
| No signup required | Yes | Required after 2 tasks | Required for files >50 MB | Yes |
| Per-file size limit | 50 MB | 5 GB Pro / 100 MB free | 200 MB free | 100 MB free |
| Searchable PDF output | Yes | Yes | Yes | Yes |
| Word (.docx) output | Yes | Yes (separate "PDF to Word") | Yes (separate tool) | Yes (separate tool) |
| Plain text (.txt) output | Yes (4 languages) | No direct .txt export | No direct .txt export | Yes |
| OCR engine disclosed | OCRmyPDF + Tesseract | Proprietary (undisclosed) | Proprietary (undisclosed) | Proprietary (undisclosed) |
| Auto-deskew + auto-rotate | Yes | Yes | Yes | Yes |
| Files deleted after processing | Yes (immediate) | Yes (1 hour) | Yes (2 hours) | Yes (24 hours) |

Third-party tool quotas, size caps, and retention windows are taken from each vendor's public pricing and privacy pages as of May 2026 and may change.

Where Smallpdf and iLovePDF treat searchable-PDF, PDF-to-Word, and PDF-to-text as three separate tools you visit three separate URLs to use, the ScoutMyTool PDF OCR tool unifies them: drop once, choose the output, download. If you do not know up-front which format you want, you do not have to guess at the URL.

Three things people get wrong about PDF OCR

  • Running OCR on a PDF that already has a text layer. Native PDFs (Word → Save As PDF, browser → Print to PDF) already have selectable text. Running OCR on them wastes a couple of minutes and, in the worst case, produces a second text layer that conflicts with the first and breaks copy-paste. Always check whether your PDF needs OCR first: if you can select a word with the cursor, it does not.
  • Expecting OCR to fix bad scans. The Tesseract LSTM is excellent on clean 300+ DPI input. It cannot reliably read a glare-streaked phone-camera photo of a contract sitting on a glossy desk. If the source image is bad, the answer is to rescan or rephotograph (flat, well-lit, no glare), not to throw more OCR passes at the bad image.
  • Treating OCR output as proofread. 95% word-level accuracy on a 5,000-word document is still 250 wrong words. Mission-critical text (legal clauses, dollar amounts, dates, names) always needs a human pass. OCR is a great accelerator for "make this searchable" and "give me the gist"; it is not a substitute for proofreading.
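The arithmetic behind the last point is worth making explicit: the expected number of errors is simply the unit count times (1 − accuracy), where a "unit" is whatever the accuracy figure was measured on (words or characters). A minimal sketch follows; the helper name `expected_errors` and the figure of roughly six characters per word (including the trailing space) are illustrative assumptions.

```python
def expected_errors(units: int, accuracy: float) -> int:
    """Expected number of misrecognised units (words or characters)."""
    if not 0.0 <= accuracy <= 1.0:
        raise ValueError("accuracy must be between 0 and 1")
    return round(units * (1.0 - accuracy))


words = 5_000
print(expected_errors(words, 0.95))   # 250 units wrong at 95% accuracy

# The same document counted in characters, assuming ~6 characters per word.
chars = words * 6
print(expected_errors(chars, 0.99))   # 300 wrong characters even at 99%
```

Even a high character-level accuracy leaves hundreds of errors in a long document, which is why the human pass matters for amounts, dates, and names.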

Related PDF tools on ScoutMyTool

  • PDF OCR: the tool this guide is about (scanned PDF → searchable PDF, Word, or plain text).
  • PDF to Word: for PDFs that already have a text layer (no OCR needed).
  • PDF to Excel: pulls tables out of native PDFs into editable spreadsheets.
  • Compress PDF: shrink a 100 MB scan to fit under the 50 MB OCR cap.
  • Merge PDF: recombine OCR-processed chunks of a split-large document.
  • Unlock PDF: required first step if your scan is password-protected.
  • Sign PDF: for after you have OCR-processed and want to add a signature.

Frequently asked questions

How do I know my PDF actually needs OCR?
Open the PDF in any viewer and try to select a word with your cursor. If the highlight grabs the text, the PDF already has a text layer and OCR is wasted work; use PDF to Text or PDF to Word instead. If the highlight draws a rectangle over what looks like an image, the page is a scan and you need OCR to recover the words. Two other giveaways: a file size much larger than the page count suggests (scans store pixels, not characters), and an empty "Fonts" list in Acrobat's "Document Properties" panel.
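The same check can be automated. The sketch below is not ScoutMyTool's "Looks scanned" heuristic, which is not documented here; it is one plausible version that flags a file when most pages yield almost no extractable text. The function names, the 20-character threshold, and the majority rule are all assumptions. Text extraction uses the third-party pypdf library, imported lazily so the decision logic can be tried without it.

```python
def looks_scanned(page_texts: list[str], min_chars: int = 20) -> bool:
    """True if more than half the pages have little or no extractable text."""
    if not page_texts:
        return False
    near_empty = sum(1 for text in page_texts if len(text.strip()) < min_chars)
    return near_empty / len(page_texts) > 0.5


def extract_page_texts(path: str) -> list[str]:
    """Read each page's text layer with pypdf (pip install pypdf)."""
    from pypdf import PdfReader  # third-party; imported only when needed

    reader = PdfReader(path)
    return [page.extract_text() or "" for page in reader.pages]


# Three pages, two of them image-only: treated as a scan.
print(looks_scanned(["", "  ", "This page has a real, selectable text layer."]))
```

To try it on a real file, pass the result of `extract_page_texts("lease.pdf")` (a hypothetical file name) to `looks_scanned`.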
What is the difference between "searchable PDF", DOCX, and plain text output?
Searchable PDF keeps the original page exactly as it looks (the scanned image stays put) but adds an invisible, selectable text layer on top; it is best when you need to preserve the visual layout (contracts, forms, anything you will print). DOCX gives you an editable Word document where the OCR output is flowed into paragraphs you can modify, at the cost of losing some of the original formatting fidelity. Plain text gives you just the recognised words, page by page, in a .txt file; it is best when you want to grep, index, or feed text into another script.
How accurate is the OCR?
90–99% on clean, high-DPI printed scans (a freshly scanned office document, a flatbed scan of a printed book). Accuracy drops with low-light phone photos, handwriting, skewed pages, very small fonts (below ~6pt), unusual scripts, and heavily decorated or watermarked backgrounds. Always proofread mission-critical text: OCR errors are typically character-level substitutions ("0" vs "O", "rn" vs "m") that a spell-checker will not flag.
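One way to put a number on a spot check is the character error rate: retype a few sentences by hand, then divide the edit distance between your version and the OCR output by the length of your version. The sketch below is a generic implementation of that measure, not something the tool reports; the function names are hypothetical.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance using a two-row dynamic programme."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete a character from a
                            curr[j - 1] + 1,      # insert a character from b
                            prev[j - 1] + cost))  # substitute, or match for free
        prev = curr
    return prev[-1]


def char_error_rate(reference: str, ocr_text: str) -> float:
    """Edit distance divided by the length of the hand-typed reference."""
    if not reference:
        raise ValueError("reference must be non-empty")
    return edit_distance(reference, ocr_text) / len(reference)


# "m" misread as "rn" costs two edits: one substitution plus one insertion.
print(edit_distance("modern", "rnodern"))
# A single "0" vs "O" swap in a ten-character date fragment.
print(char_error_rate("10 October", "1O October"))
```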
How long does processing take?
Up to about 10 seconds per page on a typical scanned contract. A 5-page lease finishes in under a minute; a 50-page deposition takes 2–3 minutes. The pre-upload analysis above the dropzone gives a tighter estimate based on your file's page count and DPI. You can cancel at any time with the inline cancel button: long-running OCR is the most cancel-worthy operation on the site, so we made it a one-click action.
What languages does it support?
English, French, Spanish, and German for plain-text output (selectable in the language dropdown). Searchable PDF and DOCX modes default to English. Multi-language pages (e.g. an English contract with a French annex) work in mixed mode but accuracy on the non-primary language drops, so it is best to split the document and run each language separately if precision matters.
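If you run Tesseract yourself, the four languages above correspond to the traineddata codes `eng`, `fra`, `spa`, and `deu`, and Tesseract accepts several languages joined with `+` (for example `-l eng+fra`). The sketch below maps names to that form. Whether the site's mixed mode is implemented this way is an assumption, and `tesseract_lang` is a hypothetical helper.

```python
# Tesseract traineddata codes for the four languages listed above.
LANG_CODES = {"English": "eng", "French": "fra", "Spanish": "spa", "German": "deu"}


def tesseract_lang(languages: list[str]) -> str:
    """Join language names into Tesseract's 'eng+fra' form, primary first."""
    if not languages:
        raise ValueError("at least one language is required")
    try:
        return "+".join(LANG_CODES[name] for name in languages)
    except KeyError as exc:
        raise ValueError(f"unsupported language: {exc.args[0]}") from None


# English contract with a French annex.
print(tesseract_lang(["English", "French"]))
```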
Is my file uploaded to your servers?
Yes. OCR has to run server-side because the native Tesseract build this tool uses does not run in the browser (it is a 30 MB+ binary that depends on system libraries). Your PDF is uploaded over HTTPS, processed in a per-request temp directory, the output is streamed back to your browser, and the temp files are deleted immediately after the response is sent. We do not archive uploads, do not use them to train any model, and do not share them. The deletion is structural: there is no "keep" code path that would retain the file even if we wanted one.
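A per-request temp directory with structural deletion is a common pattern, and Python's standard library expresses it directly. The sketch below is not the site's server code; `process_upload` and the `worker` callback are hypothetical names, and the point is only that cleanup happens when the `with` block exits, whether the OCR step succeeds or raises.

```python
import tempfile
from pathlib import Path
from typing import Callable


def process_upload(data: bytes, worker: Callable[[Path, Path], None]) -> bytes:
    """Run worker(src, dst) inside a temp directory that exists only for this call.

    The directory and everything in it is removed when the with block
    exits, so there is no code path that keeps the upload around.
    """
    with tempfile.TemporaryDirectory(prefix="ocr-") as tmp:
        src = Path(tmp) / "input.pdf"
        dst = Path(tmp) / "output.pdf"
        src.write_bytes(data)
        worker(src, dst)          # e.g. a call into the OCR engine
        return dst.read_bytes()   # read the result before cleanup runs


def copy_worker(src: Path, dst: Path) -> None:
    """Stand-in for the OCR step: copies the input to the output."""
    dst.write_bytes(src.read_bytes())


print(len(process_upload(b"%PDF-1.7 example bytes", copy_worker)))
```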
What is the file size limit and what if my PDF is bigger?
50 MB per file. If your scan exceeds that, the cheapest fix is to run it through Compress PDF first (image-heavy scans compress 3–5× with negligible OCR quality loss at the default quality setting) and then OCR the compressed copy. If you need OCR on a genuinely large document, split it with Split PDF into chunks under 50 MB, OCR each chunk, then merge the OCR-output searchable PDFs back together with Merge PDF.
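Choosing where to split can be planned up front with a simple greedy pass: keep adding consecutive pages to a chunk until the next page would push it over the limit. The sketch below assumes you have a rough per-page size in megabytes; real PDFs carry shared resources and overhead, so leave some headroom below the cap. The function name `plan_chunks` is hypothetical.

```python
def plan_chunks(page_sizes: list[int], limit: int) -> list[tuple[int, int]]:
    """Group consecutive pages into (first, last) ranges within the limit.

    Page numbers are 1-based and inclusive; sizes and limit share one unit.
    """
    chunks = []
    start, total = 1, 0
    for page_no, size in enumerate(page_sizes, start=1):
        if size > limit:
            raise ValueError(f"page {page_no} alone exceeds the limit")
        if total + size > limit:
            # Close the current chunk before this page and start a new one.
            chunks.append((start, page_no - 1))
            start, total = page_no, 0
        total += size
    if page_sizes:
        chunks.append((start, len(page_sizes)))
    return chunks


# Five pages of about 20 MB each, with 45 MB of usable room per chunk.
print(plan_chunks([20, 20, 20, 20, 20], limit=45))
```

Each resulting range can then be split out, run through OCR, and merged back in order.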

Ready to OCR your PDF?

No signup, no daily quota, three output formats (searchable PDF, Word, plain text), files deleted immediately after processing. Open-source Tesseract under the hood, not a black-box proprietary engine.

Open the free PDF OCR tool at scoutmytool.com/pdf/pdf-ocr →

References

  1. Smith, R., An Overview of the Tesseract OCR Engine, Google Inc., Proceedings of the Ninth International Conference on Document Analysis and Recognition (ICDAR 2007). Canonical reference for the Tesseract engine architecture and origin (developed at HP from 1985 → open-sourced 2005 → Google-sponsored development from 2006 → LSTM engine in version 4.0, 2018). Available via the Tesseract project documentation: tesseract-ocr.github.io.
  2. U.S. Library of Congress, Technical Guidelines for Digitizing Cultural Heritage Materials (Federal Agencies Digital Guidelines Initiative, FADGI). Recommends 300–400 DPI as the minimum capture resolution for printed text destined for OCR. Public reference: digitizationguidelines.gov/guidelines/digitize-technical.html.