How to extract text and formatting from scanned documents

Turn a scanned document into editable, formatted text: how OCR works, what formatting survives, why to verify, and how to reformat the result into Word or a searchable PDF.



By ScoutMyTool Editorial Team · Last updated: 2026-05-22

Introduction

Someone handed me a box of scanned contracts and asked me to "make them searchable and pull the key terms into a document." A scan, though, is just a picture of a page: there is no text in it until you run OCR, and OCR, for all its usefulness, makes exactly the mistakes that bite hardest in a contract. This guide is about doing it properly: how OCR turns a scanned document into recoverable text, what formatting survives, why verification is non-negotiable, and how to reformat the result into a clean searchable PDF or an editable Word document depending on whether you need to preserve the look or edit the content.

Pick the output for your goal

Goal | Output | Note
Make a scan searchable | Searchable PDF (text layer) | Keeps the original look; adds hidden text
Get editable text | Word / plain text | Editable; layout approximate
Recover a table | CSV / spreadsheet | Verify every number
Preserve formatting | Formatted Word | Simple layouts survive; complex ones shift
Bulk archive | Searchable PDF/A | For long-term, findable storage

Step by step: OCR and reformat a scan

  1. Get the cleanest image you can. High resolution (~300 DPI), straight, good contrast. A clean source is the single biggest factor in OCR accuracy.
  2. OCR with the right language. Run PDF OCR with the correct language pack to recover the text (see best free OCR).
  3. Decide: searchable PDF or editable text. Keep a searchable PDF to preserve the look; convert to Word if you need to edit.
  4. Reformat to Word if editing. Convert with PDF to Word (see PDF to formatted Word) and tidy the layout: simple pages survive, complex ones shift.
  5. Extract tables to a spreadsheet. For tabular data, send it to PDF to CSV (see PDF to spreadsheet) and verify every figure.
  6. Proofread against the original. Check numbers, names, and anything exact. OCR output is a draft until verified.
  7. Archive as searchable PDF/A if storing. For long-term, findable records, keep a searchable archival copy alongside any edited version.
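
As a rough illustration of steps 2 and 3, here is a minimal sketch that assumes the open-source Tesseract command-line engine is installed. That is an assumption for illustration only: it is not the in-browser tool this guide describes, and the file names and the helper `build_tesseract_command` are hypothetical. The function only assembles the command; it does not run it.

```python
import shlex


def build_tesseract_command(image_path, output_base, language="eng", dpi=300,
                            outputs=("pdf", "txt")):
    """Return the argv list for one Tesseract run (nothing is executed here)."""
    if not language:
        raise ValueError("choose the language pack that matches the document")
    command = ["tesseract", image_path, output_base,
               "-l", language,      # language pack, e.g. "eng", "fra", "deu"
               "--dpi", str(dpi)]   # resolution hint for the input image
    # Trailing config names select the outputs: "pdf" writes a searchable PDF
    # (scan image plus hidden text layer), "txt" writes plain extracted text.
    command.extend(outputs)
    return command


# Hypothetical file names, for illustration only.
command = build_tesseract_command("contract-page1.png", "contract-page1")
print(shlex.join(command))
```

With Tesseract installed, the list could be passed to `subprocess.run(command, check=True)`. Asking for both outputs in one run gives the searchable PDF for the archive and the plain text for editing, and both still need proofreading against the original.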

FAQ

What does OCR actually do to a scanned document?
OCR (optical character recognition) looks at the image of a page and recognises the shapes as characters, producing actual text from what was just a picture. A scanned PDF before OCR is an image โ€” you cannot select, search, or edit it โ€” and after OCR you have recoverable text. There are two common outputs: a "searchable PDF" that keeps the original scanned image but adds an invisible text layer underneath (so it looks identical but is now searchable and copyable), and an extracted-text output (Word or plain text) you can edit freely. Which you want depends on whether you need to preserve the original look or to edit the content.
How much of the original formatting survives?
It varies with the source and the output you choose. A searchable PDF preserves the original appearance exactly, because it keeps the scanned image and just adds hidden text. Converting to editable Word recovers the text and basic formatting (paragraphs, bold, headings), but layout is approximate, and complex multi-column or heavily designed pages shift, because OCR reconstructs structure by inference. Simple, single-column documents reformat cleanly; intricate ones need cleanup. So if exact look matters, keep the searchable-PDF route; if you need to edit, accept that a complex layout will require tidying after conversion.
Why do I have to verify OCR output so carefully?
Because OCR makes mistakes, and it makes them exactly where they hurt. It most often misreads numbers (a smudged 3 as an 8), characters with diacritics, unusual fonts, poor-quality or skewed scans, and dense reference lists or tables. For prose, a few errors are obvious and harmless; for a financial table, a contract figure, or a name, a single misread character is a serious error that looks perfectly plausible. So OCR output is a draft, not a finished document: proofread it against the original, paying special attention to numbers, names, and anything where exactness matters, before you rely on it. Treat it as "recognise, then verify."
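
One way to make "recognise, then verify" concrete is to build a checklist of every number-bearing token in the OCR output, so that each one is compared with the scan by eye. The sketch below is only an aid under that assumption: the helper name `tokens_to_verify` and the sample text are hypothetical, and it does not catch misread names or words.

```python
import re

# Any whitespace-delimited token that contains at least one digit.
NUMBER_LIKE = re.compile(r"\S*\d\S*")


def tokens_to_verify(ocr_text):
    """Return (line_number, token) pairs for every token containing a digit."""
    flagged = []
    for line_number, line in enumerate(ocr_text.splitlines(), start=1):
        for match in NUMBER_LIKE.finditer(line):
            # Trim surrounding punctuation so "2021." is listed as "2021".
            flagged.append((line_number, match.group().strip(".,;:()")))
    return flagged


# Hypothetical OCR output, for illustration only.
draft = "This Agreement is dated 3 March 2021.\nThe fee is $18,500.00 payable within 30 days."
for line_number, token in tokens_to_verify(draft):
    print(f"line {line_number}: check {token!r} against the scan")
```

The list tells you where to look, not whether the value is right; the comparison with the original page is still a manual step.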
How do I get the best OCR accuracy?
Start with the best possible image. A clean, high-resolution scan (around 300 DPI), straight (deskewed), with good contrast and no shadows, OCRs far better than a crooked, low-resolution phone photo. If you are scanning, use a document-scan mode that deskews and sharpens. Choose the correct language pack for the document, since OCR tuned for the wrong language misreads accented characters. For an already-poor scan, you can improve contrast before OCR. No amount of post-processing fully fixes a bad source, so the highest-leverage step is capturing or obtaining a clean image in the first place.
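
To check whether an existing image is likely to reach the roughly 300 DPI mentioned above, you can divide its pixel dimensions by the physical page size. This is a minimal sketch; the function names are hypothetical, and the default page size (US Letter, 8.5 x 11 inches) is an assumption you should replace with the real paper size.

```python
def effective_dpi(pixel_width, pixel_height, page_width_in, page_height_in):
    """Return the lower of the horizontal and vertical pixels-per-inch values."""
    return min(pixel_width / page_width_in, pixel_height / page_height_in)


def is_sharp_enough(pixel_width, pixel_height,
                    page_width_in=8.5, page_height_in=11.0, target_dpi=300):
    """True if the image reaches the target resolution for the given page size."""
    dpi = effective_dpi(pixel_width, pixel_height, page_width_in, page_height_in)
    return dpi >= target_dpi


# A 1275 x 1650 pixel image of a US Letter page is only 150 DPI.
print(effective_dpi(1275, 1650, 8.5, 11.0))   # 150.0
print(is_sharp_enough(1275, 1650))            # False
print(is_sharp_enough(2550, 3300))            # True
```

Resolution is only one factor: a 300 DPI image that is skewed, shadowed, or low in contrast can still OCR poorly.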
Can OCR recover tables and reformat them?
It can extract the text in a table, but reconstructing the table structure is harder and less reliable than recognising body text: cells can merge or shift, and numbers are exactly what OCR misreads most. For a scanned table you need as data, OCR it, then extract to a spreadsheet and verify every figure against the original, especially totals. Do not trust an OCR'd financial or scientific table without reconciling it. For a table you only need to read, a searchable PDF is fine; for one you need to compute on, budget real verification time, because the cost of a silently wrong number is high.
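
Reconciling totals can be partly automated once the table is in CSV form. The sketch below assumes a simple two-column layout with a row labelled "Total"; the column names, the sample data, and the helper `reconcile_total` are all hypothetical. In the sample, an 83.50 in the original has been misread as 88.50, so the line items no longer add up to the stated total.

```python
import csv
import io
from decimal import Decimal


def reconcile_total(csv_text, label_column="Item", amount_column="Amount",
                    total_label="Total"):
    """Compare the sum of the line items with the stated total row."""
    line_sum = Decimal("0")
    stated_total = None
    for row in csv.DictReader(io.StringIO(csv_text)):
        # Remove thousands separators before parsing the amount.
        amount = Decimal(row[amount_column].replace(",", ""))
        if row[label_column].strip().lower() == total_label.lower():
            stated_total = amount
        else:
            line_sum += amount
    if stated_total is None:
        raise ValueError("no total row found")
    return line_sum, stated_total, line_sum == stated_total


# Hypothetical OCR-extracted table, for illustration only.
extracted = """Item,Amount
Licence fee,"1,200.00"
Support,350.00
Hosting,88.50
Total,"1,633.50"
"""

line_sum, stated_total, matches = reconcile_total(extracted)
print(f"sum of items: {line_sum}, stated total: {stated_total}, match: {matches}")
```

A mismatch proves something was misread, but a match does not prove every figure is right, since two errors can cancel out or the total itself can be wrong. The check narrows the search; it does not replace comparing each figure with the scan.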
Should I keep a searchable PDF or convert to Word?
Keep a searchable PDF when you want the document to look exactly like the original but be findable and copyable. This is ideal for archiving scanned records, contracts, and references, where you are not editing but want to search. Convert to Word when you actually need to edit the content: to revise the text, repurpose it, or restructure it. Many workflows use both: a searchable PDF/A for the archive and a Word extraction for the working copy. The searchable PDF is the safer default for records because it preserves the original; reach for Word conversion when editing is the goal.
Is it safe to OCR a confidential scan online?
Scanned documents are often exactly the sensitive records (contracts, IDs, medical or financial papers) you would not want uploaded, so prefer a tool that processes files locally. ScoutMyTool runs OCR and conversion entirely in your browser tab, so the scan never leaves your machine. Many online OCR services upload your file to a server. For anything confidential, confirm the tool does not upload before using it.


Recover the text, keep it private

OCR a scan and reformat it into searchable PDF or editable Word with ScoutMyTool's in-browser tools: your scanned documents never leave your machine.
