How to extract text and formatting from scanned documents
By ScoutMyTool Editorial Team · Last updated: 2026-05-22
Introduction
Someone handed me a box of scanned contracts and asked me to "make them searchable and pull the key terms into a document." A scan, though, is just a picture of a page: there is no text in it until you run OCR, and OCR, for all its usefulness, makes exactly the mistakes that bite hardest in a contract. This guide is about doing it properly: how OCR turns a scanned document into recoverable text, what formatting survives, why verification is non-negotiable, and how to reformat the result into a clean searchable PDF or an editable Word document depending on whether you need to preserve the look or edit the content.
Pick the output for your goal
| Goal | Output | Note |
|---|---|---|
| Make a scan searchable | Searchable PDF (text layer) | Keeps the original look; adds hidden text |
| Get editable text | Word / plain text | Editable; layout approximate |
| Recover a table | CSV / spreadsheet | Verify every number |
| Preserve formatting | Formatted Word | Simple layouts survive; complex ones shift |
| Bulk archive | Searchable PDF/A | For long-term, findable storage |
Step by step: OCR and reformat a scan
- Get the cleanest image you can. High resolution (~300 DPI), straight, good contrast. A clean source is the single biggest factor in OCR accuracy.
- OCR with the right language. Run PDF OCR with the correct language pack to recover the text (see best free OCR).
- Decide: searchable PDF or editable text. Keep a searchable PDF to preserve the look; convert to Word if you need to edit.
- Reformat to Word if editing. Convert with PDF to Word (see PDF to formatted Word) and tidy the layout: simple pages survive, complex ones shift.
- Extract tables to a spreadsheet. For tabular data, send it to PDF to CSV (see PDF to spreadsheet) and verify every figure.
- Proofread against the original. Check numbers, names, and anything exact; OCR output is a draft until verified.
- Archive as searchable PDF/A if storing. For long-term, findable records, keep a searchable archival copy alongside any edited version.
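The first steps above (check the image, OCR with the right language, keep a searchable archival copy) can be sketched in Python. This is a minimal sketch under stated assumptions, not how ScoutMyTool's own tools work: it assumes the open-source OCRmyPDF command-line tool is installed, and the file names and the helper names `effective_dpi` and `build_ocr_command` are hypothetical, chosen only for illustration.

```python
import os
import shutil
import subprocess

MIN_DPI = 300  # the rule-of-thumb resolution suggested in the steps above


def effective_dpi(width_px: int, page_width_in: float) -> float:
    """Pixels per inch of a scan, from its pixel width and the paper width in inches."""
    return width_px / page_width_in


def build_ocr_command(src: str, dst: str, language: str = "eng",
                      archival: bool = True) -> list:
    """Build an OCRmyPDF command line that adds a hidden text layer to a scan.

    Assumption: OCRmyPDF is the OCR tool. `language` is a Tesseract language
    code such as "eng", "deu" or "fra".
    """
    cmd = ["ocrmypdf", "--language", language, "--deskew"]
    # "pdfa" writes an archival PDF/A; "pdf" writes an ordinary searchable PDF.
    cmd += ["--output-type", "pdfa" if archival else "pdf"]
    cmd += [src, dst]
    return cmd


if __name__ == "__main__":
    # Hypothetical example: an A4 page (8.27 in wide) scanned 2480 px across.
    dpi = effective_dpi(2480, 8.27)
    if dpi < MIN_DPI - 1:
        print(f"Only about {dpi:.0f} DPI: rescan before running OCR if you can.")

    src, dst = "contract-scan.pdf", "contract-searchable.pdf"  # hypothetical paths
    cmd = build_ocr_command(src, dst)
    print(" ".join(cmd))

    # Run only when the tool and the input file are actually present.
    if shutil.which("ocrmypdf") and os.path.exists(src):
        subprocess.run(cmd, check=True)
```

The command is built as a list rather than a shell string so that file names with spaces are passed through safely. The output of this step is still a draft: the proofreading step applies to whatever the OCR engine produces.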
Related reading and tools
- Best free OCR: choosing an OCR approach.
- Edit a scanned PDF: working with scans.
- PDF to formatted Word: reformatting recovered text.
- PDF to spreadsheet: recovering tables.
- PDF to Word: the general conversion.
- PDF OCR tool: recognise scanned text in your browser.
- All ScoutMyTool PDF tools: the full toolkit.
FAQ
- What does OCR actually do to a scanned document?
- OCR (optical character recognition) looks at the image of a page and recognises the shapes as characters, producing actual text from what was just a picture. A scanned PDF before OCR is an image (you cannot select, search, or edit it), and after OCR you have recoverable text. There are two common outputs: a "searchable PDF" that keeps the original scanned image but adds an invisible text layer underneath (so it looks identical but is now searchable and copyable), and an extracted-text output (Word or plain text) you can edit freely. Which you want depends on whether you need to preserve the original look or to edit the content.
- How much of the original formatting survives?
- It varies with the source and the output you choose. A searchable PDF preserves the original appearance exactly, because it keeps the scanned image and just adds hidden text. Converting to editable Word recovers the text and basic formatting (paragraphs, bold, headings), but layout is approximate, and complex multi-column or heavily designed pages shift, because OCR reconstructs structure by inference. Simple, single-column documents reformat cleanly; intricate ones need cleanup. So if exact look matters, keep the searchable-PDF route; if you need to edit, accept that a complex layout will require tidying after conversion.
- Why do I have to verify OCR output so carefully?
- Because OCR makes mistakes, and it makes them exactly where they hurt. It most often misreads numbers (a smudged 3 as an 8), characters with diacritics, and unusual fonts, and it struggles with poor-quality or skewed scans and with dense reference lists or tables. For prose, a few errors are obvious and harmless; for a financial table, a contract figure, or a name, a single misread character is a serious error that looks perfectly plausible. So OCR output is a draft, not a finished document: proofread it against the original, paying special attention to numbers, names, and anything where exactness matters, before you rely on it. Treat it as "recognise, then verify."
- How do I get the best OCR accuracy?
- Start with the best possible image. A clean, high-resolution scan (around 300 DPI), straight (deskewed), with good contrast and no shadows, OCRs far better than a crooked, low-resolution phone photo. If you are scanning, use a document-scan mode that deskews and sharpens. Choose the correct language pack for the document, since OCR tuned for the wrong language misreads accented characters. For an already-poor scan, you can improve contrast before OCR. No amount of post-processing fully fixes a bad source, so the highest-leverage step is capturing or obtaining a clean image in the first place.
- Can OCR recover tables and reformat them?
- It can extract the text in a table, but reconstructing the table structure is harder and less reliable than recognising body text: cells can merge or shift, and numbers are exactly what OCR misreads most. For a scanned table you need as data, OCR it, then extract to a spreadsheet and verify every figure against the original, especially totals. Do not trust an OCR'd financial or scientific table without reconciling it. For a table you only need to read, a searchable PDF is fine; for one you need to compute on, budget real verification time, because the cost of a silently wrong number is high.
- Should I keep a searchable PDF or convert to Word?
- Keep a searchable PDF when you want the document to look exactly like the original but be findable and copyable. That is ideal for archiving scanned records, contracts, and references, where you are not editing but want to search. Convert to Word when you actually need to edit the content: revise the text, repurpose it, restructure it. Many workflows use both: a searchable PDF/A for the archive and a Word extraction for the working copy. The searchable PDF is the safer default for records because it preserves the original; reach for Word conversion when editing is the goal.
- Is it safe to OCR a confidential scan online?
- Scanned documents are often exactly the sensitive records (contracts, IDs, medical or financial papers) you would not want uploaded, so prefer a tool that processes files locally. ScoutMyTool runs OCR and conversion entirely in your browser tab, so the scan never leaves your machine. Many online OCR services upload your file to a server. For anything confidential, confirm the tool does not upload before using it.
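Two of the answers above, on verifying OCR output and on reconciling tables, lend themselves to a small helper. The sketch below is one possible way to focus a proofreading pass, not a substitute for reading against the original. It uses only the Python standard library, and the function names and sample text are hypothetical.

```python
import re
from decimal import Decimal


def tokens_to_check(text: str) -> list:
    """Return the tokens most worth checking against the scan.

    Flags anything containing a digit (amounts, dates, clause numbers) or a
    non-ASCII character (names with diacritics), the cases the FAQ singles out.
    """
    flagged = []
    for token in text.split():
        word = token.strip(".,;:()[]\"'")  # drop surrounding punctuation
        if word and (any(ch.isdigit() for ch in word) or not word.isascii()):
            flagged.append(word)
    return flagged


def parse_amount(cell: str) -> Decimal:
    """Parse a cell such as '$1,250.00' into an exact Decimal."""
    return Decimal(re.sub(r"[^\d.\-]", "", cell))


def column_matches_total(cells, stated_total: str) -> bool:
    """True when the OCR'd cells add up to the total printed in the table."""
    return sum(parse_amount(c) for c in cells) == parse_amount(stated_total)


if __name__ == "__main__":
    ocr_text = "Payment of $3,500.00 is due by 12 March 2026."
    print(tokens_to_check(ocr_text))

    # A smudged 3 read as an 8 breaks the reconciliation, which is the point.
    print(column_matches_total(["1,250.00", "3,500.00", "249.99"], "4,999.99"))
    print(column_matches_total(["1,250.00", "8,500.00", "249.99"], "4,999.99"))
```

`Decimal` is used instead of `float` so that currency amounts compare exactly. A matching total is a useful signal but not proof, since two misreads can cancel out, so each figure still needs a look against the original.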
Citations
- Wikipedia, "Optical character recognition," how OCR recognises text from images. en.wikipedia.org (OCR)
- Wikipedia, "Tesseract (software)," a widely used open-source OCR engine. en.wikipedia.org/wiki/Tesseract_(software)
- Wikipedia, "Office Open XML" (ISO/IEC 29500), the .docx target for editable reformatted text. en.wikipedia.org/wiki/Office_Open_XML
Recover the text, keep it private
OCR a scan and reformat it into searchable PDF or editable Word with ScoutMyTool's in-browser tools. Your scanned documents never leave your machine.
Open PDF OCR →