Convert scanned PDF to Word — preserve formatting

The OCR-then-convert two-step pipeline, which tools handle layout best, and realistic expectations.

By ScoutMyTool Editorial Team · Last updated: 2026-05-20

Introduction

A client once asked me to "make a few small edits" to a 60-page contract he had only as a scanned PDF. The first converter I tried produced a Word document where every page was a single image with a transparent text layer floating on top, which is not really editable. The second produced clean text but collapsed the multi-column layout into a single column and flattened the tables into paragraphs. The third, Adobe Acrobat Pro, produced something close to the original, but cost $19.99 for a month's subscription I had not budgeted for. The lesson: scanned-PDF-to-Word is a two-step process, and the second step (layout reconstruction) is where most free tools fall short. This article is the practical guide.

The two-step pipeline

Every scanned-PDF-to-Word conversion is some variant of the same pipeline:

  1. OCR (Optical Character Recognition). The engine reads the pixels of each page, identifies characters, and produces a text stream with positional information for each word. Tesseract is the de facto open-source engine; Adobe and ABBYY use proprietary engines; cloud providers (Google Document AI, AWS Textract) expose their own.
  2. Layout reconstruction. The OCR output is converted into a Word document with paragraphs, headings, tables, lists, and embedded images placed in their original positions. This is the harder step: paragraphs are straightforward, but tables, multi-column layouts, and mixed text-and-image regions are all much harder to place correctly.

Step 1 is necessary only because the PDF is scanned: its pages hold images rather than the text objects that ISO 32000-1 defines for text-based PDFs [1]. Step 2 targets the Office Open XML structure of the .docx output (ISO/IEC 29500) [2]. Converters split their effort between the two steps differently; the best ones invest heavily in step 2 because that is where user-visible quality is decided.

Six tools compared

| Tool | Cost | OCR engine | Layout fidelity | Best for |
|---|---|---|---|---|
| Adobe Acrobat Pro | $19.99/month or annual; one-month rental possible | Adobe proprietary | Best on complex layouts (multi-column, tables, footnotes) | Highest fidelity needed; budget allows; documents where re-editing matters more than the cost |
| Google Drive / Docs | Free with Google account | Google (cloud) | Good on simple single-column text; loses tables and complex layouts | Quick free conversion of clean single-column documents; comfortable uploading to Google |
| ABBYY FineReader PDF | $199/year or perpetual licence ~$200 | ABBYY (proprietary, the long-time industry leader) | Excellent: often matches Adobe on layout, sometimes exceeds it on languages | Heavy professional use (legal e-discovery, archival, multi-language documents) |
| Tesseract + LibreOffice (CLI) | Free, open source | Tesseract 5 | Reasonable on simple documents; manual layout fixes needed on complex ones | Linux / macOS power users; batch pipelines; privacy-conscious workflows |
| ScoutMyTool PDF to Word | Free, ad-supported | Tesseract.js (browser-based) for scans + pdf-lib for text PDFs | Solid on simple layouts; degrades on heavy multi-column / table content | No-install conversion of single-column documents; nothing leaves your machine |
| Cloud OCR services (Google Document AI, AWS Textract) | Per-page (~$1.50/1k pages) | Provider-specific | Excellent on structured forms; good on documents | Bulk pipelines extracting structured data from many forms; not pure document conversion |

For most users the choice is between Adobe Acrobat Pro (paid, best fidelity), ABBYY FineReader (paid, also excellent and historically the OCR specialist), or one of the free options. For a one-off conversion of a single document where re-editing matters, a one-month Adobe Acrobat Pro rental at $19.99 is usually the cheapest path to a clean Word file. For everyday conversions of simple documents, the free options are sufficient.

What survives the conversion, realistically

"Preserve formatting" is an honest aspiration but the reality varies by element. Here is what to expect from a modern (2026) converter pipeline.

| Element | Realistic expectation |
|---|---|
| Plain text in single column | Excellent: modern OCR plus word-processor conversion handles this near-perfectly. |
| Multiple columns (newspaper style) | Variable. Adobe Acrobat Pro and ABBYY usually preserve columns; lower-tier converters often merge or scramble them. |
| Tables with rules | Adobe and ABBYY recognise table structure and emit Word tables; simpler tools emit a grid of text boxes or a flattened paragraph. |
| Tables without rules (whitespace-aligned) | Difficult for all converters. Manual cleanup almost always needed. |
| Headers and footers | Usually preserved but may be inserted as body paragraphs rather than Word header/footer regions. |
| Page numbers | Recognised but may need to be re-assigned to Word's page-number feature for clean re-pagination. |
| Footnotes | Adobe handles footnote-to-Word conversion best; simpler tools render footnotes as bottom-of-page paragraphs. |
| Images and diagrams | Embedded as images in Word at the OCR resolution. Text inside the image becomes selectable but the image itself is not editable as vector. |
| Custom fonts | Substituted with the closest Word system font. Visual appearance will shift unless the recipient has the original font installed. |
| Mathematical equations | Poor across the board. LaTeX/equation OCR is a separate specialised category (Mathpix, InftyReader). General OCR garbles equations. |
| Handwriting | Cloud OCR (Google Document AI, AWS Textract) handles hand-printed (block-letter) text better than open-source engines; cursive remains hard for every engine. |

Converting via ScoutMyTool — five steps

  1. Verify the source is actually a scanned PDF. Open the PDF and try to select a word with click-drag. If text highlights, the PDF already has a text layer and you can skip the OCR step — use PDF to Word directly. If nothing highlights, continue to the OCR step.
  2. Run OCR first (recommended). ScoutMyTool's PDF OCR tool produces a searchable PDF as an intermediate artefact. Doing OCR explicitly first is recommended because the searchable PDF is independently useful and lets you verify the OCR text before committing to a Word conversion.
  3. Pick languages. Default is English; the picker offers other common languages. For multi-language documents, pick all that apply.
  4. Convert the searchable PDF to Word. Open PDF to Word, drop in the OCRed PDF, and the tool writes a .docx with the OCR text laid out as paragraphs. Tables and complex layouts may need manual cleanup; single-column text usually lands cleanly.
  5. Review and clean up. Open the .docx in Word, scan the first and last pages for OCR errors, fix any obvious mistakes, and re-apply formatting that did not transfer (table structure, custom fonts, headers/footers if needed). For critical documents, allocate roughly as much time to review as to conversion.
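
The click-drag check in step 1 can also be scripted when there are many files to triage. The following is a minimal sketch, assuming the pdftotext utility from poppler-utils is installed. The function names and the 20-character threshold are illustrative choices for this example, not part of any tool mentioned here.

```shell
#!/usr/bin/env bash
# Heuristic: a PDF with a text layer yields visible characters from pdftotext;
# an image-only scan yields almost none.

# Count non-whitespace bytes arriving on stdin.
count_visible_chars() {
  tr -d '[:space:]' | wc -c | tr -d ' '
}

# Succeed (exit 0) if the first three pages contain extractable text.
# Assumes pdftotext is on PATH; the default threshold of 20 is arbitrary.
has_text_layer() {
  local pdf="$1" threshold="${2:-20}" chars
  chars=$(pdftotext -l 3 "$pdf" - 2>/dev/null | count_visible_chars)
  [ "${chars:-0}" -ge "$threshold" ]
}

# Usage (hypothetical file name):
#   has_text_layer contract.pdf && echo "text PDF" || echo "needs OCR"
```

A scan that has already been through OCR will also pass this check, which is the desired result: it can go straight to the Word conversion.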

The CLI path — for batch jobs

For Linux / macOS users with many scanned PDFs to convert, the combination of ocrmypdf and libreoffice --headless handles batch jobs without per-page cost or rate limits:

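# Step 1: OCR every PDF in the current directory
# (--clean needs the optional unpaper package; drop the flag if it is not installed)
for f in *.pdf; do
  case "$f" in ocr_*) continue ;; esac   # skip outputs left by an earlier run
  ocrmypdf --deskew --clean "$f" "ocr_$f"
done

# Step 2: Convert each OCRed PDF to .docx
# (--infilter opens the PDF in Writer; by default LibreOffice opens PDFs in Draw)
for f in ocr_*.pdf; do
  libreoffice --headless --infilter="writer_pdf_import" --convert-to docx "$f"
done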

LibreOffice's headless PDF-to-DOCX conversion is reasonable on single-column text. For complex layouts the same caveats apply as with any free converter — plan for cleanup in Word afterwards.
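
For multi-language documents, the command-line counterpart of a language picker is ocrmypdf's -l option, which takes Tesseract language codes joined with "+". The sketch below assumes the matching Tesseract language data is installed for every code passed in; join_langs and ocr_multilang are hypothetical helper names used only for illustration.

```shell
#!/usr/bin/env bash
# Join language codes with "+", e.g. "eng fra deu" becomes "eng+fra+deu".
join_langs() {
  local IFS='+'
  printf '%s\n' "$*"
}

# OCR one PDF using several languages at once.
# Arguments: input PDF, output PDF, then one or more Tesseract language codes.
ocr_multilang() {
  local in="$1" out="$2"
  shift 2
  ocrmypdf -l "$(join_langs "$@")" --deskew "$in" "$out"
}

# Usage (hypothetical file names):
#   ocr_multilang scan.pdf ocr_scan.pdf eng fra
```

Listing only the languages that actually occur in the document is usually the safer choice, since every extra language gives the engine more ways to misread a word.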

When NOT to convert to Word

Three scenarios where the Word output is not the right destination:

  1. You only need to search the document. If the goal is "find every instance of the word X", OCR the PDF and stop there — the searchable PDF answers the question without the layout-reconstruction step. See How to make a PDF searchable with OCR for that workflow.
  2. You need a structured spreadsheet, not prose. If the scanned document is a table of data and you want it in Excel, use a converter built for tables (Adobe Acrobat Pro's PDF-to-Excel, ABBYY FineReader, or a cloud OCR service with structured extraction). Converting to Word and then copy-pasting into Excel loses column boundaries.
  3. The document needs to remain visually identical. Word conversion always changes appearance. For documents that must look identical to the original (signed contracts, legal exhibits), keep the PDF and edit a separate cover document rather than reconstructing the original.
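
For the search-only case in item 1, a searchable PDF can be queried from the command line without any Word conversion. This is a sketch under the assumption that pdftotext (poppler-utils) is installed; it relies on pdftotext separating pages with a form-feed character, and pages_with_term is a hypothetical helper name.

```shell
#!/usr/bin/env bash
# Read pdftotext output on stdin and print the number of every page
# that contains TERM (case-insensitive, plain substring match).
pages_with_term() {
  awk -v term="$1" '
    BEGIN { RS = "\f" }                           # one record per page
    index(tolower($0), tolower(term)) { print NR }
  '
}

# Usage (hypothetical file name):
#   pdftotext ocr_contract.pdf - | pages_with_term "indemnify"
```

Because the match runs on OCR output, a term that the engine misread will be missed, so an empty result on a poor scan is not proof that the word is absent.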

Frequently asked questions

Why is a scanned PDF harder to convert to Word than a regular PDF?
Because a scanned PDF has no text — only pixels. A regular PDF exported from Word, Google Docs, or LibreOffice contains characters embedded as text in the file structure; converting it to Word is mostly a re-flow problem. A scanned PDF needs OCR first to recognise characters from pixels, then the recognised text needs to be laid out as Word paragraphs, tables, and headings. Two error-prone steps instead of one, and the layout-reconstruction step is the harder of the two. The most-faithful conversion engines (Adobe Acrobat Pro, ABBYY FineReader) spend most of their effort on layout reconstruction, not the OCR step itself.
How accurate is the conversion in 2026?
For clean 300 DPI scans of single-column printed text, character-level OCR accuracy is typically 95–99 percent and the resulting Word document is usable as-is for most editing purposes. As layout complexity increases (multi-column briefs, tables, footnotes, mixed images), accuracy degrades sharply. For litigation-quality conversions where every character matters, plan to spend roughly as much time on post-conversion review as on the conversion itself. For "I just need to fix a typo in a contract" the conversion is usually clean enough to work with directly.
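
To see what those percentages mean in practice, it helps to turn them into an error count. The sketch below assumes roughly 2,000 characters per page, which is an illustrative figure rather than a measurement; expected_errors is a hypothetical helper name.

```shell
#!/usr/bin/env bash
# Expected number of misread characters, rounded to the nearest integer.
# Arguments: total characters, character-level accuracy in percent.
expected_errors() {
  awk -v chars="$1" -v acc="$2" \
    'BEGIN { printf "%d\n", chars * (100 - acc) / 100 + 0.5 }'
}

expected_errors 2000 99      # one page at 99%: prints 20
expected_errors 2000 95      # one page at 95%: prints 100
expected_errors 120000 99    # 60 pages at 99%: prints 1200
```

Even at the top of the range, a 60-page document under this assumption carries on the order of a thousand misread characters, which is why review time is worth budgeting for when every character matters.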
Should I OCR the scan first, or use a "scan-to-Word" tool that does both steps?
Either works. Doing OCR first (with a dedicated OCR tool like ocrmypdf or ScoutMyTool's PDF OCR) gives you a searchable PDF as an intermediate artefact, which is independently useful. Doing both steps in one tool (Adobe Acrobat Pro, ABBYY FineReader) is faster and produces slightly tighter integration between the OCR text and the Word layout. For one-off conversions, one-tool is fine; for batch pipelines where you want the searchable PDF as a separate output, do OCR first and then convert.
Will Word be able to preserve the tables in the scanned document?
Depends on the table style and the converter. Tables with visible borders (ruled tables) are recognised well by Adobe Acrobat Pro and ABBYY FineReader — they emit proper Word table structures with the right number of rows and columns. Tables without borders (cells separated only by whitespace alignment) are much harder; the converter has no visible signal telling it where the cell boundaries are. For unruled tables, expect to recreate the table structure manually in Word. ScoutMyTool, Tesseract+LibreOffice, and Google Docs all struggle on complex tables; ABBYY and Adobe Pro do better but are not perfect.
Is my scanned PDF uploaded when I use a free converter?
It depends on the tool. ScoutMyTool's PDF-to-Word tool runs entirely in your browser tab using Tesseract.js (for the OCR step) and a client-side .docx writer — nothing transits a server. Google Docs uploads the PDF to Google Drive and runs OCR + conversion there. Adobe Acrobat Pro and ABBYY FineReader run locally on your desktop. For confidential scanned documents (medical records, legal exhibits, HR files), the client-side or desktop options are the right choice; server-uploading converters carry a privacy cost worth being deliberate about.
Can I batch-convert hundreds of scanned PDFs?
Yes, but the right tool depends on your environment. On Linux / macOS, the command-line stack (ocrmypdf for the OCR pass, then LibreOffice headless conversion with libreoffice --headless --infilter="writer_pdf_import" --convert-to docx) handles batches of thousands per machine without licensing constraints. On Windows, ABBYY FineReader has a batch processor; Adobe Acrobat Pro has Action Wizard. For cloud scale, Google Document AI or AWS Textract scale horizontally at per-page cost. Avoid free web converters for batch work: they will rate-limit you and each conversion adds upload time.
Will custom fonts in the scanned document survive into Word?
No. OCR recognises characters as character codes (the Unicode value of each letter) but does not preserve the original font. The resulting Word document uses a substitute font for the recognised text, either the converter's closest match or your default Word font. The visual appearance will differ from the original scan: same words, different font. If preserving the typeface matters, install the original font and apply it to the text manually after conversion. For most uses, the visual shift does not matter; for branded marketing documents or legal exhibits where formatting is part of the record, plan for that manual step.

References

  1. ISO 32000-1:2008, Document management — Portable document format — Part 1: PDF 1.7. Public reference copy: opensource.adobe.com PDF32000_2008. Distinction between text content streams (§9) and image-only page content (§8.9) is the foundation for why OCR is needed.
  2. ISO/IEC 29500-1:2016, Information technology — Document description and processing languages — Office Open XML File Formats — Part 1: Fundamentals and Markup Language Reference. iso.org standard 71691 (accessed May 2026). The WordprocessingML schema that the target .docx output must conform to.
  3. Tesseract OCR project (Google, formerly HP), Tesseract User Manual. tesseract-ocr.github.io (accessed May 2026). The open-source OCR engine that powers ScoutMyTool, ocrmypdf, and most free OCR-to-Word pipelines.