PDF for translators: preserve formatting across languages

By ScoutMyTool Editorial Team · Last updated: 2026-05-20

PDF is the most resented source format in professional translation. The file is not directly editable, CAT tools cannot segment it natively, image-only scans need OCR before any translation can start, and the layout almost always needs rework afterwards because target-language text rarely occupies the same space as the source. The standard workflow, settled over years of practice, is: extract to an editable format, translate in a CAT tool, re-layout, export back to PDF. This article maps the workflow stages, the CAT tool choices that affect cost and capability, and the language-specific considerations (text expansion, RTL, OCR language packs) that decide quality.

CAT tools — feature and cost comparison

Tool	Cost	Best for
Trados Studio (RWS, formerly SDL Trados Studio)	€695 one-time + upgrades	Industry-default CAT tool; large agency workflows
memoQ	€620 one-time + maintenance	Project managers; team-translation workflows
Wordfast Pro	€400 one-time	Solo translators; cost-conscious workflows
OmegaT (open source)	Free	Budget; basic CAT features; limited PDF handling
Smartcat (browser SaaS)	Free tier + per-word paid	Cloud collaboration; no installation
Crowdin / Phrase (platform)	Subscription	Primarily software localisation rather than document translation

Step by step — translate a PDF document

1. OCR if scanned. Use ScoutMyTool Make PDF Searchable with the matching source-language pack to add a text layer to the PDF.
2. Convert the PDF to DOCX using PDF to Word with structure preservation enabled.
3. Import the DOCX into a CAT tool (Trados, memoQ, Wordfast, OmegaT). Segment by sentence; apply translation memory if available.
4. Translate segment by segment, using the CAT tool's TM matches and terminology database. Run QA checks (consistency, numerical accuracy) on completion.
5. Export the DOCX from the CAT tool, re-layout in Word so the target-language text fits, and export back to PDF. Add target-language metadata (for example, the Title field in the target language).
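
For translators who run the OCR step on the desktop with OCRmyPDF rather than in the browser, the first step above can be scripted. The sketch below only builds the command line; it assumes OCRmyPDF and the relevant Tesseract language data are installed, and the file names and the `build_ocr_command` helper are hypothetical. The language codes are Tesseract's three-letter codes (for example `deu` for German).

```python
import shlex


def build_ocr_command(src_pdf, dst_pdf, languages):
    """Return an OCRmyPDF command line as a list of arguments.

    `languages` holds Tesseract language codes such as "deu" or "eng".
    """
    if not languages:
        raise ValueError("at least one Tesseract language code is required")
    return [
        "ocrmypdf",
        "-l", "+".join(languages),  # several languages are joined with "+"
        "--deskew",                 # straighten skewed scans before OCR
        "--skip-text",              # leave pages that already have text untouched
        src_pdf,
        dst_pdf,
    ]


# Hypothetical file names, for illustration only.
command = build_ocr_command("scan.pdf", "scan_searchable.pdf", ["deu", "eng"])
print(shlex.join(command))
```

Passing `command` to `subprocess.run(command, check=True)` would execute it. Building the argument list separately keeps the language choice visible and easy to review before a long OCR job starts.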

Post-translation QA checklist

Six checks before delivery. First, every segment is translated (no leftover source language). Second, numbers and dates are preserved (currency symbols and date formats localised for the target locale; numerical values unchanged). Third, proper nouns are rendered consistently (names, places, brand terms). Fourth, formatting is preserved (bold, italic, and headings carry over to the target). Fifth, hyperlinks still work (URLs preserved, hover tooltips translated if visible). Sixth, the layout is sensible (no text overflowing boxes, no broken pagination, no misaligned tables). Each check takes a few minutes; together they catch the issues that cause client revisions.
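
The first two checks can be partly automated. The sketch below is a rough illustration, not a replacement for the QA module of a CAT tool: it flags empty or untranslated targets and compares the digits in source and target while ignoring thousands and decimal separators, so that `1,250.00` and `1.250,00` count as the same value. The function names are hypothetical, and numbers written out as words are not detected.

```python
import re

# A run of digits, optionally grouped with "." or "," separators.
NUMBER = re.compile(r"\d+(?:[.,]\d+)*")


def digits_only(text):
    """Return the numbers in `text` with separators removed, sorted."""
    return sorted(re.sub(r"[.,]", "", match) for match in NUMBER.findall(text))


def qa_segment(source, target):
    """Return a list of issue labels for one source/target segment pair."""
    issues = []
    if not target.strip():
        issues.append("empty target")
    elif target.strip() == source.strip():
        issues.append("target identical to source")
    if digits_only(source) != digits_only(target):
        issues.append("numbers differ")
    return issues


clean = qa_segment(
    "Pay 1,250.00 EUR by 31 March 2025.",
    "Zahlen Sie 1.250,00 EUR bis zum 31. März 2025.",
)
wrong_dose = qa_segment("Dose: 50 mg", "Dosis: 500 mg")
untranslated = qa_segment("Terms and conditions", "Terms and conditions")
print(clean, wrong_dose, untranslated)
```

Sorting the digit strings means a reordered date does not raise a false alarm. The trade-off is that two numbers swapped between positions would also pass, so the check narrows the manual review rather than replacing it.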

For high-stakes translation (legal contracts, medical labelling, regulatory filings), a second human reviewer (revision pass) is standard practice — one translator translates, another reviews. The two-step process catches errors a single-pass workflow misses. The cost is roughly double; the quality differential is meaningful enough that most professional agencies include the revision pass as standard.

Working with embedded images and figures

Diagrams, charts, and illustrations in translated PDFs often contain text that needs separate translation: labels on a flowchart, axis labels on a chart, callouts on a diagram. There are three approaches. First, source the original graphic files (SVG, AI, PSD) from the client and re-create them with translated labels; this gives the best fidelity but takes the most effort. Second, edit labels in place using a PDF editor that supports text editing inside images (Acrobat Pro's OCR + edit); this takes moderate effort and gives acceptable quality. Third, leave images as they are and provide a translation key as a separate document; this takes the least effort and gives the worst recipient experience. For client-facing translated deliverables, the first approach usually pays off; for internal documents, the second is fine.

For client-facing translation packages, deliver three artefacts together: the translated PDF, the translation memory (TMX file from the CAT tool — lets the client re-use translations in future projects), and a translator's notes document covering any choices that were not obvious (preserved English brand names, untranslatable cultural references, regional-variant decisions). The bundle signals professional rigour and is what serious agencies provide as standard. Solo translators competing with agencies should match this delivery standard to compete on quality, not just price.
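
CAT tools export TMX directly, so a translator rarely has to write one by hand. To show what the client receives, the sketch below builds a minimal TMX 1.4 document from source/target pairs using only the Python standard library. The `build_tmx` helper, the header values (`creationtool`, `o-tmf`, and so on), and the sample sentences are placeholders chosen for illustration.

```python
import xml.etree.ElementTree as ET

# ElementTree serialises this qualified name as the attribute "xml:lang".
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"


def build_tmx(pairs, src_lang, tgt_lang):
    """Return a TMX 1.4 document (as a string) for (source, target) pairs."""
    tmx = ET.Element("tmx", version="1.4")
    ET.SubElement(tmx, "header", {
        "creationtool": "example-script",  # placeholder tool name
        "creationtoolversion": "0.1",
        "segtype": "sentence",
        "o-tmf": "plain-text",
        "adminlang": "en",
        "srclang": src_lang,
        "datatype": "plaintext",
    })
    body = ET.SubElement(tmx, "body")
    for source, target in pairs:
        tu = ET.SubElement(body, "tu")  # one translation unit per pair
        for lang, text in ((src_lang, source), (tgt_lang, target)):
            tuv = ET.SubElement(tu, "tuv", {XML_LANG: lang})
            ET.SubElement(tuv, "seg").text = text
    return ET.tostring(tmx, encoding="unicode")


tmx_text = build_tmx(
    [("Terms and conditions", "Allgemeine Geschäftsbedingungen")],
    "en-GB",
    "de-DE",
)
print(tmx_text)
```

Each `tu` element holds one translation unit, with one `tuv` per language. That pairing is what lets the client's own CAT tool offer the translation again as a match in a later project.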

FAQ

Why convert PDF to Word before translating instead of translating directly?

CAT (computer-aided translation) tools work natively on DOCX, not PDF. Converting PDF to Word produces an editable source that the CAT tool can segment by sentence and present alongside translation memory matches, so you translate segment by segment. Direct PDF translation tools exist (Adobe's in-app tool, Google Translate Documents), but they typically discard formatting and produce a translated copy without the segmentation that makes CAT work efficient. For professional translation with translation memory and term management, convert to DOCX first, translate, then re-export to PDF.

How do I preserve PDF layout after translating to a longer language?

Many language pairs change text length significantly: English → German typically grows 20–30%; English → Spanish 15–25%; English → Russian 15–20%. The translated text then overruns the original layout. Three mitigations help. First, design the source with a text expansion budget: leave white space and avoid tight text boxes. Second, accept layout reflow during translation and re-layout afterwards rather than trying to force the translation into the original space. Third, use translation memory and shorter-translation suggestions when meaning permits; sometimes five words convey what took seven in the source. For high-stakes localisation, work with the designer on post-translation re-layout as a separate billable step.

How do I OCR a scanned PDF in the source language before translating?

Use Tesseract with the matching language pack. Tesseract supports 100+ languages with separate language data files, and common languages (English, French, German, Spanish, Italian, Portuguese, Dutch, Russian, Arabic, Chinese, Japanese) ship as standard packs. For best accuracy, ensure the source scan is 300+ DPI, deskewed, and high-contrast. OCR quality largely determines downstream translation quality: garbled OCR produces a nonsense source that no amount of translation skill can recover. ScoutMyTool Make PDF Searchable supports pre-bundled language packs; OCRmyPDF on desktop covers the full Tesseract language set.

What about machine translation as a starting point (DeepL, Google, GPT)?

Modern MT (DeepL, Google Translate, GPT-4) produces usable drafts for high-resource language pairs (English ↔ most European languages, English ↔ Chinese / Japanese / Korean). Use the output as a first pass to be post-edited by a human translator (MTPE, machine translation post-editing). For low-resource pairs (English ↔ less-common languages, or pairs that do not involve English, such as Tagalog ↔ Welsh), MT output is rougher and needs more rework. Always disclose MT usage to the client if the client expects fully human translation; some industries (legal, medical) prohibit MT or require explicit consent.

How do I handle RTL languages (Arabic, Hebrew) in a translated PDF?

Layout direction inverts. Single-language Arabic / Hebrew PDFs use right-aligned text, right-to-left reading order, and page numbers on the left. Bilingual PDFs (Arabic-English contracts, for example) commonly place Arabic on the right page and English on the left in a double-page spread. Mixed-direction text is ordered by the layout engine's Unicode bidirectional algorithm once paragraphs are tagged for direction; the font (Noto Naskh Arabic, Cairo) has to supply the Arabic glyph shaping. Word and InDesign support RTL layout when their Middle Eastern language features are enabled; Google Docs handles basic RTL but with rough edges on complex layouts. For professional Arabic / Hebrew PDF translation, use a design tool set up for RTL composition rather than a standard Western configuration.
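
The expansion figures quoted in this FAQ can be turned into a quick pre-flight estimate for text boxes. The sketch below is a rough planning aid under stated assumptions: it applies the percentage ranges to a character count (the ranges above do not say whether they refer to characters or words), and the box capacity is a number the designer would have to supply. The names `EXPANSION`, `expansion_range`, and `fits` are hypothetical.

```python
# Typical growth from English, taken from the ranges quoted in this FAQ.
EXPANSION = {
    "de": (0.20, 0.30),  # German
    "es": (0.15, 0.25),  # Spanish
    "ru": (0.15, 0.20),  # Russian
}


def expansion_range(source_chars, target_lang):
    """Return (low, high) estimated target length in characters."""
    low, high = EXPANSION[target_lang]
    return round(source_chars * (1 + low)), round(source_chars * (1 + high))


def fits(source_chars, target_lang, box_capacity):
    """True if even the high estimate stays within the box capacity."""
    _, worst_case = expansion_range(source_chars, target_lang)
    return worst_case <= box_capacity


german = expansion_range(400, "de")
print(german, fits(400, "de", 500), fits(400, "ru", 500))
```

Testing against the high end of the range is deliberately pessimistic: a box that fails the check is a candidate for re-layout or a shorter wording, and one that passes may still need a visual check after the real translation is in place.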

Citations

ISO 17100 — Translation services requirements standard.
ISO 18587 — Post-editing of machine translation output requirements.
SDL Trados — CAT tool feature documentation.
Tesseract OCR — language pack documentation.

PDF-to-Word for translation in your browser

ScoutMyTool PDF to Word converts client-side. Source files stay on your machine through the CAT-tool prep step.
