PDF for translation agencies: segment management and terminology

Get editable text out of PDF source for CAT tools, manage terminology and glossaries, deliver multilingual PDFs with correct fonts, and reconstruct layout.



By ScoutMyTool Editorial Team · Last updated: 2026-05-22

Introduction

For a translation agency, a PDF source file is a mixed blessing: it is what clients send, and it is the worst format to translate from, because CAT tools want clean, segmentable text and a PDF is fixed-layout output. So the PDF work in a translation workflow is mostly at the edges: extracting clean text at intake, pulling terminology from PDF references, and rendering correct multilingual PDFs at delivery. This guide covers that workflow: recovering translatable text from PDF source, managing glossaries and term bases, getting target-language fonts and layout right, and keeping a multilingual project's files organised. (For one-off PDF translation, see the companion translate guides.)

Where PDF touches the workflow

Stage | PDF task
Intake (PDF source) | Extract editable text for the CAT tool
Prep | Segment; build/apply glossary & TM
Terminology | Maintain term base; ensure consistency
Delivery (target lang) | Render PDF with correct fonts/scripts
Layout | Reconstruct layout (text expansion, RTL)

Step by step: PDF in a translation pipeline

  1. Ask for the editable source first. If the client has the original (Word, etc.), use it; it avoids the whole extraction step.
  2. Extract clean text from PDF source. Convert with PDF to Word (OCR scans first with PDF OCR; see OCR + reformat), then fix line breaks and remove artifacts so it segments well.
  3. Extract terminology from references. Pull client glossaries/term lists into a term base with PDF to CSV; verify the pairs.
  4. Translate in your CAT tool. With clean segments, apply translation memory and the term base for consistency (the agency's core process).
  5. Render the target-language PDF. Embed the target script's fonts and verify rendering; see multilingual PDFs and translating a PDF.
  6. Reconstruct the layout. Accommodate text expansion/contraction and RTL in the editable file, then export the delivery PDF.
  7. Organise per project and language. Clear naming/versioning so the right source and target versions are never confused; process client files locally.
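
Steps 2 and 4 both depend on text that splits into clean segments. As a rough illustration only, a naive sentence splitter could look like the sketch below. It is not how any particular CAT tool segments (real tools use configurable, language-specific rules); the function name `split_segments` and the abbreviation list are hypothetical, and the pattern assumes English-like source text where a new sentence starts with an uppercase letter or a digit.

```python
import re

# Hypothetical, deliberately short abbreviation list; extend it per language.
ABBREVIATIONS = {"e.g.", "i.e.", "etc.", "Dr.", "Mr.", "Mrs.", "No."}


def split_segments(text: str) -> list[str]:
    """Split cleaned text into sentence-like segments.

    Breaks after ., ! or ? when followed by whitespace and an uppercase
    letter or digit, unless the token before the break is a known abbreviation.
    """
    segments = []
    start = 0
    for match in re.finditer(r"[.!?]+(?=\s+[A-Z0-9])", text):
        end = match.end()
        last_token = text[start:end].split()[-1]
        if last_token in ABBREVIATIONS:
            continue  # e.g. "Dr. Smith" stays inside one segment
        segments.append(text[start:end].strip())
        start = end
    tail = text[start:].strip()
    if tail:
        segments.append(tail)
    return segments
```

Running it on poorly cleaned text (stray line breaks, leftover headers) produces fragments rather than sentences, which is why the cleanup in step 2 comes first.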

FAQ

Why is a PDF a difficult source for translation?
Because translation tools (CAT tools) work on editable, segmentable text, and a PDF is a fixed-layout output, not a clean source: text may be split awkwardly across lines, mixed with layout artifacts, or (if scanned) not be real text at all. So a PDF is the worst common starting point for translation, and the first job is recovering clean, editable text from it. Where possible, ask the client for the original editable source (Word, etc.) instead of the PDF, since that avoids the extraction step entirely. When you only have the PDF, you extract the text as cleanly as you can and accept some preparation work before it is ready for the CAT tool.
How do I get translatable text out of a PDF?
Convert the PDF to an editable format (Word/text) to recover the content, or OCR it first if it is a scan (no real text otherwise), then clean up the extracted text (fix broken line breaks, remove headers/footers and artifacts, and structure it) so it segments well in your CAT tool. The cleaner the extraction, the less manual preparation and the better the translation-memory leverage. Verify OCR'd text carefully, since errors propagate into the translation. This extraction-and-cleanup is the unglamorous but essential front end of translating PDF source; investing in clean text up front pays off across the whole job.
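
The cleanup described above can be sketched in a few lines. This is a minimal example under stated assumptions, not a complete solution: it assumes you already have the extracted text as one string per page, and the function name `clean_extracted_text` and the "repeats on more than half the pages" rule for spotting running headers and footers are illustrative choices.

```python
import re
from collections import Counter


def clean_extracted_text(pages: list[str]) -> str:
    """Join per-page text into paragraphs that are ready for segmentation."""
    # Assumption: a line repeated verbatim on more than half the pages
    # is a running header or footer.
    line_counts = Counter()
    for page in pages:
        line_counts.update({ln.strip() for ln in page.splitlines() if ln.strip()})
    threshold = len(pages) / 2
    repeated = {
        ln for ln, n in line_counts.items() if len(pages) > 1 and n > threshold
    }

    kept = []
    for page in pages:
        for line in page.splitlines():
            stripped = line.strip()
            if stripped in repeated or re.fullmatch(r"\d+", stripped):
                continue  # drop running header/footer or a bare page number
            kept.append(stripped)

    text = "\n".join(kept)
    text = re.sub(r"(\w)-\n(\w)", r"\1\2", text)  # rejoin words hyphenated at line end
    text = re.sub(r"(?<!\n)\n(?!\n)", " ", text)  # single line break becomes a space
    text = re.sub(r"\n{2,}", "\n\n", text)        # normalise paragraph breaks
    text = re.sub(r"[ \t]+", " ", text)           # collapse runs of spaces
    return text.strip()
```

Two known limits: the hyphen rule also joins genuine compounds that happen to break at a line end, and the page-number rule drops any line consisting only of digits. Both need a human check before the text goes into the CAT tool.
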
How do PDFs fit into terminology and glossary management?
Terminology consistency is central to quality, and PDFs intersect with it in two ways: client reference materials and existing glossaries often arrive as PDFs you need to extract terms from (extract a glossary table to a spreadsheet/term base rather than re-keying), and you may deliver bilingual glossaries or term bases as PDFs. So extract terminology from PDF references into structured data your term base can use, and produce clean glossary PDFs for delivery or client review. Accurate term extraction (verify the pairs) keeps your term base reliable, which is what drives consistent terminology across a project and across translators.
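
Once a glossary table has been extracted to CSV, loading it and checking a translated segment against it could look like the sketch below. The column names `source` and `target` and both function names are assumptions for illustration, and the check is plain substring matching, so it will miss inflected forms and can match inside longer words.

```python
import csv
import io


def load_term_base(csv_text: str) -> dict[str, str]:
    """Read term pairs from CSV text with an assumed header row: source,target."""
    term_base = {}
    for row in csv.DictReader(io.StringIO(csv_text)):
        source = (row.get("source") or "").strip()
        target = (row.get("target") or "").strip()
        if source and target:  # skip incomplete pairs instead of guessing
            term_base[source.lower()] = target
    return term_base


def find_term_issues(source: str, target: str, term_base: dict[str, str]) -> list[str]:
    """Return source terms found in `source` whose approved target is missing from `target`."""
    src, tgt = source.lower(), target.lower()
    return [
        term
        for term, approved in term_base.items()
        if term in src and approved.lower() not in tgt
    ]
```

A hit from `find_term_issues` is a prompt for review, not proof of an error: the translator may have used a correct inflected form that simple matching does not recognise.
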
How do I deliver a translated PDF with correct fonts and scripts?
When the deliverable is a PDF in the target language, the fonts for that language's script must be embedded so it renders correctly everywhere; a translation into a language whose font is not embedded will show as substituted glyphs or boxes on the client's machine. For non-Latin scripts (Arabic, Chinese, Cyrillic, Indic) and right-to-left languages, check that the text displays and flows correctly. So embed the appropriate fonts and verify the rendering of the target-language text before delivery. Getting the multilingual rendering right is a frequent failure point, and embedding plus a visual check on the actual target text prevents the client receiving a broken-looking translation.
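
Before the visual check, it can help to know which scripts the target text actually contains and whether any of it runs right-to-left. The sketch below uses only Python's standard `unicodedata` module. It does not inspect the PDF or verify font embedding; it only tells you what the fonts need to cover. The script names are a heuristic taken from the first word of each character's Unicode name (the standard library does not expose the Unicode Script property), and both function names are hypothetical.

```python
import unicodedata


def scripts_used(text: str) -> set[str]:
    """Approximate the scripts in `text` from the first word of each letter's Unicode name."""
    scripts = set()
    for ch in text:
        if ch.isalpha():
            name = unicodedata.name(ch, "")
            if name:
                scripts.add(name.split()[0])  # e.g. "LATIN", "CYRILLIC", "ARABIC", "CJK"
    return scripts


def has_rtl(text: str) -> bool:
    """True if any character has a right-to-left bidirectional class (R or AL)."""
    return any(unicodedata.bidirectional(ch) in ("R", "AL") for ch in text)
```

If `scripts_used` reports a script the chosen font does not cover, or `has_rtl` returns True for a layout built left-to-right, that is the page to look at first in the rendered PDF.
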
How do I reconstruct the layout in the target language?
Translation changes text length (many languages expand significantly over English, others contract, and some run right-to-left), so the target-language document rarely fits the source layout exactly. Reconstructing the layout (in the source format, then exporting PDF) involves accommodating text expansion/contraction, adjusting for RTL where needed, and keeping the design intact. This desktop-publishing step is part of delivering a polished translated document and is easier from a clean editable source than from a PDF. So the layout work happens in the editable file; the PDF is the final rendered deliverable. Budget for it, especially for expansion-heavy language pairs and complex designs.
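
To see where expansion will cause layout trouble before the desktop-publishing step, a per-segment length comparison is a quick first pass. The sketch below is illustrative only: character count is a rough proxy for rendered width, and the 1.25 limit is an arbitrary example value, not a recommended threshold.

```python
def expansion_ratio(source: str, target: str) -> float:
    """Target length divided by source length, counted in characters."""
    if not source:
        raise ValueError("source segment is empty")
    return len(target) / len(source)


def flag_expansion(pairs: list[tuple[str, str]], limit: float = 1.25) -> list[int]:
    """Return indexes of (source, target) pairs whose target exceeds `limit` times the source length."""
    return [
        i for i, (src, tgt) in enumerate(pairs)
        if expansion_ratio(src, tgt) > limit
    ]
```

Short strings such as button labels and table headers tend to show the highest ratios, and they are also the ones with the least room in the layout.
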
How do I keep project files organised across languages?
Translation projects multiply files (source, extracted text, target-language versions, glossaries, references), so organise per project and per language pair, with clear naming and versioning so you always know which is the current source and which target version is current. Keep client references and delivered files together. This matters more in translation than most fields because of the file multiplication across languages and revisions; a disorganised project is where the wrong version gets delivered or a glossary update gets missed. A consistent per-project, per-language structure keeps a multilingual project under control from intake to delivery.
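
One way to make such a structure hard to get wrong is to generate paths rather than type them. The scheme below (project folder, language-pair folder, stage folder, versioned file name) is a hypothetical convention for illustration; the stage names, the `project_path` function, and the example project code are all assumptions, so adapt them to your own practice.

```python
from pathlib import PurePosixPath

# Hypothetical stage names; use whatever your agency already calls them.
STAGES = ("source", "extracted", "target", "glossary", "reference")


def project_path(project: str, src_lang: str, tgt_lang: str,
                 stage: str, name: str, version: int, ext: str) -> PurePosixPath:
    """Build a per-project, per-language-pair path with a zero-padded version number."""
    if stage not in STAGES:
        raise ValueError(f"unknown stage: {stage}")
    if version < 1:
        raise ValueError("version must be 1 or higher")
    pair = f"{src_lang}-{tgt_lang}"
    filename = f"{project}_{name}_{pair}_{stage}_v{version:02d}.{ext}"
    return PurePosixPath(project) / pair / stage / filename
```

For example, `project_path("ACME-2026-014", "en", "de", "target", "manual", 3, "pdf")` yields `ACME-2026-014/en-de/target/ACME-2026-014_manual_en-de_target_v03.pdf`. Because the project, pair, stage, and version all appear in the file name, a file copied out of its folder still identifies itself.
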
Is it safe to handle client documents with an online tool?
Translation source documents are confidential client material (often under NDA), so prefer a tool that processes files locally. ScoutMyTool extracts text, OCRs, extracts glossary data, and checks font embedding entirely in your browser tab, so client documents never leave your machine. For NDA-covered client content, confirm the tool does not upload before using it.

Citations

  1. Wikipedia: "Computer-assisted translation," the CAT-tool workflow PDFs feed. en.wikipedia.org (CAT)
  2. Wikipedia: "Translation memory," the leverage clean segments enable. en.wikipedia.org/wiki/Translation_memory
  3. Wikipedia: "Terminology," on term management for consistency. en.wikipedia.org/wiki/Terminology

Clean source in, polished translation out

Extract translatable text, pull terminology, and check multilingual rendering with ScoutMyTool's in-browser tools; client documents never leave your machine.
