PDF for translators — preserve segments for CAT tools
By ScoutMyTool Editorial Team · Last updated: 2026-05-21
When a client sends me a PDF to translate, my heart sinks a little, because I know the first hour will go not to translating but to fighting the file. Drop a raw PDF extraction into a CAT tool and you watch single sentences shatter into five or six broken segments, words split by stray hyphens, and page headers landing in the middle of paragraphs. Every one of those breaks wrecks the sentence-level matching that translation memory depends on, so your reuse collapses and your consistency suffers. The fix is not a better CAT tool — it is preparing the PDF properly first. This guide is how I get clean, correctly segmented text out of a PDF so the CAT tool can do its job, and how I hand the translation back in the format the client actually wants.
What goes wrong — and why
| Problem | Cause | Fix |
|---|---|---|
| One sentence split across many segments | Hard line breaks at every visual line | Reflow text so segments break on sentences, not lines |
| Words broken by hyphens | End-of-line hyphenation in the layout | De-hyphenate joined words before import |
| Columns interleaved into nonsense | Reading order follows layout, not logic | Extract column by column in correct reading order |
| Headers/footers mixed into the text | Running heads repeat on every page | Strip repeating page furniture before segmenting |
| No text at all to extract | The PDF is a scan (images only) | Run OCR first, then clean and segment |
| Word count wrong for the quote | Junk text inflates the count | Clean first, then count for an accurate quote |
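A quick way to see which of these problems a given file has is to count the symptoms in the raw extraction before cleaning anything. The sketch below is only an illustration: it assumes you already have the extracted text as one string per page, and the function name `diagnose_extraction` and its heuristics are my own, not part of any CAT tool or PDF library.

```python
import re
from collections import Counter


def diagnose_extraction(pages: list[str]) -> dict[str, int]:
    """Count symptoms of layout-driven extraction in a list of page texts."""
    lines = [ln.strip() for page in pages for ln in page.splitlines() if ln.strip()]

    # A letter followed by a hyphen at the end of a line suggests end-of-line hyphenation.
    hyphen_breaks = sum(1 for ln in lines if re.search(r"[A-Za-z]-$", ln))

    # A line that does not end in sentence punctuation was probably wrapped
    # mid-sentence (this also counts headers, page numbers, and hyphen breaks).
    mid_sentence_breaks = sum(1 for ln in lines if not re.search(r"[.!?:;]$", ln))

    # A line that appears on more than one page looks like a running head or footer.
    pages_containing = Counter()
    for page in pages:
        pages_containing.update({ln.strip() for ln in page.splitlines() if ln.strip()})
    repeated_lines = sum(1 for n in pages_containing.values() if n > 1)

    return {
        "lines": len(lines),
        "hyphen_breaks": hyphen_breaks,
        "mid_sentence_breaks": mid_sentence_breaks,
        "repeated_lines": repeated_lines,
    }


sample_pages = [
    "Acme Annual Report\nThe agree-\nment is valid\nfor two years.\n1",
    "Acme Annual Report\nEither party may\nterminate it.\n2",
]
print(diagnose_extraction(sample_pages))
# {'lines': 9, 'hyphen_breaks': 1, 'mid_sentence_breaks': 7, 'repeated_lines': 1}
```

If most lines end mid-sentence, as here, the file needs reflowing before import. The counts are rough signals rather than measurements, since a heading or list item also ends without punctuation.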
Step by step — prep a PDF for translation
- Check whether there is real text. If you cannot select the text, the PDF is a scan — run OCR first to create a text layer, and review the output for misread characters before going further.
- Extract to plain text. Pull the text out of the PDF, reading multi-column layouts column by column in the correct order rather than straight across the page.
- Reflow to sentence boundaries. Join the visual lines back into continuous paragraphs so segments will break on sentences, not on wherever a line happened to wrap.
- De-hyphenate and strip page furniture. Rejoin words split by end-of-line hyphens, and remove repeating headers, footers, and page numbers that would otherwise pollute your segments.
- Count for the quote. Run your source word count on the cleaned text, so the estimate reflects genuine translatable content, not extraction junk.
- Import, translate, and deliver in the client’s format. Bring the clean text into your CAT tool with your termbase and memory, translate, then deliver in the format the client needs, usually an editable source file rather than the PDF.
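The cleanup steps above (strip page furniture, de-hyphenate, reflow, count) can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions, not a definitive implementation: it assumes the text has already been extracted as one string per page in the correct reading order, that blank lines mark paragraph breaks, and that a line repeated on a majority of pages is a header or footer. All function names are hypothetical.

```python
import re
from collections import Counter


def strip_page_furniture(pages: list[str]) -> list[str]:
    """Drop lines repeated on a majority of pages and lines that are only a page number."""
    page_lines = [[ln.strip() for ln in page.splitlines()] for page in pages]
    seen = Counter()
    for lines in page_lines:
        seen.update({ln for ln in lines if ln})
    threshold = max(2, len(pages) // 2 + 1)
    furniture = {ln for ln, n in seen.items() if n >= threshold}
    cleaned = []
    for lines in page_lines:
        # Caveat: a genuine line consisting only of a number is dropped too.
        kept = [
            ln for ln in lines
            if ln not in furniture and not re.fullmatch(r"(?:Page\s+)?\d+", ln)
        ]
        cleaned.append("\n".join(kept))
    return cleaned


def dehyphenate(text: str) -> str:
    """Rejoin words split by an end-of-line hyphen, e.g. 'agree-' + 'ment'."""
    # Caveat: a real compound broken at its hyphen ('well-' + 'known') is joined
    # too, so review the output or check joins against a word list.
    return re.sub(r"(\w)-\n(?=[a-z])", r"\1", text)


def reflow(text: str) -> str:
    """Join visual lines into continuous paragraphs, keeping blank-line paragraph breaks."""
    paragraphs = re.split(r"\n\s*\n", text)
    return "\n\n".join(" ".join(p.split()) for p in paragraphs if p.strip())


def word_count(text: str) -> int:
    """Count whitespace-separated words."""
    return len(text.split())


def prepare(pages: list[str]) -> str:
    """Run the cleanup steps in order and return text ready for CAT import."""
    body = "\n".join(strip_page_furniture(pages))
    return reflow(dehyphenate(body))


pages = [
    "Acme Annual Report\nThe agree-\nment is valid\nfor two years.\n\nEither party may\n1",
    "Acme Annual Report\nterminate it with\nwritten notice.\n2",
]
clean = prepare(pages)
print(clean)
print(word_count("\n".join(pages)), "raw words;", word_count(clean), "clean words")
# 24 raw words; 15 clean words
```

The order matters: page furniture is removed first so that a sentence running across a page break can be rejoined, and de-hyphenation runs before reflow because it relies on the line breaks still being there. On this tiny sample the raw count is 24 words against 15 genuine ones, which is the gap that would otherwise end up in the quote.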
The honest caveat
None of this turns a PDF into a good source format — it just makes a bad one workable. The cleanest possible outcome is to ask the client for the original editable file (the Word document, the InDesign export, the source the PDF was made from), because that sidesteps extraction entirely and usually carries structure your CAT tool can use directly. It is always worth asking before you start. When the PDF really is all that exists — and for many legal, certified, and archival documents it is — then the prep workflow above is the difference between a job that flows and a job that fights you the whole way. Get the segments right at the start and everything downstream, from memory reuse to final consistency, falls into place.
Related reading
- PDF for translators — preserve formatting: keeping layout intact across languages.
- Convert PDF to clean text: strip junk and fix line wraps before segmenting.
- OCR explained: turning a scanned PDF into extractable text.
- Convert PDF to Word: an editable target for delivery back to the client.
- Multilingual PDFs: handling multiple languages and scripts in one document.
- Edit a scanned PDF: working with image-only source documents.
FAQ
- Why is a PDF such a difficult source format for translators?
- Because a PDF describes how text looks on a page, not how it reads as language — and translation is about language. When you extract text from a PDF, you typically get it broken at every visual line, with words hyphenated across line ends, columns interleaved, and running headers and footers dropped into the middle of paragraphs. None of that matters to a human reader of the original, but it is poison for a translation workflow: a CAT (computer-assisted translation) tool segments text into sentences, and if your source is already shattered into line-fragments, the segmentation is wrong, your translation-memory matches collapse, and you spend the job fighting the file instead of translating. A PDF is genuinely one of the worst source formats a translator can be handed, and the first real task is almost always getting clean, properly segmented text out of it.
- What does "preserving segments" actually mean?
- A segment is the unit a CAT tool works in — usually a sentence. Preserving segments means that when you extract a PDF’s text, each sentence comes through as one continuous unit rather than being chopped into the visual lines it happened to occupy on the page. This matters for two reasons. First, accuracy: a translator works sentence by sentence, and a sentence split across five segments is five times the friction and a magnet for inconsistency. Second, reuse: translation memory stores and re-offers past translations at the segment level, so if your segments are line-fragments rather than sentences, you get almost no useful matches now and contaminate the memory for every future job. Clean segmentation is therefore not cosmetic — it is what makes the whole CAT workflow function.
- How do I get clean, segmentable text out of a PDF?
- Extract to text, then clean before you import into your CAT tool. The key cleanup steps are: reflow the text so paragraphs are continuous and line breaks fall on real sentence boundaries rather than at every visual line; de-hyphenate words that were split across line ends; remove repeating headers, footers, and page numbers; and, for multi-column layouts, make sure the text is read column by column in the correct order rather than across the columns. If the PDF is a scan with no real text, you must run OCR first to create a text layer, then do the same cleanup — and check OCR output carefully, since misrecognised characters are common. Only once the text reads as clean, continuous prose should you bring it into the CAT tool to be segmented.
- Should I translate directly in the PDF?
- Almost never. PDFs are fixed-layout, so editing text directly inside one is fiddly, and translated text rarely fits the original’s space — many languages expand or contract relative to the source, so the layout breaks the moment you replace the words. The professional workflow is to get the text out of the PDF, translate it in a proper CAT environment with your termbase and translation memory, and then deliver in whatever format the client needs — often the original editable source (Word, etc.) rather than the PDF. Treat the PDF as a read-only source to extract from, not a workspace to translate in. The exception is a tiny job — a few lines on a certificate — where the overhead of extraction is not worth it.
- How do I quote accurately from a PDF source?
- Clean the text before you count it. Translators usually quote by source word count, and a raw PDF extraction is full of junk (repeated headers and footers, page numbers, broken fragments) that inflates the count and overstates the quote. The right order is to extract, clean (strip page furniture, reflow, de-hyphenate), and only then run the word count on the genuine translatable text. For scanned PDFs, OCR first or you will be counting nothing at all. An accurate count protects both sides and avoids the awkward mid-project conversation about scope, so it is worth the few extra minutes of cleanup before you send the estimate.
- Is it safe to process a client’s confidential PDF online?
- Only if the tool runs on your own device. Translators routinely handle confidential and even legally privileged material (contracts, patents, medical records, personal data), and many online PDF tools upload the file to a third-party server to process it, which can breach the confidentiality clause in your client agreement and data-protection rules. Client-side (in-browser) tools extract, clean, and convert locally so the file never leaves your computer; ScoutMyTool’s PDF tools work this way. For any client document, confirm the tool is client-side before you open the file in it, or use offline software, and treat the source file with the same confidentiality you owe the translation itself.
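To make the FAQ's point about preserving segments concrete, here is a toy comparison of how the same two sentences segment before and after cleanup. The `naive_segments` function is a hypothetical stand-in for a CAT tool's segmenter, assuming that a hard line break always ends a segment and that sentences otherwise end at `.`, `!`, or `?`. Real tools use configurable rules that also handle abbreviations and the like, and the translation memory here is just a set of previously stored source sentences.

```python
import re


def naive_segments(text: str) -> list[str]:
    """Split at every line break, then at sentence-final punctuation followed by whitespace."""
    segments = []
    for line in text.splitlines():
        for piece in re.split(r"(?<=[.!?])\s+", line.strip()):
            if piece:
                segments.append(piece)
    return segments


# The same content, as extracted line by line and after reflowing.
raw = "The agreement is valid\nfor two years. Either party\nmay terminate it."
clean = "The agreement is valid for two years. Either party may terminate it."

# Hypothetical translation memory: source sentences stored by an earlier job.
memory = {"The agreement is valid for two years."}

for label, text in (("raw", raw), ("clean", clean)):
    segs = naive_segments(text)
    hits = sum(seg in memory for seg in segs)
    print(f"{label}: {len(segs)} segments, {hits} exact memory match(es)")
# raw: 4 segments, 0 exact memory match(es)
# clean: 2 segments, 1 exact memory match(es)
```

The raw text produces four fragments and no match, even though one of its sentences is already in the memory. The reflowed text produces two whole sentences and finds it.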
Get clean text out of a PDF — in your browser
ScoutMyTool’s PDF text tools extract and clean a PDF’s text client-side, so a confidential client document never leaves your computer — then import the clean, segmentable result into your CAT tool.