Why is a PDF such a difficult source format for translators?

Because a PDF describes how text looks on a page, not how it reads as language — and translation is about language. When you extract text from a PDF, you typically get it broken at every visual line, with words hyphenated across line ends, columns interleaved, and running headers and footers dropped into the middle of paragraphs. None of that matters to a human reader of the original, but it is poison for a translation workflow: a CAT (computer-assisted translation) tool segments text into sentences, and if your source is already shattered into line-fragments, the segmentation is wrong, your translation-memory matches collapse, and you spend the job fighting the file instead of translating. A PDF is genuinely one of the worst source formats a translator can be handed, and the first real task is almost always getting clean, properly segmented text out of it.

What does "preserving segments" actually mean?

A segment is the unit a CAT tool works in — usually a sentence. Preserving segments means that when you extract a PDF’s text, each sentence comes through as one continuous unit rather than being chopped into the visual lines it happened to occupy on the page. This matters for two reasons. First, accuracy: a translator works sentence by sentence, and a sentence split across five segments is five times the friction and a magnet for inconsistency. Second, reuse: translation memory stores and re-offers past translations at the segment level, so if your segments are line-fragments rather than sentences, you get almost no useful matches now and contaminate the memory for every future job. Clean segmentation is therefore not cosmetic — it is what makes the whole CAT workflow function.
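To make the difference concrete, here is a minimal sketch of what happens to one sentence when line breaks survive extraction. The `segment` function is a hypothetical, simplified stand-in for a CAT tool's segmenter, not any real tool's rule set: it assumes every line break is a hard boundary and that sentences end at `.`, `!` or `?` followed by whitespace, so abbreviations such as "e.g." would be split wrongly.

```python
import re

# Assumption: sentence ends are ., ! or ? followed by whitespace.
SENTENCE_END = re.compile(r"(?<=[.!?])\s+")


def segment(text: str) -> list[str]:
    """Split text into segments, treating each line break as a hard boundary."""
    segments = []
    for line in text.splitlines():
        line = line.strip()
        if line:
            segments.extend(s for s in SENTENCE_END.split(line) if s)
    return segments


# The same two sentences: once as extracted line by line, once reflowed.
raw = (
    "The supplier shall deliver the goods\n"
    "within thirty days of the order\n"
    "date. Late delivery incurs a penalty."
)
reflowed = (
    "The supplier shall deliver the goods within thirty days of the order date.\n"
    "Late delivery incurs a penalty."
)

print(len(segment(raw)))       # 4 fragments, none of them a full first sentence
print(len(segment(reflowed)))  # 2 segments, one per sentence
```

None of the four fragments from the raw text would match a translation-memory entry stored for the full sentence, which is the reuse problem described above.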

How do I get clean, segmentable text out of a PDF?

Extract to text, then clean before you import into your CAT tool. The key cleanup steps are: reflow the text so paragraphs are continuous and line breaks fall on real sentence boundaries rather than at every visual line; de-hyphenate words that were split across line ends; remove repeating headers, footers, and page numbers; and, for multi-column layouts, make sure the text is read column by column in the correct order rather than across the columns. If the PDF is a scan with no real text, you must run OCR first to create a text layer, then do the same cleanup — and check OCR output carefully, since misrecognised characters are common. Only once the text reads as clean, continuous prose should you bring it into the CAT tool to be segmented.
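The reflow and de-hyphenation steps can be sketched in a few lines. This is an illustration under stated assumptions rather than a finished cleaner: the function names are made up for this example, blank lines are assumed to mark paragraph breaks, and a hyphen at a line end is assumed to be a layout hyphen whenever a lowercase letter sits on both sides. A genuine compound that happens to wrap at its hyphen ("well-" / "known") would be joined wrongly, so the output still needs a read-through.

```python
import re


def dehyphenate(text: str) -> str:
    """Rejoin words split by a hyphen at a line end: 'trans-\\nlation' -> 'translation'."""
    return re.sub(r"(?<=[a-z])-\n(?=[a-z])", "", text)


def reflow(text: str) -> str:
    """Turn single line breaks into spaces, keeping blank lines as paragraph breaks."""
    paragraphs = re.split(r"\n\s*\n", text)
    return "\n\n".join(" ".join(p.split()) for p in paragraphs if p.strip())


def clean(text: str) -> str:
    # De-hyphenate first, while the line breaks that identify split words still exist.
    return reflow(dehyphenate(text))


extracted = (
    "The trans-\n"
    "lation memory stores\n"
    "past segments.\n"
    "\n"
    "A second para-\n"
    "graph follows."
)
print(clean(extracted))
```

The order matters: once the text has been reflowed, a line-end hyphen is indistinguishable from a hyphen followed by a space, so the split words must be rejoined before the line breaks are removed.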

Should I translate directly in the PDF?

Almost never. PDFs are fixed-layout, so editing text directly inside one is fiddly, and translated text rarely fits the original’s space — many languages expand or contract relative to the source, so the layout breaks the moment you replace the words. The professional workflow is to get the text out of the PDF, translate it in a proper CAT environment with your termbase and translation memory, and then deliver in whatever format the client needs — often the original editable source (Word, etc.) rather than the PDF. Treat the PDF as a read-only source to extract from, not a workspace to translate in. The exception is a tiny job — a few lines on a certificate — where the overhead of extraction is not worth it.

How do I quote accurately from a PDF source?

Clean the text before you count it. Translators usually quote by source word count, and a raw PDF extraction is full of junk (repeated headers and footers, page numbers, broken fragments) that inflates the count and overstates the quote. The right order is to extract, clean (strip page furniture, reflow, de-hyphenate), and only then run the word count on the genuine translatable text. For scanned PDFs, run OCR first or you will be counting nothing at all. An accurate count protects both sides and avoids the awkward mid-project conversation about scope, so it is worth the few extra minutes of cleanup before you send the estimate.
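A small worked example shows how far a raw count can drift. The sketch below assumes you have already identified the running header yourself and pass it in as `furniture`; the page-number pattern and the function name are illustrative choices, not a standard. Note that the pattern drops any line consisting only of a number, so a table cell such as "42" on its own line would be lost.

```python
import re

# Assumption: page numbers look like "3", "3/12", "Page 3" or "Page 3 of 12".
PAGE_NUMBER = re.compile(r"^\s*(?:page\s+)?\d+(?:\s*(?:/|of)\s*\d+)?\s*$", re.IGNORECASE)


def translatable_word_count(raw: str, furniture: set[str]) -> int:
    """Count words after dropping known page furniture and rejoining split words."""
    kept = []
    for line in raw.splitlines():
        stripped = line.strip()
        if stripped in furniture or PAGE_NUMBER.match(stripped):
            continue
        kept.append(stripped)
    # Rejoin words hyphenated across a line end so they count once, not twice.
    text = re.sub(r"(?<=[a-z])-\n(?=[a-z])", "", "\n".join(kept))
    return len(text.split())


raw = (
    "ACME Supply Agreement\n"
    "The supplier shall deliver the goods within thir-\n"
    "ty days of the order date.\n"
    "Page 1 of 2\n"
    "ACME Supply Agreement\n"
    "Late delivery incurs a penalty.\n"
    "Page 2 of 2\n"
)

print(len(raw.split()))                                         # 33 words in the raw extraction
print(translatable_word_count(raw, {"ACME Supply Agreement"}))  # 18 words to translate
```

On this tiny sample the raw figure is almost double the real one; on a long document with a header and footer on every page the absolute difference is what ends up on the invoice.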

Is it safe to process a client’s confidential PDF online?

Only if the tool runs on your own device. Translators routinely handle confidential and even legally privileged material — contracts, patents, medical records, personal data — and many online PDF tools upload the file to a third-party server to process it, which can breach the confidentiality clause in your client agreement and data-protection rules. Client-side (in-browser) tools extract, clean, and convert locally so the file never leaves your computer — ScoutMyTool’s PDF tools work this way. For any client document, confirm the tool is client-side before uploading, or use offline software, and treat the source file with the same confidentiality you owe the translation itself.

PDF for translators: preserve segments for CAT tools

By ScoutMyTool Editorial Team · Last updated: 2026-05-21

When a client sends me a PDF to translate, my heart sinks a little, because I know the first hour will go not to translating but to fighting the file. Drop a raw PDF extraction into a CAT tool and you watch single sentences shatter into five or six broken segments, words split by stray hyphens, and page headers landing in the middle of paragraphs. Every one of those breaks wrecks the sentence-level matching that translation memory depends on, so your reuse collapses and your consistency suffers. The fix is not a better CAT tool — it is preparing the PDF properly first. This guide is how I get clean, correctly segmented text out of a PDF so the CAT tool can do its job, and how I hand the translation back in the format the client actually wants.

What goes wrong — and why

Problem	Cause	Fix
One sentence split across many segments	Hard line breaks at every visual line	Reflow text so segments break on sentences, not lines
Words broken by hyphens	End-of-line hyphenation in the layout	Rejoin hyphen-split words before import
Columns interleaved into nonsense	Reading order follows layout, not logic	Extract column by column in correct reading order
Headers/footers mixed into the text	Running heads repeat on every page	Strip repeating page furniture before segmenting
No text at all to extract	The PDF is a scan (images only)	Run OCR first, then clean and segment
Word count wrong for the quote	Junk text inflates the count	Clean first, then count for an accurate quote
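The "strip repeating page furniture" fix in the table can be partly automated when you have the text page by page. The sketch below is one possible heuristic, not a definitive method: it assumes a line is furniture if the identical line appears on at least 60% of pages (and on at least two), and the function names and the `min_share` parameter are invented for this example. It will not catch page numbers, which differ from page to page, and it would wrongly remove a genuine sentence that repeats on most pages.

```python
import math
from collections import Counter


def find_page_furniture(pages: list[str], min_share: float = 0.6) -> set[str]:
    """Return lines that appear, identically, on at least min_share of the pages."""
    counts: Counter[str] = Counter()
    for page in pages:
        # Use a set so a line repeated within one page is counted once for that page.
        counts.update({line.strip() for line in page.splitlines() if line.strip()})
    threshold = max(2, math.ceil(len(pages) * min_share))
    return {line for line, n in counts.items() if n >= threshold}


def strip_page_furniture(pages: list[str]) -> list[str]:
    """Remove the detected furniture lines from every page."""
    furniture = find_page_furniture(pages)
    return [
        "\n".join(line for line in page.splitlines() if line.strip() not in furniture)
        for page in pages
    ]


pages = [
    "Annual Report 2025\nRevenue grew in the first quarter.\nConfidential",
    "Annual Report 2025\nCosts fell in the second quarter.\nConfidential",
    "Annual Report 2025\nOutlook remains stable.\nConfidential",
]
for page in strip_page_furniture(pages):
    print(page)
```

It is worth printing the detected set and checking it by eye before stripping anything, since a wrong removal silently deletes source text.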

Step by step — prep a PDF for translation

Check whether there is real text. If you cannot select the text, the PDF is a scan — run OCR first to create a text layer, and review the output for misread characters before going further.
Extract to plain text. Pull the text out of the PDF, reading multi-column layouts column by column in the correct order rather than straight across the page.
Reflow to sentence boundaries. Join the visual lines back into continuous paragraphs so segments will break on sentences, not on wherever a line happened to wrap.
De-hyphenate and strip page furniture. Rejoin words split by end-of-line hyphens, and remove repeating headers, footers, and page numbers that would otherwise pollute your segments.
Count for the quote. Run your source word count on the cleaned text, so the estimate reflects genuine translatable content, not extraction junk.
Import, translate, and deliver in the client’s format. Bring the clean text into your CAT tool with your termbase and memory, translate, then deliver in the format the client needs, usually an editable source file rather than the PDF.
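The column-by-column reading order from the extraction step is the part that is hardest to picture, so here is a minimal sketch of the idea. It assumes your extraction tool can give you text blocks with positions; the `Block` type, its fields, and `column_order` are hypothetical names for this illustration. It also assumes equal-width columns, a `top` coordinate measured downward from the top of the page, and no blocks that span the full page width (a full-width heading would be treated as part of the first column).

```python
from dataclasses import dataclass


@dataclass
class Block:
    x0: float   # left edge of the block
    top: float  # distance from the top of the page (assumed to grow downward)
    text: str


def column_order(blocks: list[Block], page_width: float, columns: int = 2) -> list[Block]:
    """Sort blocks column by column, then top to bottom within each column."""
    col_width = page_width / columns

    def key(block: Block) -> tuple[int, float]:
        # Clamp so a block starting at the far right edge stays in the last column.
        col = min(int(block.x0 // col_width), columns - 1)
        return (col, block.top)

    return sorted(blocks, key=key)


# Blocks as a straight-across extraction would deliver them: left, right, left, right.
blocks = [
    Block(50, 100, "Left column, first paragraph."),
    Block(320, 100, "Right column, first paragraph."),
    Block(50, 200, "Left column, second paragraph."),
    Block(320, 200, "Right column, second paragraph."),
]
for block in column_order(blocks, page_width=600):
    print(block.text)
```

Sorting by column first and vertical position second is what turns the interleaved left, right, left, right sequence back into the order a reader follows.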

The honest caveat

None of this turns a PDF into a good source format — it just makes a bad one workable. The cleanest possible outcome is to ask the client for the original editable file (the Word document, the InDesign export, the source the PDF was made from), because that sidesteps extraction entirely and usually carries structure your CAT tool can use directly. It is always worth asking before you start. When the PDF really is all that exists — and for many legal, certified, and archival documents it is — then the prep workflow above is the difference between a job that flows and a job that fights you the whole way. Get the segments right at the start and everything downstream, from memory reuse to final consistency, falls into place.

