PDF to text (.txt) converter — preserve line breaks

By ScoutMyTool Editorial Team · Last updated: 2026-05-20

After working with hundreds of users on PDF-extraction workflows, we keep hearing the same frustration with "PDF to text" tools: the output technically contains every word from the PDF, but it is unreadable. Mid-sentence line breaks because the PDF wrapped at the end of each visual line. Stray blank lines between paragraphs. Footnotes inlined into the body text at the bottom of every page. Two-column papers emitted with the left and right columns interleaved line by line. The text is "extracted" in a strict sense, but you would not paste it into a document. Below is the workflow that produces text you can actually paste somewhere.

Step-by-step: convert a PDF to clean plain text

The tool lives at scoutmytool.com/pdf/pdf-to-text. It runs client-side via pdf.js: no upload, no signup, no quota.

1. Drop your PDF. One file at a time. The file loads into a sandboxed memory buffer; nothing is uploaded. For sensitive content, confirm this in the DevTools Network tab.
2. Pick output style. Two main options:
   - Paragraph-aware (default): joins within-paragraph lines into single lines using a line-gap heuristic; produces flowing prose suitable for pasting into a document or feeding into an NLP tool.
   - Preserve visual line breaks: emits one output line per PDF line, no joining; produces output that mirrors the PDF's visual layout, line for line.
3. Multi-column override. Off by default, which leaves column detection on auto. The heuristic looks at horizontal-position clusters across all text on the page and decides whether the page has 1, 2, or 3 columns. If you know the PDF is multi-column and the auto-detect is wrong, force the column count manually.
4. Page range. Default is "all pages". To extract only part of the PDF, type a range like 5-12 or 3, 8, 12-15.
5. Encoding option. Default is UTF-8 with BOM, for Windows Notepad compatibility. Toggle "No BOM" for Unix tool chains that prefer raw UTF-8.
6. Click Convert. The tool walks every page, extracts the text content via pdf.js, runs the paragraph-detection heuristic if enabled, and writes the result to a .txt file. The output downloads automatically.

If your PDF is scanned (image-only). The tool will tell you "no extractable text found on N pages". OCR the file first via Make PDF Searchable, then re-run PDF-to-text.
If the PDF is password-protected. Unlock first via Unlock PDF.
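If you script around the page-range field from the steps above, a range such as 5-12 or 3, 8, 12-15 can be expanded into page numbers along these lines. This is a minimal sketch, not the tool's own parser: the function name `parse_page_ranges` and the choice to reject out-of-bounds ranges are assumptions made for illustration.

```python
def parse_page_ranges(spec: str, page_count: int) -> list[int]:
    """Expand a spec like "3, 8, 12-15" into sorted 1-based page numbers.

    Hypothetical helper: the validation rules (reject reversed or
    out-of-bounds ranges) are an assumption, not documented tool behaviour.
    """
    pages: set[int] = set()
    for part in spec.split(","):
        part = part.strip()
        if not part:
            continue  # tolerate a trailing comma
        if "-" in part:
            start_text, end_text = part.split("-", 1)
            start, end = int(start_text), int(end_text)
        else:
            start = end = int(part)
        if start < 1 or end > page_count or start > end:
            raise ValueError(f"invalid page range: {part!r}")
        pages.update(range(start, end + 1))
    return sorted(pages)


print(parse_page_ranges("3, 8, 12-15", page_count=20))  # [3, 8, 12, 13, 14, 15]
```

Using a set means overlapping entries such as "5-12, 10-14" yield each page once.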

How ScoutMyTool compares to Smallpdf, iLovePDF and PDF2Go

All four offer PDF-to-text conversion. The meaningful differences are around line-handling quality (paragraph-aware vs visual-line), multi-column support, and whether the file leaves your device.

Feature	ScoutMyTool	Smallpdf	iLovePDF	PDF2Go
Free unlimited	Yes	2 per day on free	1 file per task on free	Yes, up to 100 MB
No signup	Yes	Required after 2 tasks	Required for >50 MB	Yes
Paragraph-aware line joining	Yes (default)	No (visual lines only)	Limited	No (visual lines only)
Multi-column reading order	Yes (column-detect heuristic)	No	Limited	No
UTF-8 output (with optional BOM)	Yes	UTF-8	UTF-8	UTF-8
Files leave your device	No (client-side)	Yes (uploaded)	Yes (uploaded)	Yes (uploaded)
Speed (50-page text PDF)	< 5 s on a modern laptop	~8–15 s (incl. upload)	~10–20 s (incl. upload)	~12–25 s (incl. upload)

Line-handling and multi-column claims were verified by running the same two test PDFs (one academic two-column paper, one single-column report) through each tool and diff-checking the outputs against the cleaned source text.

Why PDFs are structurally bad at storing text — and what the heuristics do about it

The PDF imaging model (ISO 32000-1 §9, "Text") describes text as a sequence of positioned glyph-run operators: put glyph X at position (a, b), put glyph Y at position (c, d), and so on¹. There is no "paragraph" concept in the format. There is not even a guaranteed reading order — the operators can appear in any order in the content stream, with positioning establishing the visual layout. A converter that wants to emit paragraph-shaped text has to infer the structure from positions.
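To make the "no guaranteed reading order" point concrete, here is a small sketch of positioned runs arriving in arbitrary stream order and being sorted back into visual order. The `Run` record and its fields are hypothetical stand-ins for whatever a parser reports; the only PDF fact relied on is that y grows upward in default user space.

```python
from typing import NamedTuple


class Run(NamedTuple):
    """Hypothetical record for one positioned glyph run."""
    text: str
    x: float  # left edge of the run
    y: float  # baseline; y grows upward in default PDF user space


def visual_order(runs: list[Run]) -> list[Run]:
    # Higher baselines come first (top of page), then left to right.
    return sorted(runs, key=lambda r: (-r.y, r.x))


# Content-stream order need not match reading order.
stream_order = [
    Run("world", 140.0, 700.0),
    Run("Second line", 72.0, 686.0),
    Run("Hello", 72.0, 700.0),
]
print([r.text for r in visual_order(stream_order)])
# ['Hello', 'world', 'Second line']
```

Sorting on exact baseline values only works when runs on one line share an identical y; real pages need a tolerance when grouping baselines.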

ScoutMyTool's paragraph-detection heuristic works in three passes: (1) cluster positioned glyphs into lines based on a vertical-gap threshold; (2) cluster lines into paragraphs based on a larger vertical-gap threshold; (3) within each paragraph, join lines into a single output line, inserting a space at the boundary unless the preceding line ends with a hyphen (in which case the hyphen is removed and the lines are joined without a space, reconstructing the hyphenated word). The thresholds adapt per-document based on the median line-height observed.
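The three passes can be sketched roughly as follows. This is an illustration under stated assumptions, not ScoutMyTool's actual code: each `Run` is assumed to be a whole word, the 2-point baseline tolerance and the 1.5x gap factor are invented values, and the median baseline-to-baseline gap stands in for the "median line-height" that the thresholds adapt to.

```python
from statistics import median
from typing import NamedTuple


class Run(NamedTuple):
    """Hypothetical positioned run; assumed here to hold one whole word."""
    text: str
    x: float
    y: float  # baseline; y grows upward in default PDF user space


def group_into_lines(runs, tolerance=2.0):
    """Pass 1: runs whose baselines are within `tolerance` share a line."""
    lines = []  # each entry: [baseline_y, [runs on that line]]
    for run in sorted(runs, key=lambda r: -r.y):
        if lines and abs(lines[-1][0] - run.y) <= tolerance:
            lines[-1][1].append(run)
        else:
            lines.append([run.y, [run]])
    return [
        (y, " ".join(r.text for r in sorted(members, key=lambda r: r.x)))
        for y, members in lines
    ]


def group_into_paragraphs(lines, gap_factor=1.5):
    """Pass 2: a gap above gap_factor x the median gap starts a paragraph."""
    if not lines:
        return []
    gaps = [upper[0] - lower[0] for upper, lower in zip(lines, lines[1:])]
    typical_gap = median(gaps) if gaps else 0.0
    paragraphs = [[lines[0][1]]]
    for gap, (_, text) in zip(gaps, lines[1:]):
        if gap > gap_factor * typical_gap:
            paragraphs.append([text])
        else:
            paragraphs[-1].append(text)
    return paragraphs


def join_paragraph(lines):
    """Pass 3: join lines with a space, or de-hyphenate at a trailing hyphen."""
    joined = lines[0]
    for line in lines[1:]:
        if joined.endswith("-"):
            joined = joined[:-1] + line  # drop hyphen, no space
        else:
            joined = joined + " " + line
    return joined


def extract_paragraphs(runs):
    lines = group_into_lines(runs)
    return [join_paragraph(p) for p in group_into_paragraphs(lines)]


runs = [
    Run("PDF", 72.0, 700.0), Run("stores", 100.0, 700.0), Run("posi-", 140.0, 700.0),
    Run("tioned", 72.0, 688.0), Run("glyphs.", 115.0, 688.0),
    Run("Next", 72.0, 664.0), Run("paragraph.", 100.0, 664.0),
    Run("More", 72.0, 652.0), Run("text.", 100.0, 652.0),
]
print(extract_paragraphs(runs))
# ['PDF stores positioned glyphs.', 'Next paragraph. More text.']
```

One consequence of the hyphen rule as described: a genuine compound that happens to break at its hyphen (for example "well-" followed by "known") is also joined without the hyphen, so the rule trades a few such cases for correctly rebuilt wrapped words.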

For dense academic PDFs, this typically produces text that reads close to the prose of the original source document, minus the formatting. For scanned-then-OCR'd PDFs, results vary with OCR quality, but paragraph joining still helps because OCR engines such as Tesseract emit per-line text with the same mid-paragraph-break problem as native PDF.

Related PDF tools on ScoutMyTool

PDF to Text — the tool this guide is about.
PDF to Word — if you need editable text WITH formatting (.docx output rather than plain .txt).
Make PDF Searchable (OCR) — required first step for scanned PDFs.
PDF to Excel — table-aware extraction when the content is tabular.
Extract Images — for the photographs and charts embedded in the same PDF.
PDF Editor — for ground-truth manual extraction on tricky layouts.
Unlock PDF — required first if the source is password-protected.

Frequently asked questions

Why do most PDF-to-text tools mangle line breaks?

PDF stores text as positioned glyph runs, not as paragraphs with explicit line endings. When you read a PDF, your eyes group glyphs into lines and lines into paragraphs based on visual layout cues (vertical gap, horizontal alignment, indentation). Most converters skip that step and emit one glyph run per output line, which produces text that mirrors the visual lines of the PDF, including all the mid-sentence line breaks that the PDF's line wrapping forced. ScoutMyTool runs a heuristic that detects paragraph boundaries from line-gap statistics, then joins within-paragraph lines into single lines (preserving inter-word spaces), so the output reads like flowing prose rather than fixed-width terminal output.

Can I keep the visual line breaks instead?

Yes: toggle "Preserve visual line breaks". This emits one output line per PDF line, with no paragraph-joining heuristic. Use it when you need to feed the output into a downstream tool that processes lines independently (e.g. a script that reads a fixed-format report line by line). Use the default paragraph-aware mode when you just want readable text, and for downstream NLP or search use cases.

Does the tool extract text from scanned PDFs (image-only pages)?

No, not directly. PDF-to-text only extracts text that is present as text in the PDF, that is, text drawn with the PDF text operators. Scanned PDFs are images of text; the bytes of the PDF do not contain any text strings to extract. For scanned PDFs, you first need OCR to turn the pictured text into real PDF text. Run the PDF through Make PDF Searchable (OCR) and then come back to PDF-to-text. The two-tool workflow keeps OCR (slow, fallible) separate from extraction (fast, deterministic).
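If you are scripting your own check, spotting image-only pages comes down to noticing pages whose extracted text is empty. A minimal sketch, assuming you already have one extracted string per page; the function names are hypothetical and the message simply mirrors the wording the tool shows.

```python
def pages_without_text(page_texts: list[str]) -> list[int]:
    """Return 1-based numbers of pages whose extracted text is blank."""
    return [
        number
        for number, text in enumerate(page_texts, start=1)
        if not text.strip()
    ]


def scan_warning(page_texts: list[str]):
    """Build a warning string, or None when every page has text."""
    empty = pages_without_text(page_texts)
    if not empty:
        return None
    return f"no extractable text found on {len(empty)} pages"


print(pages_without_text(["Intro text", "", "  \n"]))  # [2, 3]
```

A page that is entirely blank in the original also shows up here, so treat the result as a prompt to inspect the page rather than proof that it is a scan.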

What encoding does the output .txt file use?

UTF-8 with a BOM, by default. UTF-8 is the only sensible encoding for arbitrary PDF content because PDFs can contain characters from any script (Greek symbols in scientific papers, Cyrillic in Russian sources, CJK characters in Asian documents, em-dashes and curly quotes everywhere). The BOM is added because older versions of Windows Notepad, and some other Windows programs, can misidentify BOM-less UTF-8 as a legacy "ANSI" code page, which corrupts non-ASCII characters. If you need raw UTF-8 without a BOM (for piping into a Unix tool that does not like BOMs), toggle "No BOM" before downloading.
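For anyone post-processing the download, the difference between the two settings is three bytes at the start of the file. The sketch below uses Python's standard `utf-8-sig` codec to show both directions; the helper name `encode_txt` is made up for the example.

```python
import codecs


def encode_txt(text: str, with_bom: bool = True) -> bytes:
    """Encode text as UTF-8, optionally prefixed with the UTF-8 BOM."""
    return text.encode("utf-8-sig" if with_bom else "utf-8")


sample = "naïve café"
with_bom = encode_txt(sample)
without_bom = encode_txt(sample, with_bom=False)

# The BOM is the byte sequence EF BB BF.
print(with_bom[:3] == codecs.BOM_UTF8)              # True
print(len(with_bom) - len(without_bom))             # 3

# Decoding with "utf-8-sig" strips the BOM if present and is harmless if not.
print(with_bom.decode("utf-8-sig") == sample)       # True
print(without_bom.decode("utf-8-sig") == sample)    # True
```

Reading the file with `utf-8-sig` on the consuming side is often simpler than stripping the BOM by hand, because it accepts both variants.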

Is my PDF uploaded to your servers?

No. Text extraction runs entirely in your browser using pdf.js, Mozilla's open-source PDF parser. The PDF is loaded into a sandboxed memory buffer, pdf.js walks the page content streams and assembles the text, and the resulting plain-text string is delivered as a downloadable .txt file. You can verify this in the DevTools Network tab: there are no outbound requests during the operation. This matters when the PDF contains sensitive content (contracts, medical records, internal memos) that a cloud-based tool would otherwise have to receive.

Does the tool handle multi-column layouts (newspaper / academic-paper style)?

Yes, to a reasonable approximation. The heuristic detects multi-column layouts by looking at horizontal text-position clusters across the page and emits text in column-reading order (top-to-bottom within column 1, then top-to-bottom within column 2, and so on). Accuracy is good for standard two- and three-column academic layouts; complex magazine-style layouts with sidebars and pull-quotes may still produce some out-of-order text. For full control over ordering on a tricky layout, fall back to the "Preserve visual line breaks" mode and manually reorder the output.
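One simple way to picture column detection is to cluster the left edges of the lines on a page and start a new column wherever there is a large horizontal jump. The sketch below does exactly that. It is not the tool's heuristic: the `Line` record, the 100-point `min_gap`, and the single-pass clustering are all assumptions chosen to keep the example short.

```python
from typing import NamedTuple


class Line(NamedTuple):
    """Hypothetical record for one visual line of text."""
    text: str
    x: float  # left edge
    y: float  # baseline; y grows upward in default PDF user space


def column_starts(lines, min_gap=100.0):
    """Left edges that open a new column: a jump of more than min_gap."""
    xs = sorted({line.x for line in lines})
    starts = xs[:1]
    for previous, current in zip(xs, xs[1:]):
        if current - previous > min_gap:
            starts.append(current)
    return starts


def column_reading_order(lines, min_gap=100.0):
    """Order lines column by column, top to bottom within each column."""
    starts = column_starts(lines, min_gap)

    def column_index(line):
        # Right-most column whose start is at or left of this line.
        return max(i for i, start in enumerate(starts) if start <= line.x)

    return sorted(lines, key=lambda line: (column_index(line), -line.y))


page = [
    Line("L1", 72.0, 700.0), Line("R1", 320.0, 700.0),
    Line("L2 (indented)", 84.0, 688.0), Line("R2", 320.0, 688.0),
]
print([line.text for line in column_reading_order(page)])
# ['L1', 'L2 (indented)', 'R1', 'R2']
```

Small indents stay inside their column because they fall under `min_gap`. A full-width heading or a sidebar breaks this simple model, which matches the caveat about magazine-style layouts.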

How big can the PDF be?

No hard cap: extraction runs client-side, so the practical limit is the memory available to your browser. Memory cost is roughly 2–4 MB per million extracted characters; even a 10,000-page academic monograph extracts in well under a minute on a modern laptop. The output .txt file is typically 0.5–2% of the input PDF size, because text is small once the binary overhead of PDF is stripped away.
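To turn the 2–4 MB per million characters figure into a number for your own file, a back-of-the-envelope calculation is enough. In the sketch below, the 3,000 characters per page is an assumed density for illustration only; dense academic pages may run higher.

```python
def memory_estimate_mb(characters: int, mb_per_million=(2.0, 4.0)):
    """Return (low, high) memory estimates in MB for a character count."""
    millions = characters / 1_000_000
    low, high = mb_per_million
    return (millions * low, millions * high)


# Assumption: roughly 3,000 characters per page.
pages = 10_000
characters = pages * 3_000
print(memory_estimate_mb(characters))  # (60.0, 120.0)
```

Under that assumption, a 10,000-page document holds about 30 million characters and needs on the order of 60 to 120 MB.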

Convert your PDF to text now — paragraph-aware, no signup, no upload

Smart line-joining, multi-column detection, UTF-8 output. Runs entirely in your browser — your PDF never leaves your device.

Open the free PDF-to-Text tool at scoutmytool.com/pdf/pdf-to-text →