Why does a plain PDF-to-text conversion lose all my headings and links?

Because most PDFs store no logical structure. A typical file records characters at coordinates and draws shapes, with no tags saying "this is a heading" or "this is a list item." (Tagged PDFs do carry a structure tree, but many files are untagged, and a plain text extractor typically ignores the tags anyway.) A naive text extraction just streams the characters in reading order, so headings become ordinary lines and links become bare text. Converting to good Markdown requires inferring structure that the PDF never declared: detecting headings from font size and weight, detecting lists from bullet glyphs and indentation, and reading links from the PDF’s link annotations, which are stored apart from the text they cover. That inference is the whole difficulty, and it is why a structure-aware converter beats a plain text dump for Markdown.
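To make the "characters at coordinates" point concrete, here is a minimal Python sketch of what a plain text dump does. The Fragment type and its fields are hypothetical stand-ins for whatever positioned text runs a PDF parser exposes; they are not the API of any particular library.

```python
from typing import NamedTuple


class Fragment(NamedTuple):
    """A hypothetical positioned text run (an assumption, not a real parser type)."""
    x: float     # left edge in points
    y: float     # baseline measured from the top of the page, in points
    size: float  # font size in points
    text: str


def naive_text_dump(fragments: list[Fragment], line_tolerance: float = 2.0) -> str:
    """Group fragments into lines by baseline and join them, ignoring font size."""
    lines: list[list[Fragment]] = []
    for frag in sorted(fragments, key=lambda f: (f.y, f.x)):
        # A fragment joins the current line if its baseline is close enough.
        if lines and abs(lines[-1][0].y - frag.y) <= line_tolerance:
            lines[-1].append(frag)
        else:
            lines.append([frag])
    return "\n".join(
        " ".join(f.text for f in sorted(line, key=lambda f: f.x)) for line in lines
    )


page = [
    Fragment(72, 100, 24.0, "Installation"),   # visually a heading (24 pt)
    Fragment(72, 140, 11.0, "Download the"),   # body text (11 pt)
    Fragment(150, 140, 11.0, "installer."),
]
print(naive_text_dump(page))
```

The 24 pt size that made "Installation" look like a heading is available in the input but never consulted, so the output is two ordinary lines. A structure-aware converter is one that keeps using that extra information.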

How are headings detected when converting to Markdown?

By visual cues, since there is usually no explicit tag. A converter measures the font size and weight of each text run relative to the body text: a large, bold line is likely an H1, a slightly smaller one an H2, and so on, mapping the visual hierarchy onto Markdown’s # levels. This works well for documents with consistent typographic styling and poorly for ones where headings are not visually distinct, for example headings set at body size and only bolded, or body text set in all caps. After conversion, skim the heading levels: promote or demote any the size heuristic misjudged, and add # marks where the cues were too flat to detect.
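The size heuristic can be sketched in a few lines of Python. This is an illustration under stated assumptions, not how any specific converter works: each line arrives as a (text, font_size) pair, the body size is taken to be the size that covers the most characters, and every distinct larger size becomes one heading level. Font weight is left out to keep the sketch short.

```python
from collections import Counter


def lines_to_markdown(lines: list[tuple[str, float]]) -> str:
    """Map lines of (text, font_size) onto Markdown, using size alone."""
    # Assumption: the body size is the one that covers the most characters.
    weights: Counter = Counter()
    for text, size in lines:
        weights[size] += len(text)
    body_size = weights.most_common(1)[0][0]

    # Each distinct size above body size is one heading level, largest first.
    heading_sizes = sorted({size for _, size in lines if size > body_size}, reverse=True)

    out = []
    for text, size in lines:
        if size in heading_sizes:
            level = min(heading_sizes.index(size) + 1, 6)  # Markdown stops at H6
            out.append("#" * level + " " + text)
        else:
            out.append(text)
    return "\n\n".join(out)


doc = [
    ("User Guide", 24.0),
    ("Getting started", 16.0),
    ("Install the package before running anything else.", 11.0),
    ("Then open the configuration file and set the path.", 11.0),
]
print(lines_to_markdown(doc))
```

The failure mode described above falls straight out of the code: if a heading is printed at body size, it never enters heading_sizes and comes through as a plain paragraph. Real font sizes may also need rounding before they are compared.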

Can hyperlinks survive the conversion to Markdown?

They can, if the PDF stored them as real link annotations and the converter reads them. A properly authored PDF keeps clickable links as annotations with a target URL, which a good converter turns into Markdown [text](url) syntax. The catch: many PDFs print a URL as visible text without an underlying annotation, or lose the annotation in earlier processing — in those cases there is nothing structured to recover and you get the URL as plain text. Check links after conversion, and where the link target is missing, re-add it from the source. Links inside images are not recoverable at all without OCR plus manual work.
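As a rough Python sketch of the annotation step: a link annotation is a rectangle on the page plus a target URL, so a converter has to find the text that sits inside that rectangle and wrap it. The Span and LinkAnnotation types below are hypothetical, and the "centre of the span falls inside the rectangle" rule is one simple matching choice among several.

```python
from typing import NamedTuple, Optional


class Span(NamedTuple):
    """A hypothetical text run with its bounding box, in page points."""
    text: str
    x0: float
    y0: float
    x1: float
    y1: float


class LinkAnnotation(NamedTuple):
    """A hypothetical link annotation: a clickable rectangle and its target."""
    uri: str
    x0: float
    y0: float
    x1: float
    y1: float


def center_inside(span: Span, link: LinkAnnotation) -> bool:
    """True if the centre point of the span lies within the link rectangle."""
    cx = (span.x0 + span.x1) / 2
    cy = (span.y0 + span.y1) / 2
    return link.x0 <= cx <= link.x1 and link.y0 <= cy <= link.y1


def find_link(span: Span, links: list[LinkAnnotation]) -> Optional[LinkAnnotation]:
    return next((link for link in links if center_inside(span, link)), None)


def render_line(spans: list[Span], links: list[LinkAnnotation]) -> str:
    """Render one line of spans, wrapping runs covered by a link as [text](url)."""
    ordered = sorted(spans, key=lambda s: s.x0)
    parts = []
    i = 0
    while i < len(ordered):
        link = find_link(ordered[i], links)
        if link is None:
            parts.append(ordered[i].text)
            i += 1
            continue
        covered = []
        # Collect every consecutive span that the same annotation covers.
        while i < len(ordered) and center_inside(ordered[i], link):
            covered.append(ordered[i].text)
            i += 1
        parts.append(f"[{' '.join(covered)}]({link.uri})")
    return " ".join(parts)


spans = [
    Span("See the", 72, 100, 110, 112),
    Span("release", 114, 100, 150, 112),
    Span("notes", 154, 100, 180, 112),
    Span("for details.", 184, 100, 240, 112),
]
links = [LinkAnnotation("https://example.com/notes", 112, 98, 182, 114)]
print(render_line(spans, links))
```

With an empty links list the same line comes out as plain text, which is exactly what happens when a PDF prints a URL without an underlying annotation.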

Why do tables come out broken in Markdown?

Because Markdown tables are simple grids, while PDF tables are just text positioned to look like a grid with no underlying cell structure. A converter has to infer columns from horizontal text positions and rows from vertical alignment, then map that onto Markdown pipe syntax — and any merged cells, wrapped text, or irregular spacing breaks the inference. For simple, regular tables it often works; for complex ones, expect to rebuild the table by hand or extract it separately to CSV and convert that. If tables are central to your document, treat them as a separate extraction job rather than expecting the Markdown pass to nail them.
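The column and row inference can be sketched like this in Python. It is a simplified illustration that assumes left-aligned cells given as hypothetical (x, y, text) tuples; the gap thresholds are arbitrary values chosen for the example, and merged or wrapped cells are not handled at all.

```python
def cluster(values: list[float], gap: float) -> list[float]:
    """Group nearby values and return the smallest value of each group."""
    groups: list[list[float]] = []
    for value in sorted(values):
        if not groups or value - groups[-1][-1] > gap:
            groups.append([value])
        else:
            groups[-1].append(value)
    return [group[0] for group in groups]


def nearest_index(anchors: list[float], value: float) -> int:
    """Index of the anchor closest to value."""
    return min(range(len(anchors)), key=lambda i: abs(anchors[i] - value))


def cells_to_pipe_table(
    cells: list[tuple[float, float, str]],
    col_gap: float = 20.0,
    row_gap: float = 5.0,
) -> str:
    """Infer a grid from (x, y, text) cells and emit a Markdown pipe table."""
    cols = cluster([x for x, _, _ in cells], col_gap)
    rows = cluster([y for _, y, _ in cells], row_gap)
    grid = [["" for _ in cols] for _ in rows]
    for x, y, text in sorted(cells, key=lambda c: (c[1], c[0])):
        r, c = nearest_index(rows, y), nearest_index(cols, x)
        # Fragments that land in the same cell are joined with a space.
        grid[r][c] = (grid[r][c] + " " + text).strip()
    lines = ["| " + " | ".join(row) + " |" for row in grid]
    # The first row is treated as the header, so the separator goes after it.
    lines.insert(1, "| " + " | ".join("---" for _ in cols) + " |")
    return "\n".join(lines)


cells = [
    (72, 100, "Element"), (200, 100, "Survives?"),
    (72, 120, "Headings"), (200, 120, "Often"),
    (72, 140, "Tables"), (201, 140, "Rarely"),
]
print(cells_to_pipe_table(cells))
```

The sketch copes with the one-point wobble in the last row but nothing worse. A wrapped cell adds an extra baseline and therefore an extra row, and a merged cell has no column of its own to land in, which is why complex tables are better treated as a separate extraction job.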

What is the cleanest workflow for converting a PDF to Markdown?

Convert with a structure-aware tool, then do a short cleanup pass. First run the PDF through a converter that detects headings, lists, and links rather than a plain text extractor. Then review the output: check that heading levels match the document’s hierarchy, that lists are consistent, that code blocks are fenced, and that links resolve — fixing the handful of elements the heuristics missed. For documents you will reuse (documentation, notes, content migration), this convert-then-clean approach gets you usable Markdown far faster than retyping, and the cleanup shrinks as you learn which elements your source documents get right.

Is it safe to convert a confidential PDF to Markdown online?

Only if the conversion runs on your own device. Server-side converters upload your file to a remote machine, so a confidential document leaves your control and may be cached. Client-side (in-browser) tools do the parsing locally so the file never leaves your computer; ScoutMyTool’s PDF tools work this way. For sensitive material, confirm the converter is client-side before using it, or run an offline command-line pipeline. Note that Pandoc itself does not read PDF as an input format, so a local Pandoc-based pipeline needs a separate offline extraction step first, with Pandoc handling the conversion of the extracted result to Markdown.

How to convert PDF to Markdown…


By ScoutMyTool Editorial Team · Last updated: 2026-05-21

I tried to move a 60-page product manual from PDF into our docs site by copy-pasting, and what I got was a wall of undifferentiated text — every heading flattened, every link reduced to bare words, the careful structure gone. The realisation was that a PDF does not actually contain headings or links as structure; it contains characters drawn at positions, and "convert to Markdown" really means infer the structure the PDF never stored. Once I understood that, I knew what to ask of a converter and what I would always have to fix by hand. This guide explains the structure-detection problem, what survives the trip and what does not, and the convert-then-clean workflow that gets you usable Markdown.

What survives, and how it is detected

Element	Survives?	How it is detected	Fix if wrong
Headings (H1–H4)	Often	Font size / weight vs body text	Add # by hand where size cues were flat
Hyperlinks	Sometimes	PDF link annotations	Re-add [text](url) from the source
Bulleted lists	Usually	Leading bullet glyph + indent	Normalise stray bullets to -
Numbered lists	Usually	Leading 1. 2. pattern	Fix restarted numbering
Code blocks	Sometimes	Monospace font run	Wrap in ``` fences
Tables	Rarely cleanly	Column-position clustering	Rebuild as Markdown pipe table
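Some of the fixes in the last column are mechanical enough to script. As one example, here is a small Python sketch of the "normalise stray bullets" fix. The set of bullet glyphs is an assumption based on common cases; extend it to match whatever your own PDFs produce.

```python
import re

# Common bullet glyphs: bullet, white bullet, small black square,
# triangular bullet, middle dot. This set is an assumption, not exhaustive.
BULLET = re.compile(r"^(\s*)[\u2022\u25E6\u25AA\u2023\u00B7]\s+")


def normalise_bullets(markdown: str) -> str:
    """Replace a leading bullet glyph with '- ', keeping the indentation."""
    return "\n".join(BULLET.sub(r"\1- ", line) for line in markdown.split("\n"))


converted = "\u2022 First item\n  \u25E6 Nested item\nA plain paragraph."
print(normalise_bullets(converted))
```

Only a glyph at the start of a line is touched, so a middle dot used inside a sentence is left alone.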

Step by step — PDF to clean Markdown

1. Confirm the PDF has selectable text. Try to select a sentence. If it highlights, structure detection can work. If not, it is a scan: run OCR first to create a text layer, accepting that detected structure will be weaker on OCR output.
2. Use a structure-aware converter, not plain text export. Choose a tool that detects headings from font cues, lists from bullet and number patterns, and links from PDF annotations, not one that just dumps characters in reading order.
3. Review the heading hierarchy. Skim the # levels against the document’s real structure; promote or demote any the size heuristic misjudged, and add headings where the visual cues were too flat to detect.
4. Check links and code. Verify [text](url) links resolve and re-add any that came through as bare text; fence code blocks in triple backticks where the monospace run was not detected.
5. Rebuild tables if they matter. If the document has important tables, expect to rebuild them as Markdown pipe tables or extract them separately to CSV, because the Markdown pass rarely nails complex tables.
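Part of the review in the steps above can be turned into a quick check. The Python sketch below is a hypothetical helper, not a feature of any converter: it flags heading levels that skip a step, URLs that came through as bare text, and a code fence that was opened but never closed. It does not verify that links resolve.

```python
import re

HEADING = re.compile(r"^(#{1,6})\s+\S")
# A URL not directly preceded by "(", "<" or "[" is treated as bare text.
BARE_URL = re.compile(r"(?<![(<\[])https?://\S+")
FENCE = "`" * 3  # three backticks, built this way to keep the example readable


def review_markdown(markdown: str) -> list[str]:
    """Return human-readable findings for a converted Markdown document."""
    findings = []
    previous_level = 0
    in_fence = False
    for number, line in enumerate(markdown.split("\n"), start=1):
        if line.lstrip().startswith(FENCE):
            in_fence = not in_fence
            continue
        if in_fence:
            continue  # do not inspect the contents of code blocks
        heading = HEADING.match(line)
        if heading:
            level = len(heading.group(1))
            if level > previous_level + 1:
                findings.append(
                    f"line {number}: heading level jumps from {previous_level} to {level}"
                )
            previous_level = level
        if BARE_URL.search(line):
            findings.append(f"line {number}: bare URL, consider [text](url)")
    if in_fence:
        findings.append("code fence opened but never closed")
    return findings


sample = (
    "# Guide\n\n"
    "### Setup\n\n"
    "See https://example.com/docs for details.\n\n"
    "Read the [manual](https://example.com/manual).\n"
)
for finding in review_markdown(sample):
    print(finding)
```

A check like this only points at likely problems. Deciding whether "Setup" should be an H2, or what text the bare URL belongs to, is still the manual part of the cleanup pass.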

Why convert-then-clean beats expecting perfection

Because structure is inferred rather than read, no converter is perfect, and chasing a one-click flawless result wastes more time than a short cleanup pass. The realistic mental model is: the converter gets you 80–95% of the way — correct paragraphs, most headings, most lists, the links that were stored as annotations — and you spend a few minutes fixing the specific elements your source document gets wrong. The better authored the source PDF (consistent heading styles, real link annotations), the less you fix. Treat the output as a strong draft to polish, not a finished artifact, and PDF-to-Markdown becomes a fast, repeatable step instead of a frustrating one.


Extract the text to start your Markdown

ScoutMyTool PDF to Text pulls the words out client-side — nothing uploaded — giving you a clean base to mark up. Then add the heading and link structure the PDF never stored.


How to convert PDF to Markdown (preserve headings + links)