Convert PDF to Markdown — preserve headings and lists (2026)

By ScoutMyTool Editorial Team · Last updated: 2026-05-20

In our work with hundreds of users on document-import workflows, PDF-to-Markdown is the conversion that has grown fastest over the past five years, driven by static-site generators, knowledge bases, and LLM pipelines that all want structured Markdown input rather than messy plain text. The challenge is that PDF does not carry "this is a heading" or "this is a list" metadata; structure has to be inferred from visual formatting. Below is how that inference works, what it gets right, what it gets wrong, and the workflow for producing usable Markdown with minimal manual cleanup.

PDF formatting → Markdown syntax mapping

PDF formatting cue	Markdown output
Largest heading font cluster	`# H1`
Second-largest heading font	`## H2`
Bold paragraph in body context	`### H3` (if cluster matches)
Bullet character at line start	`- item`
Numbered "1." prefix	`1. item`
Bold font run inside body text	`**bold**`
Italic font run inside body text	`_italic_`
Monospace inline run	`` `code` ``
Multi-line monospace block	```\ncode block\n```
Simple table	Pipe-separated Markdown table
Hyperlink	`[text](url)`

Step-by-step: convert a PDF to Markdown

The ScoutMyTool converter lives at scoutmytool.com/pdf/pdf-to-text; switch on the Markdown output toggle. It runs client-side: no upload, no signup, no quota.

Drop your PDF. It loads into a sandboxed memory buffer; nothing is uploaded.
Select Markdown as the output format. The tool runs the font-size clustering and emphasis detection passes that plain-text mode skips.
Configure structural detection.
- Heading depth (1–4): how many font-size clusters to treat as headings.
- Tables mode: Markdown table syntax (simple tables) or HTML <table> (complex).
- Code block detection: on by default; recognises monospace runs.
Click Convert. Progress is shown live per page; expect 5–15 seconds for typical documents.
Review the output. The tool shows a side-by-side view: PDF on the left, Markdown on the right. Spot-check that headings, lists, and emphasis came through correctly.
Download the .md file. Open it in any Markdown editor (Obsidian, Typora, VS Code, etc.) and verify rendering.
If structure detection got things wrong. Adjust the heading-depth setting and re-convert. A common case is a document whose three font-size clusters map cleanly to # / ## / ### but where the tool detected four; reduce the depth and re-run.
If the PDF is scanned (image-only). Run OCR first with the PDF OCR tool. The OCR output preserves font information that the Markdown converter needs to detect structure.

What gets lost (and what to do about it)

Page numbers and running headers. The tool detects and drops these by default (they are not content). Toggle "Preserve running headers" if you need them.
Footnotes. Emitted as inline-link-style references, with the footnote content at the end of the file. Most Markdown renderers display these correctly.
Complex table layouts. Multi-row headers and merged cells do not have a clean Markdown equivalent; use the HTML-table fallback mode for these.
Embedded images. Extracted to a sibling /images/ directory with Markdown image links inserted at the right position. Original resolution preserved.
Mathematical notation. Inline math with standard symbols (= + − × ÷ √) survives; LaTeX-style notation does not because PDFs do not carry LaTeX source. For math-heavy PDFs, consider a math-specific OCR like Mathpix.
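
The article does not say how running headers and page numbers are detected. One common approach, sketched here purely as an assumption, is to drop first and last lines that repeat across most pages once digits are normalised. The function name, the list-of-lines page shape, and the 60% threshold are all hypothetical.

```python
import re
from collections import Counter


def strip_running_lines(pages, min_share=0.6):
    """Drop first/last lines that repeat on most pages.

    `pages` is a list of pages, each a list of text lines (a hypothetical
    input shape chosen for this sketch).
    """
    def key(line):
        # Normalise digits so "Page 3" and "Page 4" compare equal.
        return re.sub(r"\d+", "#", line.strip())

    # Count, per page, the normalised first and last line.
    counts = Counter()
    for page in pages:
        counts.update({key(line) for line in page[:1] + page[-1:]})

    # A line is "running" if it appears on at least min_share of the
    # pages, and never on fewer than two pages.
    threshold = max(2, min_share * len(pages))
    running = {k for k, n in counts.items() if n >= threshold}

    cleaned = []
    for page in pages:
        kept = list(page)
        if kept and key(kept[0]) in running:
            kept = kept[1:]
        if kept and key(kept[-1]) in running:
            kept = kept[:-1]
        cleaned.append(kept)
    return cleaned
```

Only the top and bottom line of each page are examined, so a sentence that happens to repeat in the body is left alone. A "Preserve running headers" option would simply skip this pass.
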

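Markdown itself is standardised by the CommonMark specification¹; the tool emits CommonMark-conformant output for core syntax such as headings, lists, emphasis, and code blocks. Pipe tables and footnotes are extensions outside CommonMark (tables, for example, come from GitHub Flavored Markdown), so confirm that your renderer supports them.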

Frequently asked questions

Why convert PDF to Markdown instead of plain text?: Because the structural information that makes a document readable — headings, lists, emphasis, quotes — is lost in a plain-text dump and preserved in Markdown. A research paper with eight section headings, three numbered lists, and twelve bold callouts becomes a wall of text in .txt; in Markdown the same content keeps its structure and is roughly as readable as the original. Markdown is also the input format for many downstream tools: static site generators (Hugo, Jekyll, Astro), documentation systems (Docusaurus, MkDocs), knowledge bases (Obsidian, Logseq), and LLM training pipelines that need structured input.
How does the tool detect headings?: Two cues combined. (a) Font size — headings are usually significantly larger than body text; the tool clusters font sizes used in the document and treats the top 3–4 clusters as candidate heading levels (H1, H2, H3, H4). (b) Style — bold-only or sans-serif-only paragraphs in an otherwise-mixed document are heading candidates. The mapping from PDF formatting to Markdown # / ## / ### levels uses the size hierarchy: the largest cluster maps to H1, next to H2, and so on, capped at H4 because deeper levels are rare in everyday documents.
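
As a rough illustration of the font-size cue (a), the sketch below groups sizes larger than the body size into clusters and assigns heading levels from largest to smallest. It is not the tool's actual code: the function names, the (text, font_size) input shape, and the 0.5 pt tolerance are assumptions made for this example, and the style cue (b) is left out.

```python
from collections import Counter


def heading_levels(lines, max_depth=4, tolerance=0.5):
    """Map heading font sizes to levels 1..max_depth.

    `lines` is a list of (text, font_size) pairs; this input shape is
    hypothetical, chosen for the sketch.
    """
    # Treat the size that covers the most characters as body text.
    weight = Counter()
    for text, size in lines:
        weight[round(size, 1)] += len(text)
    if not weight:
        return {}
    body_size = weight.most_common(1)[0][0]

    # Group the larger sizes into clusters: a size within `tolerance`
    # points of the previous one joins the same cluster.
    clusters = []
    larger = (s for s in weight if s > body_size + tolerance)
    for size in sorted(larger, reverse=True):
        if clusters and clusters[-1][-1] - size <= tolerance:
            clusters[-1].append(size)
        else:
            clusters.append([size])

    # Largest cluster -> level 1, next -> level 2, and so on. Clusters
    # beyond max_depth get no level and stay ordinary paragraphs.
    return {
        size: level
        for level, cluster in enumerate(clusters[:max_depth], start=1)
        for size in cluster
    }


def render(lines, max_depth=4):
    """Prefix heading lines with '#' markers; leave other lines as-is."""
    levels = heading_levels(lines, max_depth)
    out = []
    for text, size in lines:
        level = levels.get(round(size, 1))
        out.append(f"{'#' * level} {text}" if level else text)
    return "\n\n".join(out)
```

Lowering `max_depth` in this sketch mirrors the heading-depth setting: the smallest heading clusters fall back to plain paragraphs instead of being promoted.
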
What about lists: does the tool detect bullets and numbered items?: Yes. Bullet detection runs a pattern match against common bullet characters (•, *, -, –) at the start of paragraphs, and emits "- " (Markdown unordered list) for each. Numbered-list detection looks for "1.", "(1)", "(a)" prefixes and emits Markdown numbered-list syntax ("1. ..."). Indented sub-lists are preserved using Markdown indentation (4 spaces per level). Multi-line list items wrap correctly under the leading bullet rather than appearing as separate paragraphs.
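
The prefix matching described above can be sketched with two regular expressions. This is a minimal illustration, not the tool's implementation: the `list_item` name and the explicit `level` argument (standing in for whatever indentation measure the converter uses) are assumptions.

```python
import re

# Bullet glyphs: U+2022 bullet, asterisk, hyphen, U+2013 en dash.
BULLET = re.compile(r"^[\u2022*\-\u2013]\s+(.*)$")
# Numbered prefixes: "1.", "(1)", "(a)".
NUMBERED = re.compile(r"^(?:\d+\.|\(\d+\)|\([a-z]\))\s+(.*)$")


def list_item(text, level=0):
    """Return a Markdown list line, or None if `text` is not a list item."""
    indent = " " * (4 * level)  # 4 spaces per nesting level
    match = BULLET.match(text)
    if match:
        return f"{indent}- {match.group(1)}"
    match = NUMBERED.match(text)
    if match:
        # Every item is emitted as "1."; CommonMark renderers take the
        # start number from the first item and count up from there.
        return f"{indent}1. {match.group(1)}"
    return None
```

Requiring whitespace after the marker keeps text such as "-5 degrees" or "3.14 is pi" from being mistaken for list items.
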
Will it preserve bold and italic emphasis?: Bold yes, italic mostly. Bold is detected from the embedded font name (font names ending in "Bold", "Black", "Heavy" or with bold-weight flags). The tool wraps the bold run in Markdown ** markers. Italic detection is less reliable because italic styling sometimes uses a separate font (Times-Italic.ttf) and sometimes a font variant (style=italic in OpenType); the tool catches the font-name case but may miss the font-variant case. If italic emphasis is critical for downstream processing, plan to review the output for missed italics on a sample page.
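
The font-name case can be sketched as below. This is an illustration under stated assumptions rather than the tool's code: `wrap_run` is a hypothetical name, the patterns match anywhere in the name (so that "BoldItalic" faces count as both), and "oblique" is added as an extra italic hint. The font-variant case, which the answer above says may be missed, is not handled here either.

```python
import re

BOLD_HINT = re.compile(r"bold|black|heavy", re.IGNORECASE)
ITALIC_HINT = re.compile(r"italic|oblique", re.IGNORECASE)


def wrap_run(text, font_name):
    """Wrap a text run in Markdown emphasis markers based on its font name."""
    # Embedded subset fonts carry a prefix such as "ABCDEF+"; drop it.
    name = font_name.split("+")[-1]
    bold = bool(BOLD_HINT.search(name))
    italic = bool(ITALIC_HINT.search(name))

    core = text.strip()
    if not core or not (bold or italic):
        return text

    # Keep surrounding whitespace outside the markers; "**word **" with
    # a space before the closing marker does not render as bold.
    lead = text[: len(text) - len(text.lstrip())]
    trail = text[len(text.rstrip()):]
    marker = "***" if bold and italic else "**" if bold else "_"
    return f"{lead}{marker}{core}{marker}{trail}"
```
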
Does it handle code blocks?: Yes, via two heuristics. Monospace-font runs longer than a few characters are wrapped in inline backticks. Multi-line monospace blocks (typically code listings in technical PDFs) are emitted as fenced code blocks ("```") preserving original indentation. The language tag in the fenced block is left empty unless the source PDF has a language hint embedded in the font choice (e.g. "Source Code Pro" for code listings) that the tool can pattern-match.
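
A minimal sketch of the two heuristics follows, assuming each line has already been classified as monospace or not. The `emit_lines` name, the (text, is_mono) input shape, and the rule that a lone monospace line becomes inline code are assumptions for illustration; language tagging is left out.

```python
FENCE = "`" * 3  # Markdown code fence marker


def emit_lines(lines):
    """Turn (text, is_mono) pairs into Markdown text.

    Two or more consecutive monospace lines become a fenced block with
    their indentation intact; a single monospace line becomes inline code.
    """
    out, block = [], []

    def flush():
        if len(block) >= 2:
            out.extend([FENCE, *block, FENCE])
        elif block:
            out.append(f"`{block[0].strip()}`")
        block.clear()

    for text, is_mono in lines:
        if is_mono:
            block.append(text)
        else:
            flush()
            out.append(text)
    flush()  # close a block that runs to the end of the input
    return "\n".join(out)
```

A fuller version would also need a longer fence when the code itself contains backticks.
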
Will tables come out as Markdown tables?: Yes for simple tables, with caveats for complex ones. Simple tables (regular grid, no merged cells, clear column separators) convert cleanly to Markdown table syntax. Complex tables (merged cells, multi-row headers, embedded images) have no clean Markdown representation; the tool falls back to a best-effort pipe-separated layout or, with the "Tables as HTML" toggle enabled, emits raw HTML <table> markup, which most Markdown engines render. Choose the mode that suits the downstream renderer.
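
The two table modes can be sketched as follows, assuming the grid has already been recovered as rows of cell strings with the header first. The function names and the `tables_as_html` flag are hypothetical stand-ins for the toggle, and padding short rows is only one possible "best effort" behaviour.

```python
from html import escape


def markdown_table(rows):
    """Render rows (first row is the header) as a pipe table."""
    width = len(rows[0])

    def line(cells):
        # Pad short rows and escape literal pipes so columns stay aligned.
        padded = list(cells) + [""] * (width - len(cells))
        return "| " + " | ".join(c.replace("|", "\\|") for c in padded) + " |"

    separator = "| " + " | ".join(["---"] * width) + " |"
    return "\n".join([line(rows[0]), separator, *(line(r) for r in rows[1:])])


def html_table(rows):
    """Render the same rows as raw HTML, escaping cell text."""
    def row(cells, tag):
        return "<tr>" + "".join(f"<{tag}>{escape(c)}</{tag}>" for c in cells) + "</tr>"

    body = [row(r, "td") for r in rows[1:]]
    return "\n".join(["<table>", row(rows[0], "th"), *body, "</table>"])


def emit_table(rows, tables_as_html=False):
    return html_table(rows) if tables_as_html else markdown_table(rows)
```

Merged cells and multi-row headers are not modelled in this sketch; representing them would require `rowspan` and `colspan` attributes, which is why the HTML mode is the one that can carry them.
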
Is my PDF uploaded?: No. The conversion runs entirely in your browser, using pdf.js for parsing with a Markdown emitter built on top. Your file is loaded into a sandboxed memory buffer, processed locally, and the result is delivered as a .md download. You can verify this in the DevTools Network tab: the conversion makes no outbound requests. This matters for PDFs containing proprietary content (internal documentation, draft manuscripts), where uploading to a cloud converter defeats the privacy goal.

Convert your PDF to Markdown now — free, no signup, no upload

Heading hierarchy, list detection, emphasis, code blocks, tables. Runs entirely in your browser.

Open the PDF-to-Markdown tool at scoutmytool.com/pdf/pdf-to-text →