Convert PDF to HTML — preserve images and layout

Four approaches to PDF-to-HTML conversion and how to pick the right one.

By ScoutMyTool Editorial Team · Last updated: 2026-05-20

A client wanted their 60-page annual report republished as a series of web pages for SEO and accessibility. Their starting point was a beautifully designed PDF; the target was structured, indexable, responsive HTML. The conversion is not as simple as choosing "Export as HTML" in a single tool: PDF describes fixed pages with absolutely positioned content, while HTML describes a reflowable document, and how to bridge them depends on what you need to keep (for example, semantic structure for accessibility) and what you can afford to lose (for example, exact layout). This article maps the four working approaches, the trade-off each makes, and the right pick for common publishing workflows.

Four approaches — what each preserves

| Approach | Preserves | Loses | Best for |
| --- | --- | --- | --- |
| Text-only extraction | Words; headings if tagged | All layout, images, columns, exact positioning | CMS import, accessibility, content extraction |
| Semantic HTML reconstruction | Headings, paragraphs, lists, images, basic layout | Exact positioning, fonts, complex multi-column | Web publishing of reports and whitepapers |
| Layout-preserving HTML | Visual layout via absolute positioning + CSS | Reflowability, semantic structure | Form-embedded archives where appearance matters |
| Pixel-perfect (PDF as image) | Every visual detail | Selectable text, accessibility, search | When fidelity matters absolutely (legal evidence display) |

Step by step — semantic-reconstruction for web publishing

  1. Verify the source PDF has tagged headings. Acrobat: View → Show/Hide → Navigation Panes → Tags. Tagged PDFs convert to semantic HTML much better than untagged.
  2. Open ScoutMyTool PDF to HTML at scoutmytool.com/pdf/pdf-to-html and drop the file. The tool runs in the browser.
  3. Pick semantic-reconstruction mode. The tool outputs HTML with proper h1/h2/h3, paragraph, list, and img tags.
  4. Review and clean the output. Fix any mis-detected headings; adjust image sizes; remove footer/header artefacts that bled through.
  5. Import to CMS or publish directly. Most CMS platforms accept HTML import. For static-site generation, drop the HTML into your content directory and let your build pipeline pick it up.
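
The clean-up in step 4 can be partly automated. As a minimal sketch, assuming you can split the converted output into per-page lists of text lines (the `pages` structure, the function names, and the 60% threshold are illustrative assumptions, not part of any converter's API), running headers and footers can be spotted as lines that repeat on most pages:

```python
from collections import Counter


def find_repeated_lines(pages, min_share=0.6):
    """Return lines that appear on at least min_share of the pages.

    pages: list of pages, each a list of text lines (hypothetical input;
    how you obtain it depends on your converter's output).
    """
    counts = Counter()
    for lines in pages:
        # Count each distinct line once per page, ignoring surrounding spaces.
        counts.update({line.strip() for line in lines if line.strip()})
    threshold = min_share * len(pages)
    return {line for line, n in counts.items() if n >= threshold}


def strip_repeated_lines(pages, min_share=0.6):
    """Drop lines identified as running headers or footers from every page."""
    repeated = find_repeated_lines(pages, min_share)
    return [[l for l in lines if l.strip() not in repeated] for lines in pages]


# Made-up sample data for illustration only.
pages = [
    ["Annual Report 2025", "Chairman's letter", "Revenue grew.", "Page 1"],
    ["Annual Report 2025", "Operations", "Two plants opened.", "Page 2"],
    ["Annual Report 2025", "Outlook", "Demand is stable.", "Page 3"],
]
cleaned = strip_repeated_lines(pages)
print(cleaned[0])  # ["Chairman's letter", 'Revenue grew.', 'Page 1']
```

Note that the page numbers survive because each one is different; catching those would need a pattern match in addition to the repetition count.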

Common pitfalls when converting PDF to HTML

Five issues come up regularly. First, scanned-PDF source without OCR: the converter has no text to extract and produces empty or image-only HTML. Always OCR before conversion. Second, embedded images at huge resolution: the source PDF may have 600 DPI images for print; the HTML version should downsample to 150 DPI so pages load in reasonable time on the web. Third, broken internal links: cross-reference links in the source PDF (table of contents, footnotes) sometimes do not survive the conversion; verify each one after export. Fourth, multi-column layout flattening: a two-column research paper converts to single-column HTML in most tools; the reading order should follow column 1 fully then column 2, but naive conversion reads left-to-right across both columns line by line.
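
The fourth pitfall is easiest to see in code. The sketch below assumes text blocks arrive with a left edge `x` and a top edge `y` measured from the top-left corner of the page, and that columns are equal-width; the `Block` type and `reading_order` function are hypothetical names for illustration. Real documents with uneven columns or full-width headings need more careful column detection.

```python
from dataclasses import dataclass


@dataclass
class Block:
    text: str
    x: float  # left edge in points (assumed top-left origin)
    y: float  # top edge in points, growing downward


def reading_order(blocks, page_width, columns=2):
    """Sort blocks column by column, then top to bottom within each column."""
    col_width = page_width / columns

    def key(block):
        # Clamp so blocks slightly outside the page still land in a column.
        col = max(0, min(int(block.x // col_width), columns - 1))
        return (col, block.y)

    return sorted(blocks, key=key)


# Two lines in each column of a 612-point-wide page (made-up sample).
blocks = [
    Block("B1", 320, 100),
    Block("A2", 50, 120),
    Block("A1", 50, 100),
    Block("B2", 320, 120),
]

naive = [b.text for b in sorted(blocks, key=lambda b: (b.y, b.x))]
fixed = [b.text for b in reading_order(blocks, page_width=612)]
print(naive)  # ['A1', 'B1', 'A2', 'B2'] reads across both columns
print(fixed)  # ['A1', 'A2', 'B1', 'B2'] finishes column 1 first
```

If your extractor reports coordinates in native PDF space, where the origin is at the bottom-left, flip `y` before sorting.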

Fifth, missing semantic structure: an untagged source PDF produces <div>-only HTML with no headings, which fails both SEO and accessibility. Source PDFs with tagged headings (in Acrobat, check View → Show/Hide → Navigation Panes → Tags) produce far better HTML; if the source is untagged, either tag the source first or accept that the HTML will need manual structural clean-up. Most CMS imports treat heading-less HTML as a single blob and lose chapter navigation.
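
A quick way to catch this before import is to audit the heading outline of the converted file. The following is a rough sketch using only Python's standard `html.parser`; the `audit_headings` helper is a hypothetical name, and treating a skipped heading level as a problem reflects common accessibility guidance rather than a hard rule.

```python
from html.parser import HTMLParser


class HeadingAudit(HTMLParser):
    """Collect the levels of h1..h6 tags in document order."""

    def __init__(self):
        super().__init__()
        self.levels = []

    def handle_starttag(self, tag, attrs):
        if len(tag) == 2 and tag[0] == "h" and tag[1] in "123456":
            self.levels.append(int(tag[1]))


def audit_headings(html):
    """Return (heading levels, list of problems found)."""
    parser = HeadingAudit()
    parser.feed(html)
    parser.close()
    levels = parser.levels
    problems = []
    if not levels:
        problems.append("no headings found")
    # Compare each heading with the one before it (0 = start of document).
    for prev, level in zip([0] + levels, levels):
        if level > prev + 1:
            where = "start of document" if prev == 0 else f"h{prev}"
            problems.append(f"h{level} follows {where}, skipping a level")
    return levels, problems


print(audit_headings("<div>Intro</div><div>Body</div>"))
# ([], ['no headings found'])
print(audit_headings("<h1>Report</h1><h3>Revenue</h3><p>Text</p>"))
# ([1, 3], ['h3 follows h1, skipping a level'])
```

An empty level list is the <div>-only case described above; a list with jumps usually points at mis-detected headings that need manual correction.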

When to choose HTML vs keep the PDF

Choose HTML for web publishing where readers will read inline, the mobile experience matters, SEO indexing is a goal, and the content is not formally frozen. Keep the PDF for archival citation, regulatory submission, signed contracts, and print distribution. The most common pattern is to publish HTML as the canonical web version and link to a downloadable PDF for users who want a frozen copy. Both can coexist; the choice is which is the primary version, not which to use exclusively.

For internal documentation systems, a hybrid is common: author in Markdown, then generate both HTML (for the wiki) and PDF (for offline distribution) from the same source on each build. This eliminates the conversion problem entirely: both outputs share a single source of truth, edits propagate to both, and there is no "which version is correct" ambiguity. Pandoc and Sphinx can produce both outputs directly; MkDocs and Astro can follow the same pattern with plugins or an added print-to-PDF step. For solo and small-team documentation, the setup typically pays back within a few months.

FAQ

Why convert PDF to HTML at all — is it not better to just link to the PDF?
Depends on the use case. HTML is responsive (works on phones), indexable by Google more reliably than PDF, faster to load, and more accessible. PDF is fixed-layout, frozen-version, and citable. If the content is destined for a website where users will read it inline (blog posts, knowledge-base articles, marketing content), HTML is the better target. If the content is a downloadable archive (research papers, whitepapers, regulatory filings) the PDF is the right format and a "View HTML version" link can complement it. Many publishers do both: HTML primary, PDF for download. The conversion is the bridge.
My converted HTML looks broken — what is going on?
Three common causes. First, the PDF was image-only (scanned) — there is no text to extract. OCR first, then convert. Second, the converter used absolute-positioning layout (every text run anchored to its original coordinate), which works at desktop sizes but breaks at narrow viewports. Use a semantic-reconstruction converter instead. Third, fonts referenced in the PDF were not embedded; the HTML references missing fonts and falls back to system defaults that change line-flow. Verify the source PDF has embedded fonts before conversion.
How do images survive the conversion?
Modern converters extract images from the PDF as PNG / JPEG files and embed them in the HTML as <img> tags. Vector graphics (charts, line drawings) can be extracted as SVG for scalable display, or rasterised to PNG for compatibility. Image quality settings during conversion control whether the output is original-resolution (faithful but heavy) or downsampled (web-friendly but soft). For web publishing, 150 DPI images strike the right balance for most use cases.
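The downsampling arithmetic is simple enough to check by hand. As an illustrative sketch (the `downsample_size` helper is a hypothetical name, and 150 DPI is the target suggested above), an image keeps its printed size while its pixel dimensions scale by the ratio of target to source DPI:

```python
def downsample_size(width_px, height_px, source_dpi, target_dpi=150):
    """Return pixel dimensions after resampling to target_dpi.

    Images already at or below the target are left unchanged,
    since upsampling adds weight without adding detail.
    """
    if target_dpi >= source_dpi:
        return width_px, height_px
    scale = target_dpi / source_dpi
    return max(1, round(width_px * scale)), max(1, round(height_px * scale))


# A 6 x 4 inch figure prepared for print at 600 DPI is 3600 x 2400 pixels.
print(downsample_size(3600, 2400, source_dpi=600))  # (900, 600)
# A 96 DPI screenshot is already below the target and stays as it is.
print(downsample_size(800, 600, source_dpi=96))     # (800, 600)
```

Going from 600 to 150 DPI cuts each dimension to a quarter, so the pixel count drops by a factor of 16, which is where most of the file-size saving comes from.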
Can the resulting HTML be edited in a normal HTML editor?
Yes, if you pick a semantic-reconstruction converter. The output is structured HTML with paragraph, heading, list, and image tags — editable in any HTML editor or CMS. Layout-preserving converters produce HTML with absolute positioning and inline styles that is technically editable but practically hostile. For a content workflow where the HTML is the new canonical version (you will update it independently of the PDF), prefer semantic output. For an archival workflow where the HTML mirrors the PDF exactly and the PDF remains canonical, layout-preserving is fine.
Can I do this conversion in the browser without uploading the PDF?
Yes. ScoutMyTool PDF to HTML runs in the browser using pdf-lib and PDF.js. The conversion happens locally; your PDF is never uploaded. For server-side bulk conversion (thousands of PDFs in a batch), the open-source command-line tool `pdf2htmlEX` is a widely used option; note that it targets layout-preserving output rather than semantic reconstruction. For one-off conversion, the browser tool is faster to use and keeps the file on your machine, which is useful for confidential content where you do not want a third-party server in the loop.

Citations

  1. ISO 32000-1:2008 — "Document management — Portable document format" — tagged-PDF structure.
  2. WHATWG HTML Living Standard — target HTML semantics.
  3. WCAG 2.1 — Web Content Accessibility Guidelines — semantic HTML requirements.
  4. pdf2htmlEX — open-source PDF-to-HTML conversion tool.
