Convert PDF to LaTeX — academic and technical papers

Convert PDFs back to LaTeX source for re-formatting and reuse.

By ScoutMyTool Editorial Team · Last updated: 2026-05-20

Recovering LaTeX source from a PDF is the "reverse-engineer the cake from the slice" problem of academic publishing. LaTeX compiles to PDF deterministically, but the reverse is approximate — PDF stores positioned glyphs, not the LaTeX commands that produced them. For specific use cases (lost source recovery, re-publication under a new template, citation extraction), conversion tools produce usable starting points that need manual polish. This article maps the tool options, the math-equation accuracy realities, and the realistic expectations for what conversion delivers.

PDF-to-LaTeX tools compared

Tool | Output | Strength | Weakness
Mathpix Snip | LaTeX with rendered math | Best math equation OCR; SaaS | Paid; uploads to vendor; per-page cost
pdf2tex (open source) | LaTeX source approximation | Free; works on born-digital PDFs | Layout fidelity loose; math is rough
Adobe Acrobat (Save As DOCX), then pandoc | LaTeX via Markdown | Two-step but uses widely-installed tools | Heavy formatting loss in the round-trip
Marker (mark-pdf-to-markdown) | Markdown then LaTeX via pandoc | High-quality structure preservation | Heavier setup; GPU helps
Hand-transcribe in LaTeX | LaTeX source you wrote | Highest fidelity; you control every detail | Time-consuming; only worth it for re-publication
arXiv source download | Original LaTeX (if available) | 100% fidelity for arXiv papers | Only works for arXiv-hosted papers

Step by step — recover LaTeX from a research-paper PDF

  1. Check arXiv and preprint repositories first. Search the paper title on arxiv.org, ChemRxiv, bioRxiv, SSRN. If the source is posted, download the .tar.gz archive directly. This is 100% fidelity recovery and skips all subsequent steps.
  2. OCR the PDF if it is a scan. Use ScoutMyTool Make PDF Searchable or OCRmyPDF to add a text layer. For math-heavy papers, OCR alone is insufficient for the equations; OCR gives you body-text extraction, and equations need a math-specific OCR pass.
  3. Convert the body text with pdf2tex, or with Marker plus pandoc. pdf2tex outputs a rough LaTeX file. Alternatively, convert PDF → Markdown via Marker, then Markdown → LaTeX via pandoc; pandoc cannot read PDF input itself, so it only serves as the second stage. The two-step route often preserves structure better than direct conversion.
  4. Convert equations separately with Mathpix Snip. Take screenshots of each equation in the PDF; Mathpix returns LaTeX. Paste each equation into the corresponding location in your draft LaTeX document. For papers with 100+ equations, this is hours of work; budget accordingly.
  5. Manual cleanup and compile. Set class file (article.cls or specific journal template). Add document preamble (packages, author, title). Compile with pdflatex; fix errors one at a time. Iterate until the output PDF roughly matches the source. Plan 2–4 hours of cleanup for a typical paper.
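Steps 3 and 5 can be sketched in a few lines of Python. This is a minimal illustration, not a finished pipeline: the function names, the package list, and the file names are assumptions made for the example. The pandoc flags used (--from, --to, --output) are standard, and without --standalone pandoc emits a body fragment, which is why the second function adds a preamble around it.

```python
# Hypothetical helper names; only the pandoc flags themselves are real.
DEFAULT_PACKAGES = ("amsmath", "amssymb", "graphicx", "hyperref")


def pandoc_markdown_to_latex_cmd(markdown_path: str, tex_path: str) -> list[str]:
    """Build (but do not run) the pandoc command for the Markdown -> LaTeX stage."""
    return [
        "pandoc",
        markdown_path,
        "--from", "markdown",
        "--to", "latex",
        "--output", tex_path,
    ]


def wrap_in_article(body: str, title: str, author: str,
                    packages=DEFAULT_PACKAGES) -> str:
    """Wrap a converted LaTeX body fragment in a minimal article preamble."""
    lines = [r"\documentclass{article}"]
    lines += [rf"\usepackage{{{name}}}" for name in packages]
    lines += [
        rf"\title{{{title}}}",
        rf"\author{{{author}}}",
        r"\begin{document}",
        r"\maketitle",
        body.strip(),
        r"\end{document}",
    ]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Pass the command to subprocess.run(...) once pandoc is installed.
    print(pandoc_markdown_to_latex_cmd("paper.md", "body.tex"))
    print(wrap_in_article(r"\section{Introduction} Text.", "Recovered paper", "A. Author"))
```

Swapping article for a journal class is a one-line change in wrap_in_article. One caveat worth knowing: pandoc fragments can rely on helper macros such as \tightlist that its own standalone template defines, so either define those in the preamble or let pandoc build the full document with --standalone and edit from there.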

When the conversion is and is not worth doing

Worth doing: paper you authored and lost source for, that you need to revise; paper you have explicit permission and reason to re-publish under a new template; large-corpus migration where many papers need LaTeX equivalents for an automated pipeline. Not worth doing: papers you only need to cite (cite the PDF, no conversion needed); papers where you only need to extract specific quotes or data (use PDF-to-text and copy-paste); papers where the original author is reachable and might share source (ask first).

For citation-management workflows, also consider Zotero or Mendeley with PDF-attachment storage. The reference database is structured; the underlying PDFs remain searchable. You rarely need LaTeX source of cited papers; you need clean BibTeX entries, which Zotero generates automatically from DOI metadata. Save the LaTeX-conversion effort for cases where the LaTeX output itself is the deliverable.

One more consideration: when working from a PDF that was originally authored in Word or a similar non-LaTeX tool, conversion to LaTeX does not magically produce a LaTeX-quality document. The typographic refinement that LaTeX is known for (microtypography, careful spacing, ligatures, hyphenation) comes from the LaTeX engine, not from the converter. A converted-to-LaTeX document compiles to PDF using LaTeX's typesetting; the result usually looks better than the Word-derived source but is not the equivalent of a hand-written LaTeX document where the author chose packages and commands deliberately. For papers being re-published in a LaTeX-native workflow, expect substantial manual refinement on top of the automated conversion. The conversion gets you started; the refinement is what gets you the LaTeX quality.

FAQ

Why would I convert a PDF to LaTeX rather than just citing the PDF?
Three reasons. First, re-formatting: you want to publish a derivative work or extended version under a different style guide / journal template. Converting back to LaTeX lets you reformat once and re-export to multiple targets. Second, citation re-extraction: extracting clean BibTeX references from a paper's bibliography is much easier from LaTeX source than from a PDF. Third, your own work: you wrote the paper, lost the LaTeX source, and need to recover it to revise. For citation-only use, just cite the PDF — no conversion needed. The conversion is for content reuse, not for reference.
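To illustrate why bibliography extraction is easier from source: when a paper's references live in a thebibliography environment, each entry starts with a \bibitem marker, so a short script can split them apart. The sketch below assumes that layout (papers that use a separate .bib file need no splitting at all), and the function name is made up for the example.

```python
import re

# Matches \bibitem{key} and \bibitem[label]{key}.
BIBITEM = re.compile(r"\\bibitem(?:\[[^\]]*\])?\{([^}]+)\}")


def split_bibitems(tex_source: str) -> dict[str, str]:
    """Map each \\bibitem key to its reference text, whitespace-normalised."""
    end = tex_source.find(r"\end{thebibliography}")
    if end == -1:
        end = len(tex_source)
    matches = [m for m in BIBITEM.finditer(tex_source) if m.start() < end]
    entries = {}
    for current, following in zip(matches, matches[1:] + [None]):
        stop = following.start() if following else end
        raw = tex_source[current.end():stop]
        entries[current.group(1)] = " ".join(raw.split())
    return entries
```

The result is plain text per key, not BibTeX; turning each entry into structured fields is still a manual or reference-manager step.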
Is the paper on arXiv? Check there first.
For papers hosted on arXiv, you can often download the original LaTeX source directly: the paper's abstract page links to the TeX source (typically a .tar.gz archive) for submissions that were uploaded as LaTeX. This is 100% fidelity recovery, no conversion needed. Before reaching for conversion tools, search arXiv (and similar preprint repositories: ChemRxiv, bioRxiv, SSRN) for the paper. If the author posted source there, your "PDF to LaTeX" problem is solved by download. For journal-only publications without preprint posting, conversion is the remaining path.
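Once a source archive is downloaded, the first practical question is which file to compile. A minimal sketch using Python's standard tarfile module: it lists the .tex members that contain \documentclass, which is usually the main file. It assumes the download really is a tar archive; some submissions may ship as a single compressed file, which this sketch does not handle. The function name is illustrative.

```python
import tarfile


def find_main_tex(archive_path: str) -> list[str]:
    """Return names of .tex members that contain \\documentclass."""
    main_files = []
    # "r:*" lets tarfile detect the compression (gzip, bzip2, none) itself.
    with tarfile.open(archive_path, "r:*") as archive:
        for member in archive.getmembers():
            if not (member.isfile() and member.name.endswith(".tex")):
                continue
            handle = archive.extractfile(member)
            if handle is None:
                continue
            text = handle.read().decode("utf-8", errors="replace")
            if "\\documentclass" in text:
                main_files.append(member.name)
    return sorted(main_files)
```

If the function returns more than one name, the archive probably bundles supplementary documents alongside the paper; open each candidate and pick the one that includes the others.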
How accurately do tools convert math equations?
Mathpix is the current best-in-class for equation OCR — accuracy on clean printed math is 95%+ across most common notation. Tesseract and general-purpose OCR fail on math; do not attempt without a math-specific tool. For papers with extensive math (theoretical physics, mathematics, statistics), Mathpix or hand-transcription are the realistic options. For papers with occasional math (CS, ML, applied work), Mathpix on the equation pages plus general OCR / extraction on the rest is the hybrid that minimises cost while producing usable LaTeX output.
What about tables and figures in academic papers?
Tables convert reasonably well via pdfplumber or Camelot: both extract the cells as rows (Camelot can also export CSV), which you then convert to a LaTeX tabular. Figures are images; LaTeX includes them as \includegraphics references to extracted image files. Quality of figures depends on the source: figures embedded at 300 DPI come through cleanly; low-resolution screen figures come through soft. For papers being re-published, replace low-quality figures with high-quality originals if available (most authors keep figure source files separately from the paper PDF).
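The last hop, from extracted rows to a tabular environment, is mostly escaping and joining. The sketch below assumes the rows arrive as a list of lists of strings with None for empty cells, which is the shape table extractors such as pdfplumber typically return. The function names are illustrative, and the all-left-aligned column spec is a placeholder to adjust by hand.

```python
# Characters that LaTeX treats specially in running text.
LATEX_SPECIALS = {
    "\\": r"\textbackslash{}",
    "&": r"\&",
    "%": r"\%",
    "$": r"\$",
    "#": r"\#",
    "_": r"\_",
    "{": r"\{",
    "}": r"\}",
    "~": r"\textasciitilde{}",
    "^": r"\textasciicircum{}",
}


def escape_latex(cell) -> str:
    """Escape one cell; None becomes an empty cell."""
    text = "" if cell is None else str(cell)
    # Character-by-character replacement avoids double-escaping.
    return "".join(LATEX_SPECIALS.get(ch, ch) for ch in text)


def rows_to_tabular(rows) -> str:
    """Render rows as a tabular, treating the first row as the header."""
    if not rows:
        raise ValueError("need at least one row")
    width = max(len(row) for row in rows)
    lines = [r"\begin{tabular}{" + "l" * width + "}", r"\hline"]
    for index, row in enumerate(rows):
        cells = [escape_latex(c) for c in row] + [""] * (width - len(row))
        lines.append(" & ".join(cells) + r" \\")
        if index == 0:
            lines.append(r"\hline")  # rule under the header row
    lines.append(r"\hline")
    lines.append(r"\end{tabular}")
    return "\n".join(lines)
```

Cells that contain math will be escaped as literal text here, so tables with formulas still need the equation pass described earlier.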
Can the conversion produce a paper that compiles directly?
Usually not on the first pass. The converted LaTeX will compile to something approximating the original but with formatting drift — class file, packages, layout commands all need manual review. The conversion produces a working starting point that you then refine. Plan for 2–4 hours of manual cleanup per converted paper on top of the automated conversion time. For papers you do not need to extensively re-format, accept the imperfect conversion and move on; for papers you intend to re-publish, the manual polish is necessary.
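Fixing errors one at a time is easier if you pull the first error out of the compile log instead of scrolling through it. A minimal sketch, assuming the default TeX log layout in which an error line starts with "! " and the offending source line is reported on a later line as "l.<number>". Logs produced with other options (for example -file-line-error) use a different shape and would need a different pattern. The function name is illustrative.

```python
import re

# Source-line marker in a default TeX log, e.g. "l.12 \foo".
LINE_MARKER = re.compile(r"^l\.(\d+)\b")


def first_latex_error(log_text: str):
    """Return (message, line_number) for the first error in a log, or None."""
    lines = log_text.splitlines()
    for index, line in enumerate(lines):
        if not line.startswith("! "):
            continue
        message = line[2:].strip()
        # Look for the line marker before the next error begins.
        for follow in lines[index + 1:]:
            if follow.startswith("! "):
                break
            match = LINE_MARKER.match(follow)
            if match:
                return message, int(match.group(1))
        return message, None
    return None
```

Fix the reported line, recompile, and call it again; later errors in the same log are often consequences of the first one.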
