How to extract footnotes from an academic PDF
By ScoutMyTool Editorial Team · Last updated: 2026-05-22
Introduction
Anyone who has tried to pull the footnotes out of a humanities article knows the frustration: copy the page and you get the body text and the footnotes braided together in an order that makes neither readable. The reason is fundamental: a PDF does not know what a footnote is. It stores text at positions, and a footnote is just more text near the bottom of the page, distinguished only by where it sits and how it looks, not by any structural label. This guide explains how to extract footnote and endnote content from a scholarly PDF anyway: separating notes from body text by position, handling the superscript markers, turning reference lists into citations, dealing with scans, and verifying that what you pulled out is complete and correct.
Note types and how to extract them
| Note type | Where it lives | Extraction tip |
|---|---|---|
| Footnotes | Bottom of each page | Separate from body by page position |
| Endnotes | End of chapter/document | Often a clean contiguous block |
| Reference list | End, structured | Parse by entry; export to citations |
| In-text citations | Inline (Author, Year) | Pattern-match |
| Superscript markers | In body text | Link marker to its note |
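
The "pattern-match" tip for in-text citations, and the DOI recovery that helps when parsing a reference list, can be sketched with two regular expressions. This is a minimal illustration under stated assumptions, not a complete parser: the function names `find_author_year` and `find_dois` are hypothetical, the author-year pattern assumes parenthetical citations that start with a capitalised surname and contain a four-digit year beginning with 19 or 20, and the DOI pattern is a common heuristic that can miss unusual DOIs.

```python
import re

# Assumed shape: "(Smith, 2019)", "(Smith & Jones, 2020a, p. 4)".
# Narrative citations such as "Smith (2019)" are deliberately not handled.
AUTHOR_YEAR_RE = re.compile(
    r"\(([A-Z][^()\d]*?),?\s+((?:19|20)\d{2}[a-z]?)(?:,\s*[^()]*)?\)"
)

# Heuristic DOI shape: "10.", a 4-9 digit registrant code, "/", then a suffix.
DOI_RE = re.compile(r'\b10\.\d{4,9}/[^\s"<>]+')


def find_author_year(text):
    """Return (authors, year) pairs for parenthetical citations in text."""
    return [(m.group(1).strip(), m.group(2)) for m in AUTHOR_YEAR_RE.finditer(text)]


def find_dois(text):
    """Return DOI-like strings, with trailing sentence punctuation stripped."""
    # Stripping ")" is an assumption: a few real DOIs do end in ")".
    return [m.group(0).rstrip(".,;)") for m in DOI_RE.finditer(text)]


if __name__ == "__main__":
    sample = "As argued earlier (Smith, 2019), see doi:10.1000/xyz123."
    print(find_author_year(sample))
    print(find_dois(sample))
```

Both patterns over-match and under-match on real papers, so treat the hits as candidates to review rather than a finished list.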
Step by step: extract footnotes cleanly
- OCR first if it is a scan. No text means nothing to extract; run PDF OCR (see OCR + reformat) and expect to verify the small note text carefully.
- Extract text respecting layout. Pull the text in a way that preserves page structure so the bottom-of-page note region can be told apart from the body; naive copy-paste interleaves them.
- Separate notes by position. Use the cues (page-bottom location, smaller font, separator rule) to isolate footnotes; collect endnotes from their contiguous block.
- Pair markers to notes if needed. Match each body superscript marker to its numbered note when you want the linkage; ignore markers if you only want note content.
- Turn references into citations. For a reference list, recover DOIs and generate citations with the Citation Formatter โ see paper management & citation export.
- Move to an editable doc if reworking. Convert with PDF to Word to edit the extracted notes, accepting layout cleanup.
- Verify completeness and correctness. Count markers vs. extracted notes, check separation, and spot-check references against the original.
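
The "separate notes by position" step above can be sketched as follows. This is a rough heuristic, not a definitive implementation. It assumes you already have each page as a list of text lines with a vertical position and a font size from some layout-aware extractor; the `Line` record, the `split_body_and_notes` function, and the two threshold values are hypothetical and will need tuning per document.

```python
from dataclasses import dataclass
from statistics import median


@dataclass
class Line:
    text: str
    top: float        # distance from the top of the page, in points (assumed)
    font_size: float  # dominant font size of the line, in points (assumed)


def split_body_and_notes(lines, page_height, lower_fraction=0.5, size_ratio=0.92):
    """Split one page into (body_lines, note_lines) using position and size.

    A line counts as a note candidate if it sits in the lower part of the
    page and is noticeably smaller than the page's median font size. Only
    the unbroken run of such lines at the bottom of the page is taken.
    """
    if not lines:
        return [], []
    ordered = sorted(lines, key=lambda ln: ln.top)
    body_size = median(ln.font_size for ln in ordered)

    cut = len(ordered)
    # Walk upward from the last line while lines still look like notes.
    for i in range(len(ordered) - 1, -1, -1):
        ln = ordered[i]
        is_small = ln.font_size <= body_size * size_ratio
        is_low = ln.top >= page_height * lower_fraction
        if is_small and is_low:
            cut = i
        else:
            break
    return ordered[:cut], ordered[cut:]


if __name__ == "__main__":
    page = [
        Line("Body text continues here.1", 100, 11),
        Line("More body text.", 120, 11),
        Line("1 See Smith, History, 12.", 700, 9),
    ]
    body, notes = split_body_and_notes(page, page_height=800)
    print([ln.text for ln in notes])
```

Two limitations are worth knowing. A running footer or page number set in body-size type below the notes stops the upward walk, so strip those first. A page that is mostly footnotes skews the median, so a document-wide body size may be a better baseline than a per-page one.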
Related reading and tools
- Paper management & citation export: turning references into citations.
- Academic research workflow: the broader research toolkit.
- OCR + reformat: for scanned papers.
- Extracting complex tables: another layout-inference extraction.
- PDF to spreadsheet: structured data extraction.
- Citation Formatter: build citations from references.
- All ScoutMyTool PDF tools: the full toolkit.
FAQ
- Why are footnotes hard to extract from a PDF?
- Because a PDF has no concept of "this text is a footnote": it stores characters at positions on the page, and the footnote is just more text near the bottom, visually separated from the body by position and size but not by any structural tag. To a text-extraction tool, the page is one stream, so naive extraction interleaves footnote text with body text, or scrambles the order, especially across columns and page breaks. Separating footnotes cleanly means using their position (bottom of page, smaller font, often below a separator line) to distinguish them, which is inference, not a labelled field. Well-tagged accessible PDFs are easier; most scholarly PDFs are not tagged that way, so expect to verify the result.
- How do I separate footnote text from the body?
- Use the cues a footnote has: it sits at the bottom of the page, typically in a smaller font, often below a short horizontal rule, and is keyed by a superscript marker in the body. Extraction that respects page layout can split the bottom-of-page note region from the main text flow. After extracting, you usually reassemble: collect each page's footnotes, and if you want them tied to their references, match each note number to its superscript marker in the body. Endnotes are easier because they form a contiguous block at the end. For a clean result on a complex multi-column paper, expect a manual tidy-up pass after the automated extraction.
- What about the superscript markers in the body text?
- The little superscript numbers that key footnotes are part of the body text stream, so a plain text extraction pulls them inline (sometimes as normal-size digits, which can be confusing). If you want to preserve the marker-to-note relationship, you extract both the markers (with their positions in the text) and the notes (from the page bottom) and pair them by number. If you only want the note content, you can often ignore the markers. The thing to watch is that an extracted "5" inline might be a footnote marker, a page number, or data โ context matters, so verify rather than assuming every stray digit is a note reference.
- Can I extract the reference list and turn it into citations?
- Yes, and it is one of the most useful outcomes. A reference list or bibliography is a structured block at the end, so you can extract it and, where each entry has a DOI or enough identifying detail, convert it into proper citation records (BibTeX/RIS) for your reference manager rather than retyping. The most reliable path is to recover each reference's DOI and regenerate the citation from an authoritative source, since the formatting in the paper may not match your needed style. For footnote-style citations (common in humanities), the notes themselves contain the bibliographic detail, so extracting the notes is extracting the citations.
- What if the paper is a scanned PDF?
- Then there is no text to extract until you OCR it, and footnotes are an especially tricky OCR case: the small footnote font OCRs less accurately than body text, superscript markers are easily misread, and dense reference lists with abbreviations and diacritics are error-prone. So for a scanned scholarly PDF: OCR first, then extract and separate the notes, and verify carefully, because footnote and citation extraction from a scan is exactly where OCR errors hide. For a critical extraction, reconcile each note against the original page rather than trusting the OCR, especially page numbers, years, and author names in references.
- How do I verify the extracted footnotes are complete and correct?
- Check completeness first: count the note markers in the body against the number of extracted notes; they should match, and a mismatch means a note was missed or merged. Then spot-check that notes are correctly separated from body text (no body sentences captured as notes, no notes left in the body), and that multi-page or continued footnotes were assembled correctly. For references, verify a sample against the original. Footnote extraction is inherently inference-based, so "extract, then verify against the source" is essential: the automated result is a strong draft, not a guaranteed-correct one, particularly on complex or scanned papers.
- Is it safe to extract from an unpublished paper online?
- Unpublished manuscripts and papers under review are confidential, so prefer a tool that processes the file locally. ScoutMyTool runs text extraction, OCR, and citation formatting entirely in your browser tab, so the paper never leaves your machine. For anything pre-publication or under review, confirm the tool does not upload before using it.
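
The marker-pairing and completeness check described in the answers above (match note numbers to body markers, then compare the counts) can be sketched as follows. It is an illustration under assumptions, not a finished tool: `parse_notes` and `compare_markers_to_notes` are hypothetical names, notes are assumed to start with a one- to three-digit number, and the body marker numbers are assumed to have been identified already, since a stray inline digit is not always a marker.

```python
import re

# A note is assumed to start with its number, optionally followed by "." or ")".
NOTE_START = re.compile(r"^(\d{1,3})[.)]?\s+(\S.*)$")


def parse_notes(note_lines):
    """Group note-region lines into {number: text}.

    Returns (notes, carry_over). carry_over holds lines that appear before
    the first numbered note, typically a footnote continued from the
    previous page.
    """
    notes = {}
    carry_over = []
    current = None
    for raw in note_lines:
        line = raw.strip()
        if not line:
            continue
        m = NOTE_START.match(line)
        # Start a new note only when the number is the next in sequence, so a
        # wrapped line such as "12 vols." is not mistaken for note 12.
        if m and (current is None or int(m.group(1)) == current + 1):
            current = int(m.group(1))
            notes[current] = m.group(2)
        elif current is not None:
            notes[current] += " " + line
        else:
            carry_over.append(line)
    return notes, carry_over


def compare_markers_to_notes(marker_numbers, notes):
    """Return (markers with no note, notes with no marker), both sorted."""
    markers = set(marker_numbers)
    found = set(notes)
    return sorted(markers - found), sorted(found - markers)


if __name__ == "__main__":
    lines = ["3 See Smith, History, 12.", "4. Ibid., 14."]
    notes, carry = parse_notes(lines)
    print(compare_markers_to_notes([3, 4, 5], notes))
```

The next-in-sequence rule is a trade-off: it protects against wrapped lines that begin with a number, but if a note is missing from the extraction, the following note is merged into the previous one. The comparison still flags both numbers as missing, which is the signal to go back to the page.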
Citations
- Wikipedia, "Note (typography)," on footnotes and endnotes. en.wikipedia.org/wiki/Note_(typography)
- Wikipedia, "Citation," on the reference content footnotes often carry. en.wikipedia.org/wiki/Citation
- Wikipedia, "PDF" (ISO 32000), the position-based model that has no native footnote concept. en.wikipedia.org/wiki/PDF
Get the notes out, cleanly
Extract text, OCR scans, and turn references into citations with ScoutMyTool's in-browser tools. Your papers, including unpublished ones, never leave your machine.
Open the Citation Formatter →