PDF for science researchers: paper management and citation export
By ScoutMyTool Editorial Team · Last updated: 2026-05-21
Introduction
By the end of my PhD my downloads folder held about nine hundred PDFs, half of them named download.pdf or 1-s2.0-S000.pdf, and rebuilding a bibliography from that pile the week before a submission deadline was its own special misery. The system I wish I had started with is simple and tool-agnostic: name files consistently, capture the DOI the moment you save a paper, and export citations from the identifier rather than retyping them. This article lays out that workflow for managing research-paper PDFs and getting clean, correctly formatted citations out of them: the file hygiene, the metadata capture, the export formats, and the handling of scanned papers and supplements that trip people up.
Citation export formats: which to use
"Export the citation" can mean half a dozen different files depending on your reference manager and writing tool. Here is what each format is for so you pick the right one once instead of converting later.
| Format | Used by | Notes |
|---|---|---|
| BibTeX (.bib) | LaTeX / Overleaf workflows | Plain-text, key-based; the default for math, physics, CS |
| RIS (.ris) | EndNote, Mendeley, Zotero import | Tagged line format; the lingua franca for reference managers |
| CSL-JSON | Zotero, Citation Style Language tools | Structured JSON; drives 10,000+ output styles |
| DOI string | Anywhere; resolves to full metadata | The most durable identifier; one line you can expand later |
| EndNote XML | EndNote desktop libraries | Verbose XML; mostly for EndNote-to-EndNote transfer |
| Formatted text | Pasting into a manuscript bibliography | APA/MLA/Chicago/IEEE rendered string; check style edition |
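To make the difference between the two most common formats concrete, here is a minimal sketch that renders one reference as both BibTeX and RIS. The record is made up for illustration (the authors, title, journal, and DOI are not a real paper), the function names `to_bibtex` and `to_ris` are hypothetical, and only a handful of fields are covered; a real exporter would also handle volume, pages, and special characters.

```python
# Hypothetical sample record; every value here is invented for illustration.
SAMPLE = {
    "authors": ["Doe, Jane", "Roe, Richard"],
    "title": "An Example Paper",
    "journal": "Journal of Examples",
    "year": 2024,
    "doi": "10.1234/example.5678",
}


def to_bibtex(ref: dict) -> str:
    """Render a reference dict as a BibTeX @article entry."""
    # Citation key: first author's family name plus year, e.g. "doe2024".
    family_name = ref["authors"][0].split(",")[0].strip().lower()
    key = f"{family_name}{ref['year']}"
    fields = {
        "author": " and ".join(ref["authors"]),  # BibTeX joins authors with "and"
        "title": ref["title"],
        "journal": ref["journal"],
        "year": str(ref["year"]),
        "doi": ref["doi"],
    }
    body = ",\n".join(f"  {name} = {{{value}}}" for name, value in fields.items())
    return f"@article{{{key},\n{body}\n}}"


def to_ris(ref: dict) -> str:
    """Render the same reference as an RIS record (one tag per line)."""
    lines = ["TY  - JOUR"]  # record type: journal article
    lines += [f"AU  - {author}" for author in ref["authors"]]  # one AU line per author
    lines += [
        f"TI  - {ref['title']}",
        f"JO  - {ref['journal']}",
        f"PY  - {ref['year']}",
        f"DO  - {ref['doi']}",
        "ER  - ",  # end-of-record marker
    ]
    return "\n".join(lines)


print(to_bibtex(SAMPLE))
print(to_ris(SAMPLE))
```

The same metadata feeds both functions, which is the practical point of the table: the format is a rendering choice, so it is safer to store the underlying fields (or just the DOI) and render on demand.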
Step by step: manage a paper and export its citation
- Rename on download. Save each paper as FirstAuthor_Year_ShortTitle.pdf and drop it in the right project folder. Doing this at download time costs seconds; doing it later for nine hundred files costs a weekend. See PDF naming conventions for a scheme that scales.
- Capture the DOI and metadata. Note the DOI from the first page or footer. If it is missing, read the PDF's document properties for the embedded title and author. The DOI is the one field worth never losing; everything else can be regenerated from it.
- OCR scanned papers first. If a paper is image-only, run PDF OCR to add a text layer so you can select the title and DOI. Verify author names, which OCR handles least reliably.
- Export the citation. Convert the DOI or captured metadata into your format with the Citation Formatter (BibTeX for LaTeX, RIS for EndNote/Mendeley/Zotero) and import it into your reference manager rather than typing it.
- Merge supplements, split bundles. Combine a main paper with its supplementary information into one archival file using Merge PDF, or break a multi-paper download into individual files with Split PDF so each paper is one citable file.
- Compress the heavy ones. Scanned and figure-heavy PDFs bloat a library; compress those to roughly 150 to 200 DPI for storage and sharing, keeping a full-resolution copy only where you need it.
- Back up the library. Keep the renamed PDFs and your exported citation files (.bib/.ris) together and backed up, so the bibliography is reproducible even if your reference manager database is lost.
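The first two steps can be scripted. The sketch below is one possible approach, not a ScoutMyTool feature: `paper_filename` builds a FirstAuthor_Year_ShortTitle.pdf name from fields you supply, and `find_doi` pulls the first DOI-shaped string out of text you have already extracted from a PDF's first page. Both function names are hypothetical, and the regular expression is a common heuristic for modern DOIs rather than a full validator.

```python
import re
import unicodedata
from typing import Optional

# Heuristic: "10.", a 4-9 digit registrant code, a slash, then a suffix
# that runs until whitespace, a quote, or an angle bracket.
DOI_PATTERN = re.compile(r"\b10\.\d{4,9}/[^\s\"<>]+")


def find_doi(text: str) -> Optional[str]:
    """Return the first DOI-like string in text, or None if there is none."""
    match = DOI_PATTERN.search(text)
    if match is None:
        return None
    # Sentence punctuation often trails a DOI in running text; strip it.
    return match.group(0).rstrip(".,;)")


def _clean(part: str) -> str:
    """Drop accents, spaces, and special characters; capitalise each word."""
    ascii_part = (
        unicodedata.normalize("NFKD", part).encode("ascii", "ignore").decode("ascii")
    )
    words = re.findall(r"[A-Za-z0-9]+", ascii_part)
    return "".join(word[:1].upper() + word[1:] for word in words)


def paper_filename(first_author: str, year: int, short_title: str) -> str:
    """Build a FirstAuthor_Year_ShortTitle.pdf filename."""
    return f"{_clean(first_author)}_{year}_{_clean(short_title)}.pdf"
```

For example, `paper_filename("Müller", 2021, "deep learning review")` gives `Muller_2021_DeepLearningReview.pdf`. Stripping accents is a deliberate trade-off here: it keeps filenames safe for command-line and sync tools at the cost of exact spelling, which the citation file preserves anyway.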
Related reading and tools
- Academic PDF workflows: the broader thesis-and-publishing toolkit.
- Extracting complex tables: pulling data tables out of papers.
- Preparing PDFs for LLMs: feeding papers to AI reading tools cleanly.
- Embedding fonts: keeping equations and symbols intact in submissions.
- Adding a table of contents: navigating merged papers and supplements.
- Citation Formatter tool: export BibTeX/RIS in your browser.
- All ScoutMyTool PDF tools: the full toolkit.
FAQ
- How should I name and organise downloaded paper PDFs?
- A consistent naming scheme is the single highest-return habit, because it makes papers findable years later without a database. A widely used pattern is FirstAuthorLastname_Year_ShortTitle.pdf, for example Abramson_2024_AlphaFold3.pdf. Keep one folder per project or per topic rather than one giant downloads folder, and avoid spaces and special characters that break command-line and sync tools. The goal is that you can identify a paper from its filename alone and that files sort sensibly. If you use a reference manager, let it own the canonical copy and rename attachments automatically; if you do not, the manual scheme is enough for a few hundred papers.
- How do I get a citation out of a paper PDF?
- There are three reliable routes, in order of preference. First, find the DOI: most modern papers print it on the first page or in the footer, and one DOI resolves to complete, authoritative metadata you can convert into any citation format. Second, read the PDF's embedded document metadata (title, author, sometimes DOI) from its properties. Third, if neither exists (older or scanned papers), extract the title and author text from the first page and look it up. Once you have a DOI or clean title, export to BibTeX or RIS and import into your reference manager. Avoid retyping citations by hand; it is the main source of bibliography errors.
- What is a DOI and why does it matter for citation?
- A DOI (Digital Object Identifier) is a permanent, unique identifier assigned to a publication, something like 10.1038/s41586-024-07487-w. Unlike a URL, it does not break when a journal reorganises its site, because it resolves through a central system to the current location. For citation work it is gold: registries such as Crossref let you turn a single DOI into a fully formatted reference in any style, with correct author lists, page numbers, and dates. Capturing the DOI when you save a paper means you never have to reconstruct the citation later; everything else can be regenerated from it.
- Can I extract citations from a scanned (image-only) PDF?
- Not directly: an image-only PDF has no text layer, so there is nothing to copy or parse. Run an OCR pass first to add a searchable text layer, then extract the title, authors, and any DOI from the recognised text. OCR accuracy on a clean journal scan is high for body text but can mangle author names with diacritics and complex reference lists, so verify the key fields. The most reliable shortcut even for scans is to OCR just the first page, recover the title or DOI, and then pull the authoritative metadata from a DOI registry rather than trusting the OCR of the full reference string.
- How do I combine a paper with its supplementary materials?
- Supplements often arrive as separate files: a main PDF plus supplementary PDFs, sometimes figures or tables. For a single archival copy, merge them in reading order: main text first, then supplementary information, then any appendices, and add a bookmark or a table of contents so a reader can jump between sections. Conversely, when a single download bundles several papers or a whole proceedings into one large PDF, split it into individual papers so each can be named, cited, and managed separately. Keeping one paper per file is what makes a reference library tidy.
- Should I compress paper PDFs to save space?
- Selectively. A text-based journal PDF is already small and gains little from compression. The candidates worth compressing are scanned papers and figure-heavy PDFs, which can be tens of megabytes each and add up fast across a large library. Compress those to a sensible resolution (around 150 to 200 DPI keeps figures readable), and keep an uncompressed copy of anything you might need at full resolution for re-use, such as a figure you plan to reproduce. For sharing with collaborators or attaching to email, a compressed copy avoids attachment-size limits without harming legibility.
- Is it safe to use online tools with unpublished or embargoed manuscripts?
- Treat unpublished manuscripts, papers under review, and embargoed work as confidential. Many online PDF tools upload your file to their servers, which may breach a journal's confidentiality terms or a co-author agreement. ScoutMyTool runs its PDF operations client-side in your browser tab, so the manuscript never leaves your machine, which is appropriate for sensitive drafts and peer-review material. For any document you would not post publicly, confirm the tool processes locally or that your institution has an agreement with the vendor before uploading.
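The DOI-to-citation route described in the answers above can also be done from a script using DOI content negotiation: you request the DOI's resolver URL with an `Accept` header naming the format you want. The sketch below uses only the Python standard library; the function names `build_request` and `fetch_citation` are hypothetical, and which media types are honoured depends on the agency that registered the DOI, so treat the format table as an assumption to verify for your papers. Only the DOI string is sent over the network, not the PDF.

```python
import urllib.parse
import urllib.request

# Media types commonly accepted for DOI content negotiation.
# Assumption: support varies by registration agency (e.g. Crossref, DataCite).
FORMATS = {
    "bibtex": "application/x-bibtex",
    "ris": "application/x-research-info-systems",
    "csl-json": "application/vnd.citationstyles.csl+json",
}


def build_request(doi: str, fmt: str = "bibtex") -> urllib.request.Request:
    """Build (but do not send) a content-negotiation request for a DOI."""
    # Percent-encode unusual characters in the DOI but keep the slash.
    url = "https://doi.org/" + urllib.parse.quote(doi, safe="/")
    return urllib.request.Request(url, headers={"Accept": FORMATS[fmt]})


def fetch_citation(doi: str, fmt: str = "bibtex", timeout: float = 10.0) -> str:
    """Send the request and return the citation text in the chosen format."""
    request = build_request(doi, fmt)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.read().decode("utf-8")
```

Calling `fetch_citation("10.1038/s41586-024-07487-w")` should return a BibTeX entry you can append to your .bib file, assuming network access and that the registry serves that format. Splitting request construction from the network call keeps the logic easy to check offline.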
Citations
- International DOI Foundation: the Digital Object Identifier system that assigns persistent identifiers to publications. doi.org
- Crossref: DOI registration agency whose API turns a DOI into formatted reference metadata. crossref.org
- Wikipedia: "BibTeX," the reference-list format used in LaTeX and supported by major reference managers. en.wikipedia.org/wiki/BibTeX
Export a clean citation in seconds
ScoutMyTool's Citation Formatter and OCR run entirely in your browser tab, so your papers, including unpublished manuscripts, never leave your machine.
Open the Citation Formatter