How to convert PDF to JSON — data extraction for developers

By ScoutMyTool Editorial Team · Last updated: 2026-05-20

Every data pipeline I have built in the last three years has had a PDF ingestion stage at the front: vendor invoices, regulatory filings, partner statements, research data tables. Getting clean JSON out the other side is the highest-leverage 200 lines of code in the stack: get it right and downstream code is a joy; get it wrong and every consumer of the data spends time on edge cases. This article maps the Python and JavaScript tools that produce JSON from PDFs, the schema-design choices that affect downstream usability, and the validation patterns that catch extraction errors before they propagate.

PDF-to-JSON tools compared

Tool	Language	Best for	Default JSON shape
pypdf	Python	Plain-text extraction; small scripts	Flat per-page strings
pdfplumber	Python	Tables, structured documents, financial statements	Tables as rows[][cells]; text positions
pdfminer.six	Python	Low-level character extraction with coordinates	Character-by-character with x/y positions
PyMuPDF (fitz)	Python	Speed; tables; text-with-bbox extraction	Page tree with blocks → lines → spans
unstructured.io	Python	Mixed / scanned / complex layouts; LLM pipelines	Element list with type, text, metadata
Camelot / Tabula-java	Python / Java	Table extraction; lattice and stream modes	Tables as nested arrays; one JSON per table
pdf-lib + pdfjs	JavaScript / browser	Client-side extraction; no server required	Custom — design your own structure

Step by step — build a PDF-to-JSON pipeline

1. Profile a sample of source PDFs. Open three or four representative documents. Note: text-based vs scanned, single vs multi-column, tables vs prose, consistent vs varying layout. The profile drives tool choice: text-based + single-column + prose maps to pypdf; scanned + complex tables maps to OCR + unstructured.io.
2. Define the target JSON schema. What does downstream code expect? Write the schema first as JSON Schema or a TypeScript / Pydantic type. The schema is the contract; the extractor is the implementation. Schema first prevents the common trap of letting the extractor library shape your data model.
3. Pick the extractor. Match it to the profile from step 1. For most pipelines: pdfplumber or PyMuPDF for born-digital PDFs with structure; unstructured.io for mixed or complex layouts. For browser-only workflows: PDF.js wrapped in a small extraction layer.
4. Write the extraction + transform code. 80% of the code is transformation from extractor output to your schema, not extraction itself. Map fields, normalise dates and currencies, validate types. Test with the sample PDFs from step 1; expand test fixtures as new edge cases surface in production.
5. Validate every output against the schema. JSON Schema validation catches missing required fields; assertion checks (totals reconcile, dates within ranges) catch semantic errors. Failed validations get queued for human review; passing extractions flow into the pipeline. The validation step is what makes the pipeline trustworthy at scale.
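
The transform and assertion steps above can be sketched with the standard library alone. This is a minimal illustration, not a definitive implementation: the raw record shape, the function names (`parse_money`, `parse_date`, `transform`, `semantic_errors`), the two accepted date formats, and the year-2000 lower bound are all assumptions made for the example.

```python
from datetime import date, datetime
from decimal import Decimal


def parse_money(text: str) -> Decimal:
    """Strip a dollar sign and thousands separators, then parse exactly."""
    return Decimal(text.replace("$", "").replace(",", "").strip())


def parse_date(text: str) -> date:
    """Try the source formats we expect (assumed; extend per vendor)."""
    for fmt in ("%Y-%m-%d", "%d.%m.%Y"):
        try:
            return datetime.strptime(text.strip(), fmt).date()
        except ValueError:
            continue
    raise ValueError(f"unrecognised date: {text!r}")


def transform(raw: dict) -> dict:
    """Map string-only extractor output onto the typed invoice schema."""
    items = [
        {
            "description": row["description"].strip(),
            "qty": int(row["qty"]),
            "unit_price": parse_money(row["unit_price"]),
            "total": parse_money(row["total"]),
        }
        for row in raw["line_items"]
    ]
    return {
        "vendor": raw["vendor"].strip(),
        "date": parse_date(raw["date"]),
        "line_items": items,
        "total": parse_money(raw["total"]),
    }


def semantic_errors(invoice: dict, today: date) -> list[str]:
    """Checks that a structural schema cannot express."""
    errors = []
    for i, item in enumerate(invoice["line_items"]):
        if item["qty"] * item["unit_price"] != item["total"]:
            errors.append(f"line {i}: qty * unit_price != total")
    if sum(item["total"] for item in invoice["line_items"]) != invoice["total"]:
        errors.append("line items do not sum to stated total")
    if not (date(2000, 1, 1) <= invoice["date"] <= today):
        errors.append("date outside plausible range")
    return errors
```

An empty list from `semantic_errors` means the record can flow on; anything else is a candidate for the human-review queue. `Decimal` is used instead of `float` so that currency sums compare exactly.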

Schema design patterns that compound

Three patterns I have seen pay back across multiple pipelines. First, include source provenance in every output: the original filename, the extractor version, the extraction timestamp. When downstream code finds a bug, provenance lets you re-run the same extraction and verify whether the bug is in the source PDF or the extractor. Without provenance, debugging production data quality issues is detective work.
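
As a rough sketch of the provenance pattern, the wrapper below attaches the filename, extractor version, and timestamp named above to every output. The field names, the `EXTRACTOR_VERSION` constant, and the `with_provenance` helper are hypothetical, and the SHA-256 content hash is an extra assumption of this example rather than something the pattern requires.

```python
import hashlib
from datetime import datetime, timezone

EXTRACTOR_VERSION = "0.1.0"  # hypothetical; in practice read from your package


def with_provenance(payload: dict, filename: str, pdf_bytes: bytes) -> dict:
    """Wrap extracted data in an envelope that records where it came from."""
    return {
        "provenance": {
            "source_filename": filename,
            # Assumed extra: a content hash distinguishes two files that
            # share a name but differ in bytes.
            "source_sha256": hashlib.sha256(pdf_bytes).hexdigest(),
            "extractor_version": EXTRACTOR_VERSION,
            "extracted_at": datetime.now(timezone.utc).isoformat(),
        },
        "data": payload,
    }
```

Keeping provenance in its own top-level key, separate from `data`, means the document schema and the provenance schema can evolve independently.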

Second, version the schema explicitly. A `schema_version` field at the top level lets downstream code reject unfamiliar versions and lets you evolve the schema without breaking existing consumers. Schema changes are inevitable as new edge cases drive new fields; treating the schema as a stable contract via versioning is the only sustainable pattern.
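
One way a consumer might act on `schema_version` is sketched below. The version strings, the `invoice_total` to `total` rename, and the `load_document` name are invented for illustration; the point is only that unknown versions are rejected loudly and known older versions are upgraded in one place.

```python
SUPPORTED_VERSIONS = {"1.0", "1.1"}  # hypothetical version history


class UnsupportedSchemaVersion(ValueError):
    """Raised when a document declares a version this consumer does not know."""


def load_document(doc: dict) -> dict:
    """Reject unfamiliar versions; upgrade older known ones to the latest."""
    version = doc.get("schema_version")
    if version not in SUPPORTED_VERSIONS:
        raise UnsupportedSchemaVersion(f"cannot read schema_version={version!r}")
    if version == "1.0":
        # Assumed migration: 1.1 renamed `invoice_total` to `total`.
        upgraded = {k: v for k, v in doc.items() if k != "invoice_total"}
        upgraded["total"] = doc["invoice_total"]
        upgraded["schema_version"] = "1.1"
        return upgraded
    return doc
```

A document with no `schema_version` at all is treated the same as an unknown version, which keeps unversioned legacy output from slipping through silently.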

Third, distinguish between "missing" and "null" values in the JSON. `null` means "the source PDF had no value here, and that is normal"; a missing field means "the extractor did not look for this field". The semantic difference matters for downstream code that handles missing data; conflating the two produces silent bugs. JSON Schema captures the distinction with the `required` list (the key must be present) and a type union that includes `"null"` (the value may be null); `nullable` is the OpenAPI 3.0 spelling, not a JSON Schema keyword.
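
To make the missing-versus-null distinction concrete, here is a small sketch. The schema is written as a Python dict in JSON Schema Draft 2020-12 style, where "may be null" is expressed as a type union that includes `"null"`. The field names (`po_number`, `notes`) and the `field_state` helper are hypothetical, and the schema is shown for shape only: enforcing it would need a validator library, which this example does not call.

```python
import json

INVOICE_SCHEMA = {
    "$schema": "https://json-schema.org/draft/2020-12/schema",
    "type": "object",
    # Keys the extractor always looks for, so they must be present.
    "required": ["vendor", "po_number"],
    "properties": {
        "vendor": {"type": "string"},
        # Present but possibly empty in the source: null is a valid value.
        "po_number": {"type": ["string", "null"]},
        # Optional: absent when the extractor did not look for it.
        "notes": {"type": "string"},
    },
}


def field_state(record: dict, key: str) -> str:
    """Classify a field as 'missing', 'null', or 'present'."""
    if key not in record:
        return "missing"
    if record[key] is None:
        return "null"
    return "present"


record = json.loads('{"vendor": "Acme", "po_number": null}')
states = {key: field_state(record, key) for key in INVOICE_SCHEMA["properties"]}
```

Note that `record.get("po_number")` and `record.get("notes")` both return `None` here, which is exactly the conflation to avoid; the `key in record` test is what keeps the two cases apart.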

PDF for LLM input: extracting PDFs for RAG and embedding pipelines.
PDF table to CSV: tabular extraction; CSV as the flat, rows-only alternative to JSON.
PDF to text: simpler extraction for prose content.
Searchable PDF: OCR step before any structured extraction.
PDF to Markdown: intermediate format for documentation pipelines.
PDF to spreadsheet: when the target is Excel rather than JSON.
PDF form email submit: form-derived JSON via FDF parsing.

FAQ

What JSON schema should I extract PDFs into?: Match the schema to the downstream use case. For full-text indexing or RAG embedding: a flat list of `{page, text}` records is sufficient. For structured data extraction (invoices, statements): a typed schema with `{vendor, date, line_items[{description, qty, unit_price, total}], total}` keyed to the document type. For document-AI pipelines feeding into LLMs: unstructured.io's element-list format (`{type: "Title|NarrativeText|ListItem|Table|Image", text, metadata}`) is the de facto standard and integrates with most LLM frameworks. Avoid mirroring the PDF visual structure directly — JSON is for downstream code consumption, not visual fidelity.
How do I handle tables that span multiple pages?: Detect continuation by header matching. When a table on page N+1 has the same column headers as the table on page N, append its rows to the earlier table so the output JSON holds a single table object. Camelot and pdfplumber return each table as rows of cells rather than labelling a header row, so treat the first row of each extracted table as its header and compare those rows programmatically. For financial statements (transaction logs, account statements) the multi-page pattern is common and worth handling at extraction time rather than at downstream parse time. Store a `pages` array in the table JSON so you can trace back to the source page numbers when debugging extraction errors.
Can I extract from a scanned PDF directly to JSON?: Not without OCR first. A scanned PDF is image-only, and no extractor can produce structured JSON when there is no text layer to read. The two-step path: OCR the PDF (Tesseract, OCRmyPDF, Google Vision, AWS Textract, Azure Form Recognizer, since renamed Azure AI Document Intelligence), then run a JSON-emitting extractor on the now-text-bearing PDF. For high-volume scanned-document pipelines, the cloud document-AI services (Textract, Form Recognizer, Document AI) combine OCR and structured extraction in one API call, which is typically more accurate than a DIY two-step for complex forms. Expect roughly $0.50–$5 per 1000 pages for plain OCR depending on plan; form and table extraction features are usually priced higher, so check the current rate card.
How do I preserve bounding-box positions in the JSON output?: Use pdfplumber or PyMuPDF. Both expose coordinates for each word or text span (pdfplumber through `extract_words()`, PyMuPDF through `page.get_text("words")` or `page.get_text("dict")`), which you can map to a `{text, x0, y0, x1, y1, page}` record per text fragment. The bounding boxes are useful for visual highlighting (showing the extracted region overlaid on the source PDF), reproducible debugging (jump back to the exact source location of any extracted value), and feeding into multimodal LLMs that consume position-aware text. For pure-text downstream use cases the bounding boxes are noise; flatten to plain text. For document-AI applications the bounding boxes are essential signal; preserve them.
What about JSON output from unstructured.io specifically?: unstructured.io is purpose-built for the LLM-pipeline use case and produces a list of typed elements: Title, NarrativeText, ListItem, Table (with HTML representation), Image, PageBreak. Each element carries metadata including page number, coordinates (optionally), and detection confidence. The element list is the right granularity for chunking before embedding — each element is roughly one semantic unit. LangChain and LlamaIndex both have first-class unstructured.io loaders. For new pipelines in 2026, unstructured.io is the default; reach for lower-level tools when you need finer control or a smaller dependency footprint.
Can I do this client-side in the browser without a Python backend?: Yes. PDF.js extracts text and positions; pdf-lib reads and edits document structure (pages, metadata, form fields) but does not extract text. Both run in the browser via JavaScript. ScoutMyTool PDF to text and PDF to CSV demonstrate the pattern: your PDF stays on the machine and the extraction happens in the browser tab. For client-side JSON emission, write a small wrapper around PDF.js that produces your desired schema. The constraint: browser memory is bounded, so very large PDFs (1000+ pages) may need chunking. For typical document-extraction workloads under 100 pages, client-side is comfortable and avoids the server-side privacy concerns.
How do I validate the extracted JSON against my schema?: JSON Schema (json-schema.org) is the industry standard. Define your expected document schema once, validate every extraction against it before downstream use. For Python pipelines, the `jsonschema` library handles validation; for TypeScript / JavaScript, `ajv` is the canonical choice. Validation catches extraction errors early — a missing required field surfaces as a schema-validation failure rather than as a downstream pipeline crash. For high-value extractions (financial reports, regulatory filings), build assertion checks beyond schema: totals reconcile, dates are within reasonable ranges, line items sum to the stated total. The assertions catch semantic errors that schema validation cannot.
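
The header-matching approach from the multi-page tables answer can be sketched as a pure function over already-extracted tables. The input shape (a page-ordered list of `{"page", "rows"}` dicts whose first row is the header) and the names `normalise_header` and `merge_continued_tables` are assumptions for this example, not the output format of any particular library.

```python
def normalise_header(row: list) -> tuple:
    """Compare headers case-insensitively and ignore stray whitespace."""
    return tuple((cell or "").strip().lower() for cell in row)


def merge_continued_tables(tables: list[dict]) -> list[dict]:
    """Merge a table into the previous one when it sits on the next page
    and repeats the same header row."""
    merged: list[dict] = []
    for table in tables:
        if not table["rows"]:
            continue  # nothing extracted for this table
        header, body = table["rows"][0], table["rows"][1:]
        last = merged[-1] if merged else None
        if (
            last is not None
            and normalise_header(last["header"]) == normalise_header(header)
            and table["page"] == last["pages"][-1] + 1
        ):
            last["rows"].extend(body)
            last["pages"].append(table["page"])
        else:
            merged.append(
                {"header": list(header), "rows": list(body), "pages": [table["page"]]}
            )
    return merged
```

This sketch only catches continuations that repeat the header on each page. Statements that print the header once and continue with bare rows would need a different signal, such as matching column count and column x-positions.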

Citations

ISO 32000-1:2008 — "Document management — Portable document format" — text and structure model.
RFC 8259 — "The JavaScript Object Notation (JSON) Data Interchange Format".
JSON Schema specification — Draft 2020-12 (json-schema.org).
unstructured.io — open-source document parsing library documentation.
pdfplumber, PyMuPDF (fitz), and pypdf — Python library documentation.
Mozilla PDF.js — JavaScript PDF rendering and extraction library.

Browser-based PDF text extraction for JSON pipelines

ScoutMyTool PDF to text runs client-side. For sensitive documents where you want a quick text dump without exposing the PDF to a server, the browser tool is the privacy-safe starting point — pipe the extracted text into your downstream JSON-shaping code.