How to extract data from PDF for AI / LLM input (LangChain, llama-index)
By ScoutMyTool Editorial Team · Last updated: 2026-05-20
Introduction
Every RAG system I have shipped this year has hit the same wall first: the enterprise documents people want to query are PDFs, and clean text extraction from arbitrary PDFs is harder than the LangChain tutorials make it look. Born-digital text PDFs are easy; multi-column research papers, scanned-then-OCR'd documents, financial reports with tables embedded as figures, and mixed-source bundles are not. This article walks through the extractor landscape, a chunking strategy that holds up across document types, and the specific patterns for OCR pre-processing, table preservation, and very large documents.
PDF text extractors compared
| Extractor | Best for | Weakness |
|---|---|---|
| PyPDF / pypdf | Quick text extraction, low setup cost, born-digital PDFs | Tables become whitespace-separated text; multi-column papers scrambled |
| pdfplumber | Tables, structured documents, financial reports | Slower; no OCR |
| pdfminer.six | Lower-level character extraction, position-aware text | Verbose output; needs post-processing |
| unstructured.io | Heterogeneous documents (scanned, mixed, complex layouts) | Heavier dependency footprint; slower per page |
| PyMuPDF / fitz | Speed; preserves text positioning; good column detection | AGPL license: incompatible with some commercial use |
| Marker (marker-pdf) | High-quality conversion to Markdown with structure preserved | GPU recommended for speed; heavier setup |
| Tesseract via OCRmyPDF | Scanned PDFs and image-only pages | OCR-quality dependent on scan quality; slower |
Step by step: build a RAG-ready PDF pipeline
- OCR scanned pages first. Use Make PDF Searchable or OCRmyPDF to add a text layer to image-only PDFs. Skip born-digital PDFs.
- Extract text with the right extractor for the document type: pypdf for simple, pdfplumber for tables, unstructured.io for complex mixed documents.
- Chunk on natural boundaries. Use LangChain RecursiveCharacterTextSplitter with chunk_size=1000, chunk_overlap=200, split on paragraph/sentence boundaries.
- Embed chunks with a current model (text-embedding-3-small from OpenAI; voyage-3 from Voyage; bge-large-en open-source). Store in a vector index (Pinecone, Weaviate, sqlite-vec, hnswlib).
- Verify retrieval quality. Run 10 to 20 known-good queries against the index, inspect the top-5 retrieved chunks per query, and confirm relevance. Tweak chunking or embedding model if retrieval is poor.
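The verification step is easy to automate once you have a handful of known-good queries. The sketch below is a minimal illustration, not a fixed recipe: `hit_rate_at_k`, `toy_retrieve`, and the three-sentence corpus are hypothetical names and data invented for this example, and the word-overlap ranking is only a stand-in for a real vector search.

```python
from typing import Callable, Sequence

# Toy corpus standing in for indexed PDF chunks (invented for this example).
CORPUS = [
    "Invoices are payable within 30 days of receipt.",
    "The warranty period is 24 months from delivery.",
    "Support is available on weekdays from 9am to 5pm.",
]


def toy_retrieve(query: str, k: int) -> list[str]:
    """Rank chunks by shared lowercase words. Swap in your vector index here."""
    words = set(query.lower().split())
    ranked = sorted(
        CORPUS,
        key=lambda chunk: len(words & set(chunk.lower().split())),
        reverse=True,
    )
    return ranked[:k]


def hit_rate_at_k(
    cases: Sequence[tuple[str, str]],
    retrieve: Callable[[str, int], list[str]],
    k: int = 5,
) -> float:
    """Fraction of queries whose expected snippet appears in the top-k chunks."""
    if not cases:
        return 0.0
    hits = 0
    for query, expected_snippet in cases:
        chunks = retrieve(query, k)[:k]
        if any(expected_snippet.lower() in chunk.lower() for chunk in chunks):
            hits += 1
    return hits / len(cases)


# Each case pairs a query with a snippet the right chunk must contain.
cases = [
    ("how long is the warranty period", "24 months"),
    ("when are invoices payable", "30 days"),
]
score = hit_rate_at_k(cases, toy_retrieve, k=1)
print(f"hit rate@1: {score:.2f}")
```

Tracking this one number before and after a chunking or embedding change makes it much easier to tell whether the change helped.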
Related reading
- PDF to Markdown: best intermediate format for LLM input.
- PDF to text: simpler extraction for plain prose.
- PDF table to CSV: structured extraction for tabular content.
- Searchable PDF: OCR step before any text extraction.
- Scanned PDF to Word: alternative format for paragraph reconstruction.
- PDF table of contents: section detection for hierarchical chunking.
- All ScoutMyTool PDF tools: browser-based extraction tools.
FAQ
- I extracted text from a PDF and the result is garbled. What is the issue?
- Three usual causes. First, the PDF is image-only (scanned) and you ran a text extractor that does not OCR, so the extractor returns empty or near-empty text. Run OCRmyPDF or ScoutMyTool Make PDF Searchable first to add a text layer, then re-extract. Second, the PDF uses an unusual font with private-use Unicode mapping: the visible glyphs are correct but the underlying character codes are non-standard. Try re-rendering the PDF (Ghostscript with `-dCompatibilityLevel=1.7 -sDEVICE=pdfwrite`); if the text is still garbled, run OCR over the file to bypass the encoding issue. Third, the PDF is two-column or multi-column and your extractor reads left-to-right across columns; switch to pdfplumber or PyMuPDF, which handle column layouts better.
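The first two causes can often be told apart from the extracted text alone. The following is a rough heuristic sketch: `diagnose_extraction` and both thresholds are assumptions made up for this example, not values taken from any library. Scrambled column order cannot be detected this way and still needs a visual check.

```python
import unicodedata


def diagnose_extraction(
    text: str, min_chars: int = 20, max_bad_ratio: float = 0.1
) -> str:
    """Guess why extracted page text looks wrong.

    The thresholds are illustrative assumptions; tune them on your own files.
    """
    stripped = "".join(text.split())  # drop all whitespace
    if len(stripped) < min_chars:
        # Little or no text: probably a scanned page with no text layer.
        return "likely-image-only"
    # Count private-use code points (category "Co") and U+FFFD replacements.
    bad = sum(
        1
        for ch in stripped
        if unicodedata.category(ch) == "Co" or ch == "\ufffd"
    )
    if bad / len(stripped) > max_bad_ratio:
        # Glyphs render fine but map to non-standard character codes.
        return "likely-font-mapping"
    return "looks-ok"


print(diagnose_extraction(""))
print(diagnose_extraction("\ue001\ue002" * 20))
print(diagnose_extraction("Invoices are payable within 30 days of receipt."))
```

Running this per page rather than per document helps with mixed bundles, where only some pages are scans.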
- What is the right chunking strategy for RAG (retrieval-augmented generation)?
- Two parameters: chunk size and chunk overlap. Common defaults: 1,000 characters per chunk with 200-character overlap. The justification: many embedding models accept inputs up to ~512 tokens (about 2,000 characters), so chunks of 1,000 characters fit comfortably; overlap preserves context across boundaries so a query that hits the end of one chunk also retrieves the start of the next. For dense technical content (legal contracts, research papers), use smaller chunks (500 characters) so each chunk is more topically uniform. For narrative content, larger chunks (1,500 characters) preserve context. Always chunk on natural boundaries (paragraph or sentence) rather than mid-sentence. LangChain's RecursiveCharacterTextSplitter helps here: by default it tries paragraph breaks first, then line breaks, then spaces. Test by retrieving for known-good queries and seeing which chunks come back.
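To make the size and overlap parameters concrete, here is a small pure-Python sketch of sentence-aware chunking. It is a simplified stand-in written for this article, not LangChain's implementation; `split_units` and `chunk_text` are hypothetical helper names, and the regex sentence splitter is deliberately naive.

```python
import re


def split_units(text: str) -> list[str]:
    """Split text into sentences without crossing a blank-line paragraph break."""
    units: list[str] = []
    for para in re.split(r"\n\s*\n", text):
        para = " ".join(para.split())  # collapse internal whitespace
        if para:
            units.extend(re.split(r"(?<=[.!?])\s+", para))
    return units


def chunk_text(text: str, chunk_size: int = 1000, overlap: int = 200) -> list[str]:
    """Greedily pack whole sentences into chunks of at most chunk_size chars.

    A single sentence longer than chunk_size is kept whole (oversized chunk).
    """
    chunks: list[str] = []
    current: list[str] = []
    for unit in split_units(text):
        if current and len(" ".join(current + [unit])) > chunk_size:
            chunks.append(" ".join(current))
            # Carry trailing sentences (up to `overlap` chars) into the next chunk.
            tail: list[str] = []
            for prev in reversed(current):
                if len(" ".join([prev] + tail)) > overlap:
                    break
                tail.insert(0, prev)
            if len(" ".join(tail + [unit])) > chunk_size:
                tail = []  # keep the size limit rather than the overlap
            current = tail
        current.append(unit)
    if current:
        chunks.append(" ".join(current))
    return chunks


sample = "Alpha one. Alpha two.\n\nBeta one. Beta two. Beta three."
chunks = chunk_text(sample, chunk_size=25, overlap=10)
for chunk in chunks:
    print(repr(chunk))
```

Because the overlap is built from whole sentences, each chunk starts and ends on a sentence boundary, which is the property the defaults above are trying to protect.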
- How do I preserve table structure when extracting for LLM input?
- Tables matter for LLMs because flattening them to whitespace-separated text loses the semantic relationship between cells. Three working approaches. First, extract tables separately using pdfplumber or Camelot, convert them to Markdown table format, and insert them into the document text where they appeared. Second, convert the entire PDF to Markdown using Marker or unstructured.io, which preserve table structure as Markdown tables. Third, extract tables as JSON ({"headers": [...], "rows": [[...]]}) and feed them as structured data alongside the document text. The LLM understands all three formats, but Markdown tables are the most token-efficient and have the best LLM performance in our experience.
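For the first approach, the conversion step itself is small. The sketch below assumes each table arrives as a list of rows with the header first and cells that may be `None`, which is the shape pdfplumber's `page.extract_tables()` yields for each table; `table_to_markdown` is a hypothetical helper written for this example, and the sample rows are invented.

```python
from typing import Optional, Sequence


def table_to_markdown(rows: Sequence[Sequence[Optional[str]]]) -> str:
    """Render rows (header first) as a Markdown table; None becomes empty."""

    def clean(cell: Optional[str]) -> str:
        text = "" if cell is None else str(cell)
        # Collapse line breaks inside cells and escape pipes so columns hold.
        return " ".join(text.split()).replace("|", "\\|")

    if not rows:
        return ""
    width = max(len(row) for row in rows)
    # Pad short rows so every line has the same number of columns.
    norm = [[clean(c) for c in row] + [""] * (width - len(row)) for row in rows]
    header, body = norm[0], norm[1:]
    lines = [
        "| " + " | ".join(header) + " |",
        "|" + "|".join(["---"] * width) + "|",
    ]
    lines += ["| " + " | ".join(row) + " |" for row in body]
    return "\n".join(lines)


rows = [["Quarter", "Revenue"], ["Q1", "1.2"], ["Q2", None]]
markdown = table_to_markdown(rows)
print(markdown)
```

Keeping the Markdown table in the same chunk as the paragraph that introduces it tends to matter as much as the format itself.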
- My RAG over PDF documents returns wrong answers. What do I check?
- Five-step debug. First, manually verify the source PDF contains the correct answer (sometimes the document itself is wrong or ambiguous). Second, check retrieval quality: search the vector index for the query and see what chunks come back; if the right chunk is not in the top 5 results, your chunking or embedding model is the issue. Third, check whether the retrieved chunk contains the answer text; if it does, the LLM is failing to use the context, so tweak the prompt or switch model. Fourth, verify chunking: if the answer spans two chunks because of an unfortunate cut, the LLM has only half the context; increase overlap. Fifth, check OCR quality if the source is scanned, because bad OCR means garbage embeddings.
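Most of this checklist can be scripted for a question whose answer you already know. The sketch below is one possible triage order, with assumptions: `triage` and its return labels are invented for this example, and "contains the answer" is approximated by a case-insensitive substring match, which will miss paraphrased answers.

```python
from typing import Sequence


def _norm(text: str) -> str:
    """Lowercase and collapse whitespace so line breaks do not hide matches."""
    return " ".join(text.lower().split())


def triage(
    answer: str,
    source_text: str,
    all_chunks: Sequence[str],
    retrieved: Sequence[str],
    k: int = 5,
) -> str:
    """Point at the pipeline stage most likely responsible for a wrong answer."""
    needle = _norm(answer)
    if needle not in _norm(source_text):
        # Extracted text lacks the answer: wrong document or bad OCR.
        return "check-source-or-ocr"
    if not any(needle in _norm(chunk) for chunk in all_chunks):
        # The answer exists but no single chunk holds it whole.
        return "check-chunk-boundaries"
    if not any(needle in _norm(chunk) for chunk in retrieved[:k]):
        # A good chunk exists but did not rank in the top k.
        return "check-retrieval"
    # The context was there; the model failed to use it.
    return "check-prompt-or-model"


source = "The warranty period is 24 months from delivery. Support runs on weekdays."
chunks_ok = [
    "The warranty period is 24 months from delivery.",
    "Support runs on weekdays.",
]
chunks_cut = [
    "The warranty period is 24",
    "months from delivery. Support runs on weekdays.",
]
print(triage("24 months", source, chunks_cut, chunks_cut))
print(triage("24 months", source, chunks_ok, chunks_ok[1:]))
```

The order matters: a chunk-boundary problem looks like a retrieval problem from the outside, so it is worth ruling out before touching the embedding model.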
- Which LangChain PDF loader should I use?
- Depends on the document. PyPDFLoader (uses pypdf) is the default and the simplest to set up; it is fine for born-digital, single-column documents. UnstructuredPDFLoader (uses unstructured.io) handles mixed/complex/scanned PDFs but is slower. PDFMinerLoader gives finer-grained position info, useful when chunking on visual layout. PyMuPDFLoader is typically the fastest while preserving structure (note the AGPL licence). For complex documents, the unstructured.io loader is usually the right choice despite the speed cost. For RAG over large PDF corpora, profile a sample with each loader and pick based on the accuracy vs throughput trade-off.
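Profiling a sample does not need much tooling. In the sketch below, each loader is wrapped as a plain function from file path to extracted text, and accuracy is approximated as the share of hand-picked snippets that survive extraction. `profile_loaders`, the fake loaders, and the file names are all hypothetical; in practice each callable would wrap a real loader and join its page texts.

```python
import time
from typing import Callable, Mapping, Sequence


def profile_loaders(
    loaders: Mapping[str, Callable[[str], str]],
    sample_paths: Sequence[str],
    expected: Mapping[str, Sequence[str]],
) -> dict[str, dict[str, float]]:
    """Time each loader and measure how many expected snippets it recovers."""
    total = sum(len(expected[path]) for path in sample_paths)
    report: dict[str, dict[str, float]] = {}
    for name, load in loaders.items():
        start = time.perf_counter()
        texts = {path: load(path) for path in sample_paths}
        seconds = time.perf_counter() - start
        found = sum(
            1
            for path in sample_paths
            for snippet in expected[path]
            if snippet in texts[path]
        )
        report[name] = {
            "seconds": seconds,
            "recall": found / total if total else 0.0,
        }
    return report


# Fake documents and loaders so the sketch runs without any PDF library.
fake_docs = {"a.pdf": "Revenue rose 12% in Q1.", "b.pdf": "Total liabilities fell."}
loaders = {
    "exact": lambda path: fake_docs[path],
    "lossy": lambda path: fake_docs[path][:8],  # simulates a weak extractor
}
expected = {"a.pdf": ["12%"], "b.pdf": ["liabilities"]}
report = profile_loaders(loaders, ["a.pdf", "b.pdf"], expected)
for name, stats in report.items():
    print(name, f"recall={stats['recall']:.2f}", f"seconds={stats['seconds']:.4f}")
```

Choose snippets from the hard parts of your documents (table cells, second columns, footnotes), since every loader does well on plain paragraphs.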
- Can I do PDF extraction client-side in the browser, without a Python server?
- Yes. ScoutMyTool's PDF to text and PDF to Markdown tools run entirely in the browser using pdf-lib and PDF.js. Your PDF is never uploaded to a server. For client-side RAG (embedding generation in the browser using Transformers.js, vector search using sqlite-vec WASM or hnswlib-wasm, retrieval prompts to Claude/Gemini/etc. via their browser SDKs), the whole pipeline can stay in-browser. This is meaningful for privacy-sensitive documents (legal, medical, financial) where you do not want to spin up a server that ever holds the documents.
- How do I handle very large PDFs (1,000+ pages) for LLM input?
- Two strategies. First, do not load the whole document into a single LLM context; use RAG instead, where you embed and index chunks and retrieve only the relevant chunks at query time. Even for models with 1M-token context windows, throwing a whole 1,000-page document into context per query is expensive and slow. Second, when you do need the full document context (e.g. summarisation), use hierarchical summarisation: summarise each chapter independently, then summarise the summaries. This map-reduce pattern handles arbitrarily long documents with bounded per-call cost. LangChain's `load_summarize_chain` with `chain_type="map_reduce"` implements this pattern.
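The map-reduce pattern is independent of any framework. Here is a minimal sketch that takes the summariser as a plain callable, so any LLM client can be plugged in. `summarise_hierarchically`, `max_chars`, and the truncating `fake_summarise` stand-in are assumptions made up for this example; a real summariser would call a model.

```python
from typing import Callable, Sequence


def summarise_hierarchically(
    sections: Sequence[str],
    summarise: Callable[[str], str],
    max_chars: int = 4000,
) -> str:
    """Summarise each section (map), then merge summaries in batches (reduce)."""
    summaries = [summarise(section) for section in sections]
    while len(summaries) > 1:
        # Group summaries into batches whose joined text fits max_chars.
        batches: list[list[str]] = []
        current: list[str] = []
        for summary in summaries:
            if current and len("\n\n".join(current + [summary])) > max_chars:
                batches.append(current)
                current = []
            current.append(summary)
        batches.append(current)
        if len(batches) == len(summaries):
            # Nothing fit together; merge pairs anyway so the loop terminates.
            batches = [summaries[i : i + 2] for i in range(0, len(summaries), 2)]
        summaries = [summarise("\n\n".join(batch)) for batch in batches]
    return summaries[0] if summaries else ""


calls: list[str] = []


def fake_summarise(text: str) -> str:
    """Stand-in for an LLM call: records the input and truncates it."""
    calls.append(text)
    return text[:10]


sections = ["a" * 15, "b" * 15, "c" * 15]
result = summarise_hierarchically(sections, fake_summarise, max_chars=25)
print(result, len(calls))
```

Every call sees at most one section or one batch of summaries, which is what keeps per-call cost bounded regardless of document length.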
Citations
- LangChain documentation: document loaders and text splitters for PDF input.
- LlamaIndex documentation: PDF reader integrations and chunking strategies.
- unstructured.io documentation: complex-document parsing with layout awareness.
- OpenAI Embeddings documentation: text-embedding-3-small reference and best practices.
- Voyage AI documentation: voyage-3 retrieval embeddings.
Browser-based PDF extraction for LLM workflows
ScoutMyTool's PDF-to-text and PDF-to-Markdown run client-side. Your knowledge-base documents stay on your machine during extraction; only the (anonymisable) embeddings need to be sent anywhere.
Open PDF-to-Markdown tool →