How to convert PDF to clean text for ChatGPT / Claude input

Get clean text from PDFs for ChatGPT, Claude, and other LLMs: extraction, cleanup, chunking.

By ScoutMyTool Editorial Team · Last updated: 2026-05-21

The most common use of LLMs in my work this year has been asking questions about PDFs: research papers, regulatory filings, vendor contracts. The quality of the answer depends heavily on the quality of the text the LLM sees, and the built-in PDF parsers are good but not great. A small upstream cleanup step often turns a vague response into a useful one. This article maps the workflow, the situations where direct PDF upload works fine, and the cases where manual extraction and cleanup produce meaningfully better results.

Approaches and trade-offs

Approach | Best for | Limit
Drag PDF into ChatGPT / Claude directly | Single-document Q&A; PDFs under ~50 pages | Quality of LLM's built-in PDF parser varies; long PDFs hit context limits
Extract text, paste into chat | Born-digital PDFs where text-only is enough | Loses image content; manual cleanup needed for tables / footnotes
OCR scanned PDFs first, then extract | Scanned documents you want to query | OCR errors propagate into LLM output; verify accuracy
Convert to Markdown, paste | Long structured documents; preserves headings | Tables and figures still imperfect
Use Claude Projects / ChatGPT Custom GPT | Recurring queries against same document set | Tied to specific platform; some context-window limits
RAG (retrieval-augmented generation) | Many PDFs; complex queries across a library | Requires more setup; embedding service cost

Step by step: clean PDF text for an LLM

  1. Decide between direct attach and manual extraction. For clean, born-digital PDFs under 50 pages, direct attach is fastest. For scanned PDFs, complex layouts, or content you want to clean up first, switch to manual extraction.
  2. Extract text with the right tool. Use ScoutMyTool PDF to text for client-side extraction (no upload); pdfplumber or Marker for command-line workflows. For scanned PDFs, run OCR first with ScoutMyTool Make PDF Searchable or OCRmyPDF.
  3. Clean the extracted text. Remove page-number lines, repeated headers and footers, equation residue, and citation markers such as [1] [2] [3] that distract the LLM. Open the text in a code editor and use regex substitutions to strip these patterns quickly. The cleanup takes 2 to 5 minutes and meaningfully improves LLM response quality.
  4. Paste into the LLM with explicit context framing. Start the conversation with "Below is text extracted from a [research paper / contract / report]. Please answer questions about it based only on what is in the text." followed by the extracted text, then your specific question. The framing focuses the LLM on the document content rather than its general training.
  5. Iterate on follow-up questions in the same conversation. Most LLMs handle multi-turn conversation about a single document well; you do not need to re-paste the document for follow-up questions. Use the conversation flow to drill into specific sections or compare findings.
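
Steps 3 and 4 can be scripted instead of done by hand in an editor. The sketch below is one possible approach using only the Python standard library, not a definitive cleaner: the function names, the regex patterns, and the "a line repeated three or more times is a header or footer" rule are all assumptions made for illustration, and they will need tuning per document (for example, a table row containing only a year would be treated as a page number).

```python
import re
from collections import Counter

# Illustrative patterns (assumptions); adjust them to what your PDF actually contains.
PAGE_NUMBER = re.compile(r"^\s*(?:page\s+)?\d+(?:\s+of\s+\d+)?\s*$", re.IGNORECASE)
CITATION_MARKER = re.compile(r"\s*\[\d+(?:\s*[,-]\s*\d+)*\]")


def clean_extracted_text(text: str, min_repeats: int = 3) -> str:
    """Strip page-number lines, repeated header/footer lines, and [n] markers."""
    lines = [line.strip() for line in text.splitlines()]
    # Assumption: a non-empty line that recurs min_repeats times or more
    # is a running header or footer, so every copy of it is dropped.
    counts = Counter(line for line in lines if line)
    kept = []
    for line in lines:
        if PAGE_NUMBER.match(line):
            continue
        if line and counts[line] >= min_repeats:
            continue
        kept.append(CITATION_MARKER.sub("", line))
    # Collapse runs of blank lines left behind by the removals.
    return re.sub(r"\n{3,}", "\n\n", "\n".join(kept)).strip()


def build_prompt(document_text: str, doc_type: str, question: str) -> str:
    """Wrap the cleaned text in the framing sentence from step 4."""
    return (
        f"Below is text extracted from a {doc_type}. "
        "Please answer questions about it based only on what is in the text.\n\n"
        f"{document_text}\n\n"
        f"Question: {question}"
    )
```

Typical use would be `build_prompt(clean_extracted_text(raw), "contract", "What is the notice period?")`. Skim the cleaned output before pasting it, since an over-eager pattern can silently remove real content.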

When to switch from chat to RAG

Three signals indicate that one-document-at-a-time chat is the wrong architecture. First, you are querying the same document repeatedly and re-pasting it into each conversation is becoming tedious; a Claude Project or ChatGPT Custom GPT keeps the document context persistent across conversations. Second, you are querying across many documents (a literature review, a regulatory archive, a knowledge base) where the relevant content for any given query comes from a different document; RAG with embedding-based retrieval is the right fit. Third, the documents are too long for any single context window; chunking and retrieving relevant chunks is what RAG does well.
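
To make "chunking" concrete, here is a minimal sketch that splits cleaned text into paragraph-aligned chunks with a small overlap. The function name, the word-count budget (used as a rough stand-in for tokens), and the default sizes are assumptions for illustration; real pipelines often count tokens with the model's own tokenizer instead.

```python
def chunk_text(text: str, max_words: int = 300, overlap_paragraphs: int = 1) -> list[str]:
    """Group paragraphs into chunks of roughly max_words words each.

    The last overlap_paragraphs paragraphs of a chunk are repeated at the
    start of the next one so that context is not cut off at the boundary.
    A single paragraph longer than max_words is kept whole, not split.
    """
    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
    chunks: list[str] = []
    current: list[str] = []
    current_words = 0
    for para in paragraphs:
        n_words = len(para.split())
        if current and current_words + n_words > max_words:
            chunks.append("\n\n".join(current))
            # Carry the tail of the finished chunk into the next one.
            current = current[-overlap_paragraphs:] if overlap_paragraphs else []
            current_words = sum(len(p.split()) for p in current)
        current.append(para)
        current_words += n_words
    if current:
        chunks.append("\n\n".join(current))
    return chunks
```

Splitting on paragraph boundaries rather than at a fixed character offset keeps sentences intact, which tends to matter more for answer quality than hitting an exact chunk size.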

For individual users running occasional queries, single-document chat is sufficient. For teams running production workloads or individuals with persistent reference corpora, the upgrade to Claude Projects, ChatGPT Custom GPTs, or a self-hosted RAG pipeline pays back. Match the architecture to the volume; do not pre-optimise for scale you do not have.

FAQ

Can I just drag a PDF into ChatGPT or Claude directly?
Yes. Both ChatGPT (paid tiers since late 2023) and Claude.ai (since launch) accept PDF uploads as attachments. The chat interface extracts text from the PDF and feeds it as context for your questions. For most casual use cases (summarise this paper, what does it say about X, paraphrase the introduction), drag-and-drop works well. There are two limits to know about. First, very long PDFs exceed the context window: Claude handles up to 200k tokens (roughly 500 pages of dense text); ChatGPT limits vary by model. Second, the quality of the LLM's built-in PDF parser varies: for complex multi-column or scanned PDFs, extracting the text with a dedicated tool and pasting it produces better results than direct attach.
When should I extract text manually and paste rather than upload the PDF directly?
Three situations. First, the PDF is scanned and the LLM's built-in OCR is unreliable: run OCR with a known-good tool (ScoutMyTool Make PDF Searchable, OCRmyPDF) and verify accuracy before feeding the text to the LLM. Second, the PDF has a complex layout (multi-column research paper, financial statement with tables) where the LLM's parser scrambles reading order: extract via pdfplumber or Marker into clean Markdown, then paste. Third, you want to clean up the text first by removing page headers, footers, equation residue, and citation footnotes that would distract the LLM. Manual extraction with a moment of cleanup produces noticeably better LLM responses.
How long can the document be before the LLM cannot handle it?
It depends on the model. Claude Sonnet / Opus handle up to 200k tokens (~500 pages); Claude with the 1M context extension handles up to 1M tokens (~2,500 pages). ChatGPT GPT-4 Turbo handles 128k tokens (~300 pages); GPT-4o is around the same. Gemini 1.5 Pro handles up to 2M tokens. For documents exceeding the limit, there are three options. First, use Claude 1M context if your subscription includes it. Second, chunk the document into sections and query each separately. Third, build a RAG pipeline that retrieves only relevant chunks for each query, which is the right approach for many-document or very-large-document use cases.
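
As a rough worked example of the arithmetic above: the quoted figures (200k tokens for about 500 pages) imply roughly 400 tokens per page of dense text. The sketch below uses that ratio to check which context windows a document would fit. The dictionary keys are hypothetical labels rather than real API model identifiers, the reserve for the question and answer is an assumed value, and the window sizes are simply the figures quoted in this answer, which may change.

```python
# Assumption: ~400 tokens per page, implied by "200k tokens (~500 pages)".
TOKENS_PER_PAGE = 400

# Hypothetical labels mapped to the context-window figures quoted above.
CONTEXT_WINDOWS = {
    "claude-standard": 200_000,
    "claude-1m": 1_000_000,
    "gpt-4-turbo": 128_000,
    "gemini-1.5-pro": 2_000_000,
}


def estimate_tokens(pages: int, tokens_per_page: int = TOKENS_PER_PAGE) -> int:
    """Very rough token estimate for a document of the given page count."""
    return pages * tokens_per_page


def models_that_fit(pages: int, reserve: int = 4_000) -> list[str]:
    """Return labels whose window holds the document plus a reserve
    (an assumed allowance for the question and the model's answer)."""
    needed = estimate_tokens(pages) + reserve
    return [name for name, window in CONTEXT_WINDOWS.items() if needed <= window]
```

For instance, `models_that_fit(400)` needs about 164k tokens, which rules out the 128k window but fits the others. Dense tables or heavy footnotes can push the real count well above this estimate.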
Are there privacy implications to uploading a PDF to ChatGPT or Claude?
Yes. Both ChatGPT and Claude have data-use policies that vary by account type. ChatGPT Free and Plus by default use your inputs to train future models unless you opt out; Enterprise and Team tiers do not train on your data. Claude does not train on your inputs by default for paid tiers but check the current policy. For confidential PDFs (contracts, financial statements, source materials, anything covered by NDA), use a paid tier with the train-on-input setting disabled, or use a self-hosted LLM. For research papers, public reports, and general reference content, the privacy implications are usually acceptable. Read the current policy before uploading anything sensitive.
What is RAG and when should I set one up?
Retrieval-Augmented Generation: instead of feeding the LLM an entire document and asking questions, embed chunks of the document into a vector database, retrieve the chunks most relevant to each query, and feed only those chunks to the LLM. It is useful when (a) you have many documents to query across, say a knowledge base of 500 research papers; (b) individual documents exceed context windows; (c) you query frequently and want fast responses. Setup involves an embedding service (OpenAI text-embedding-3, Voyage AI, or open-source sentence-transformers), a vector database (Pinecone, Weaviate, sqlite-vec, or pgvector), and an LLM endpoint. For one-off queries on a single document, RAG is overkill; for recurring queries on a corpus, RAG is the right architecture.
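
The retrieval step can be illustrated without any external service. The toy sketch below swaps the embedding model for plain word counts and the vector database for an in-memory list, so it shows only the shape of "embed, score, take the top chunks". The function names are hypothetical, and word-count vectors are a stand-in technique, not what a production RAG pipeline would use.

```python
import math
import re
from collections import Counter


def embed(text: str) -> Counter:
    """Toy 'embedding': lowercase word counts.
    A real pipeline would call an embedding model here instead."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[term] * b[term] for term in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def retrieve(query: str, chunks: list[str], top_k: int = 2) -> list[str]:
    """Return the top_k chunks most similar to the query."""
    query_vec = embed(query)
    ranked = sorted(chunks, key=lambda c: cosine(query_vec, embed(c)), reverse=True)
    return ranked[:top_k]
```

Only the retrieved chunks, not the whole corpus, are then placed in the prompt ahead of the question. Replacing `embed` with a real embedding call and the list scan with a vector-database query turns this outline into the setup described above.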

Citations

  1. Anthropic: Claude context windows (200K standard, 1M extended)
  2. GPT-4 (Wikipedia): GPT-4 / GPT-4 Turbo context-window figures
  3. Google: Gemini API and Google AI Studio updates (up to 2M-token context)
  4. LangChain: Retrieval-Augmented Generation (RAG) tutorial

Clean PDF text extraction in your browser

ScoutMyTool PDF to text runs client-side. Extract text without uploading the source PDF, which is useful for sensitive documents you want to query via LLM without exposing them broadly.