PDF compression: the science behind how it works
By ScoutMyTool Editorial Team · Last updated: 2026-05-21
Introduction
For years I treated "compress PDF" as a magic button: press it, the file gets smaller, do not ask why. Then a client's 80 MB contract refused to shrink below 78 MB no matter what I tried, and a colleague's scanned report compressed 40-to-1, and I finally sat down to understand what is actually happening inside the file. It turns out PDF compression is not one algorithm but a toolkit, and which tool applies depends entirely on what is on the page: text, vectors, photographs, or scans. This article explains the science in plain terms: the lossless and lossy methods PDFs use, why scans behave so differently from text, and why some files are already as small as they will ever get.
The compression methods inside a PDF
A PDF stores different kinds of content in separate streams, and each stream can use a different compression filter. These are the workhorses and what each is for.
| Method | Type | Applies to | Trade-off |
|---|---|---|---|
| Flate / DEFLATE | Lossless | Text, vectors, metadata, PNG-like images | No quality loss; modest ratio on already-dense data |
| JPEG (DCT) | Lossy | Color & grayscale photographs | Large savings; artifacts rise as quality drops |
| JPEG 2000 | Both | High-quality images needing fine control | Better quality per byte; less universally supported |
| JBIG2 | Both | Black-and-white scanned pages | Huge ratios on scans; lossy mode can swap glyphs |
| CCITT Group 4 | Lossless | Bitonal (1-bit) fax-style scans | Reliable for line art; only for 1-bit images |
| Image downsampling | Lossy | Over-resolution images | Big wins when DPI exceeds need; blurs if overdone |
| Font subsetting | Lossless | Embedded fonts | Keeps only used glyphs; safe, often overlooked |
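Inside the file, the first five methods in the table show up as named stream filters: `FlateDecode`, `DCTDecode`, `JPXDecode`, `JBIG2Decode`, and `CCITTFaxDecode`. As a rough sketch, the snippet below counts those names in the raw bytes of a PDF to hint at which methods a file relies on. The helper name `count_filters` and the sample bytes are illustrative assumptions, and the scan is deliberately crude: it cannot see filter names stored inside compressed object streams, so treat the counts as a hint rather than an inventory.

```python
import re
from collections import Counter

# Filter names as they appear in PDF stream dictionaries, mapped to the
# methods in the table above. CCITTFaxDecode covers Group 3 as well as Group 4.
PDF_FILTERS = {
    b"FlateDecode": "Flate / DEFLATE",
    b"DCTDecode": "JPEG (DCT)",
    b"JPXDecode": "JPEG 2000",
    b"JBIG2Decode": "JBIG2",
    b"CCITTFaxDecode": "CCITT fax (Group 3 or 4)",
}

_FILTER_NAME = re.compile(
    rb"/(FlateDecode|DCTDecode|JPXDecode|JBIG2Decode|CCITTFaxDecode)\b"
)


def count_filters(pdf_bytes: bytes) -> Counter:
    """Count filter-name occurrences in the raw bytes of a PDF (crude scan)."""
    counts = Counter()
    for match in _FILTER_NAME.finditer(pdf_bytes):
        counts[PDF_FILTERS[match.group(1)]] += 1
    return counts


# Hypothetical fragment standing in for a real file's stream dictionaries.
sample = (
    b"<< /Length 120 /Filter /FlateDecode >>\n"
    b"<< /Subtype /Image /Filter /DCTDecode >>\n"
    b"<< /Filter [/FlateDecode] >>"
)
print(count_filters(sample))
```

For a real document, pass the bytes read from the file in binary mode instead of `sample`. A file dominated by `DCTDecode` entries is mostly photographs; one dominated by `FlateDecode` is mostly text, vectors, or losslessly stored images.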
How the pieces fit together
Every compression scheme works by removing one or both of two things: redundancy (patterns that repeat and can be described once) and, in lossy modes, perceptually spare detail (information the eye barely registers). Lossless methods like DEFLATE only do the first: they find repeated byte sequences and replace them with short back-references, which is why text and vector data, full of repetition, shrink well and reconstruct perfectly. Lossy methods like JPEG add the second: they transform an image into frequency components and discard the high-frequency detail humans see least, trading exactness for dramatic size cuts. A good PDF compressor routes each part of the document to the right method instead of applying one blunt setting to everything.
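The redundancy half of that picture is easy to see with Python's standard `zlib` module, which implements DEFLATE inside the zlib wrapper. This is a minimal sketch rather than a PDF tool: the sample sentence is made up, and random bytes are used as a stand-in for data that is already dense, such as the payload of a JPEG.

```python
import os
import zlib


def deflate_ratio(data: bytes) -> float:
    """Return original size divided by DEFLATE-compressed size."""
    return len(data) / len(zlib.compress(data, 9))


# Highly redundant input: one sentence repeated many times, like boilerplate text.
repetitive = b"The party of the first part agrees to the terms. " * 2000

# Input with no exploitable patterns: random bytes stand in for data that
# has already been compressed.
dense = os.urandom(len(repetitive))

# Lossless means the round trip is exact, byte for byte.
assert zlib.decompress(zlib.compress(repetitive, 9)) == repetitive

print(f"repetitive text: {deflate_ratio(repetitive):.0f}:1")
print(f"random bytes:    {deflate_ratio(dense):.2f}:1")
```

The repeated text collapses to a small fraction of its size, while the random bytes come out at roughly 1:1 or marginally larger, because there is no repetition for the back-references to exploit.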
Step by step: compress a PDF intelligently
- Diagnose what is making it big. Before compressing, find out whether the weight is images, fonts, or bloat. Use Image Quality Analysis to see image resolutions; over-resolution images are usually the cause.
- Choose lossless first. Run a lossless pass that recompresses streams with Flate, subsets fonts, and strips unused objects. This shrinks the file with zero visible change, which makes it the safe default for archival and legal documents.
- Downsample over-resolution images. Target ~300 DPI for print and ~150 DPI for screen-only documents. Removing pixels no one will see is often the single largest win, and it does not touch the text.
- Apply lossy image compression by purpose. For email and web, allow JPEG re-encoding at a moderate quality. For anything where image fidelity matters, keep the quality high or skip lossy entirely. Compress with Compress PDF.
- Handle scans separately. For a black-and-white scan, bitonal methods (CCITT/JBIG2) beat JPEG by far. For documents where every character must be exact, avoid aggressive lossy JBIG2 symbol substitution.
- Verify quality after. Open the result and inspect the images and small text at 100%. If artifacts show, step the quality back up; if it still looks perfect, you may be able to push harder.
- Keep a master copy. Store the uncompressed original so you can re-derive a different-quality copy later; re-compressing an already-lossy file only degrades it.
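The downsampling step comes down to simple arithmetic: an image's effective resolution is its pixel width divided by the width it occupies on the page. The sketch below works that out and proposes new pixel dimensions for a target DPI. The function names and the 6 x 4 inch photo are assumptions chosen for illustration, not part of any particular tool.

```python
from typing import Optional, Tuple


def effective_dpi(pixels_wide: int, placed_width_in: float) -> float:
    """Resolution an image actually has at the size it is placed on the page."""
    return pixels_wide / placed_width_in


def plan_downsample(pixels_wide: int, pixels_high: int,
                    placed_width_in: float,
                    target_dpi: int) -> Optional[Tuple[int, int]]:
    """Return new pixel dimensions, or None if already at or below the target."""
    current = effective_dpi(pixels_wide, placed_width_in)
    if current <= target_dpi:
        return None  # never upsample
    scale = target_dpi / current
    return round(pixels_wide * scale), round(pixels_high * scale)


# A 6 x 4 inch photo stored at 600 DPI is 3600 x 2400 pixels.
photo = (3600, 2400)
for_print = plan_downsample(*photo, placed_width_in=6.0, target_dpi=300)
for_screen = plan_downsample(*photo, placed_width_in=6.0, target_dpi=150)

print("print:", for_print)
print("screen:", for_screen)
```

Halving the DPI halves both dimensions, so the print copy carries a quarter of the original pixels and the screen copy a sixteenth, before any JPEG re-encoding is applied.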
Related reading and tools
- Compress a PDF: the practical how-to companion to this explainer.
- Compress photo-heavy PDFs: where lossy image compression earns its keep.
- Sharing without losing quality: balancing size against fidelity.
- Embedding and subsetting fonts: the overlooked lossless saving.
- Extract images from a PDF: inspecting what is actually inside.
- Compress PDF tool: apply all of this in your browser.
- All ScoutMyTool PDF tools: the full toolkit.
FAQ
- Why are some PDFs huge and others tiny for the same number of pages?
- Size is driven by content type, not page count. A 50-page text report is mostly characters and vector lines, which compress extremely well with lossless Flate, so it might be under a megabyte. A 5-page brochure full of high-resolution photographs can be tens of megabytes, because photographic pixels carry far more information than text. Scanned documents are the wild card: a scan is an image of text, so a 20-page scan can be larger than a 500-page born-digital book. When a PDF is unexpectedly large, the culprit is almost always images: either too many, at too high a resolution, or stored with weak compression.
- What is the difference between lossless and lossy compression?
- Lossless compression rewrites data in a more compact form that can be perfectly reconstructed: nothing is discarded, so text stays crisp and vectors stay sharp. Flate/DEFLATE, used for text streams and PNG-style images, is lossless. Lossy compression throws away information the eye is least likely to miss, which is how JPEG shrinks photographs so dramatically. The trade-off is permanence: once you save a lossy image at low quality, the discarded detail is gone, and re-compressing only degrades it further. The art of PDF compression is applying lossless methods to text and vectors and reserving lossy methods for the photographic images that can spare the detail.
- How does compressing a scanned PDF work?
- A scan is a raster image, so it uses image compression, not text compression, which is why OCR does not shrink a scan (OCR adds a text layer but the picture stays). Black-and-white scans use JBIG2 or CCITT Group 4, which are tuned for pages of dark marks on a light background and can achieve very high ratios. JBIG2 has a lossy mode that groups similar-looking symbols and reuses one image for all of them; it compresses brilliantly but has, in rare documented cases, swapped visually similar characters such as digits. For documents where every character must be exact, prefer a lossless or conservative setting. Color scans fall back to JPEG.
- Does downsampling images really help, and when does it hurt?
- It helps a lot when an image has more resolution than its display or print use requires. A photo placed at postcard size but stored at 600 DPI carries roughly four times the pixels of the same image at 300 DPI, and far more than a screen at ~150 DPI can show. Downsampling to the resolution actually needed removes pixels no one will see, often the single biggest size win. It hurts when you downsample below the output resolution: print at 72 DPI looks blocky, and zoomable detail is lost. The rule of thumb: ~300 DPI for print, ~150 DPI for screen, and never upsample.
- Why did my PDF barely shrink when I compressed it?
- Usually one of three reasons. First, it is already efficiently compressed: re-running compression on an optimized text PDF or already-JPEG images yields little, because the redundancy is gone. Second, it is mostly text and vectors, which are already small; there is simply not much to remove without going lossy on things that should stay lossless. Third, the images are already at low resolution, so downsampling has nothing to trim. Compression is not magic: it removes redundancy and visually spare detail. If neither is present in quantity, the file is close to its floor and a tiny reduction is the honest result.
- Will compressing a PDF reduce its quality every time?
- Only if it applies lossy methods. Pure lossless optimization (recompressing streams with Flate, subsetting fonts, removing unused objects and duplicate resources) reduces size with zero visible change. Quality loss enters only when images are re-encoded with lossy JPEG at a lower quality or downsampled. Good compressors let you choose: a lossless or high-quality pass for archival and legal documents, a more aggressive lossy pass for email and web. So the answer depends on the settings: compression and quality loss are related but not the same thing, and you control the balance.
- Is it safe to compress a confidential PDF with an online tool?
- Only if the tool processes the file locally in your browser. Many online compressors upload your document to a server to process it, which is a problem for contracts, medical records, or anything sensitive. ScoutMyTool compresses entirely client-side in your browser tab, so the file never leaves your machine while being optimized. For any document you would not post publicly, confirm the tool does not upload before using it, or use an offline desktop application.
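Two of the answers above, that an already-compressed file barely shrinks and that lossy detail cannot be recovered, can be checked in a few lines. This is a toy sketch under stated assumptions: the sample text is generated from a made-up vocabulary, and `quantize` is a simple rounding step standing in for the quantization inside JPEG, not JPEG itself.

```python
import random
import zlib

# Varied but compressible text built from a small vocabulary (hypothetical data).
rng = random.Random(0)
vocabulary = [f"word{n}" for n in range(300)]
text = " ".join(rng.choice(vocabulary) for _ in range(20000)).encode("ascii")

# The first pass removes the redundancy; the second pass has little left to find.
first_pass = zlib.compress(text, 9)
second_pass = zlib.compress(first_pass, 9)
print(len(text), len(first_pass), len(second_pass))


def quantize(samples, step):
    """Snap each sample to the nearest multiple of `step` (a toy lossy step)."""
    return [((s + step // 2) // step) * step for s in samples]


def total_error(a, b):
    """Sum of absolute differences between two equal-length sample lists."""
    return sum(abs(x - y) for x, y in zip(a, b))


original = [12, 13, 15, 200, 201, 199, 90, 94]
once = quantize(original, 8)    # first lossy save
twice = quantize(once, 12)      # re-saved later with a different setting

print(once, total_error(original, once))
print(twice, total_error(original, twice))
```

The second `zlib` pass lands close to the size of the first, which is the "floor" described above. In the quantization half, the first save already differs from the original, and the second save drifts further from it, which is why keeping an uncompressed master matters.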
Citations
- IETF RFC 1951, "DEFLATE Compressed Data Format Specification," the lossless algorithm behind PDF's Flate filter. datatracker.ietf.org/doc/html/rfc1951
- Wikipedia, "JPEG" (ISO/IEC 10918), the lossy DCT-based method used for photographic images in PDFs. en.wikipedia.org/wiki/JPEG
- Wikipedia, "JBIG2," the bitonal scan-compression method, including the documented symbol-substitution risk in lossy mode. en.wikipedia.org/wiki/JBIG2
Put the science to work
ScoutMyTool's compressor downsamples images, subsets fonts, and recompresses streams entirely in your browser tab, so your file never leaves your machine while it shrinks.
Open Compress PDF →