PDF compression: the science behind how it works

How PDF compression actually works: lossless Flate for text, lossy JPEG for photos, JBIG2/CCITT for scans, downsampling, font subsetting, and why some PDFs barely shrink.

By ScoutMyTool Editorial Team · Last updated: 2026-05-21

Introduction

For years I treated “compress PDF” as a magic button: press it, the file gets smaller, do not ask why. Then a client’s 80 MB contract refused to shrink below 78 MB no matter what I tried, while a colleague’s scanned report compressed 40-to-1, and I finally sat down to understand what actually happens inside the file. It turns out PDF compression is not one algorithm but a toolkit, and which tool applies depends entirely on what is on the page: text, vectors, photographs, or scans. This article explains the science in plain terms: the lossless and lossy methods PDFs use, why scans behave so differently from text, and why some files are already as small as they will ever get.

The compression methods inside a PDF

A PDF stores different kinds of content in separate streams, and each stream can use a different compression filter. These are the workhorses and what each is for.

| Method | Type | Applies to | Trade-off |
| --- | --- | --- | --- |
| Flate / DEFLATE | Lossless | Text, vectors, metadata, PNG-like images | No quality loss; modest ratio on already-dense data |
| JPEG (DCT) | Lossy | Color & grayscale photographs | Large savings; artifacts rise as quality drops |
| JPEG 2000 | Both | High-quality images needing fine control | Better quality per byte; less universally supported |
| JBIG2 | Both | Black-and-white scanned pages | Huge ratios on scans; lossy mode can swap glyphs |
| CCITT Group 4 | Lossless | Bitonal (1-bit) fax-style scans | Reliable for line art; only for 1-bit images |
| Image downsampling | Lossy | Over-resolution images | Big wins when DPI exceeds need; blurs if overdone |
| Font subsetting | Lossless | Embedded fonts | Keeps only used glyphs; safe, often overlooked |

How the pieces fit together

Every compression scheme works by removing one or both of two things: redundancy (patterns that repeat and can be described once) and, in lossy modes, perceptually spare detail (information the eye barely registers). Lossless methods like DEFLATE only do the first: they find repeated byte sequences, replace them with short back-references, and Huffman-code the result, which is why text and vector data, full of repetition, shrink well and reconstruct perfectly. Lossy methods like JPEG add the second: they transform an image into frequency components and discard the high-frequency detail humans see least, trading exactness for dramatic size cuts. A good PDF compressor routes each part of the document to the right method instead of applying one blunt setting to everything.
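To make the redundancy idea concrete, here is a minimal sketch using Python's standard `zlib` module, which implements DEFLATE. The sample bytes only imitate a repetitive PDF content stream; they are an assumption for illustration, not taken from a real file.

```python
import zlib

# Repetitive, text-like bytes standing in for a PDF content stream (illustrative only).
text = b"BT /F1 12 Tf 72 720 Td (Invoice line item) Tj ET\n" * 200

packed = zlib.compress(text, level=9)   # DEFLATE inside a zlib wrapper
restored = zlib.decompress(packed)

ratio = len(text) / len(packed)
print(f"original: {len(text)} bytes, compressed: {len(packed)} bytes, ratio: {ratio:.0f}:1")
print("bit-exact round trip:", restored == text)
```

The round trip is bit-exact, which is what "lossless" means here. Real documents repeat far less than this sample, so their ratios will be more modest.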

Step by step: compress a PDF intelligently

  1. Diagnose what is making it big. Before compressing, find out whether the weight is images, fonts, or bloat. Use Image Quality Analysis to see image resolutions; over-resolution images are usually the cause.
  2. Choose lossless first. Run a lossless pass that recompresses streams with Flate, subsets fonts, and strips unused objects. This shrinks the file with zero visible change, which makes it the safe default for archival and legal documents.
  3. Downsample over-resolution images. Target ~300 DPI for print and ~150 DPI for screen-only documents. Removing pixels no one will see is often the single largest win, and it does not touch the text.
  4. Apply lossy image compression by purpose. For email and web, allow JPEG re-encoding at a moderate quality. For anything where image fidelity matters, keep the quality high or skip lossy entirely. Compress with Compress PDF.
  5. Handle scans separately. For a black-and-white scan, bitonal methods (CCITT/JBIG2) beat JPEG by far. For documents where every character must be exact, avoid aggressive lossy JBIG2 symbol substitution.
  6. Verify quality after. Open the result and inspect the images and small text at 100%. If artifacts show, step the quality back up; if it still looks perfect, you may be able to push harder.
  7. Keep a master copy. Store the uncompressed original so you can re-derive a different-quality copy later; re-compressing an already-lossy file only degrades it.
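The routing logic in steps 3 to 5 can be sketched as a small decision function. This is only an illustration under stated assumptions: `ImageInfo`, `plan_image`, and the rules inside are hypothetical names and simplifications, not the behavior of any particular compressor.

```python
from dataclasses import dataclass

@dataclass
class ImageInfo:
    bits_per_pixel: int    # 1 = bitonal scan, 8 = grayscale, 24 = color
    effective_dpi: float   # resolution at the size the image is placed on the page

def plan_image(img: ImageInfo, target_dpi: float = 150, exact_text: bool = False) -> dict:
    """Pick a method and a downsampling target for one image (illustrative rules only)."""
    if img.bits_per_pixel == 1:
        # Bitonal scans: prefer a lossless method when every character must be exact.
        method = "CCITT Group 4" if exact_text else "JBIG2"
    else:
        # Photographs and color scans go to JPEG.
        method = "JPEG"
    # Downsample only when the image exceeds the target; never upsample.
    new_dpi = min(img.effective_dpi, target_dpi)
    return {"method": method, "dpi": new_dpi, "downsample": new_dpi < img.effective_dpi}

print(plan_image(ImageInfo(24, 600), target_dpi=300))
print(plan_image(ImageInfo(1, 300), target_dpi=300, exact_text=True))
```

A real tool would also weigh JPEG quality, color space, and whether an image is already well compressed; the point is only that each image gets its own decision.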

FAQ

Why are some PDFs huge and others tiny for the same number of pages?
Size is driven by content type, not page count. A 50-page text report is mostly characters and vector lines, which compress extremely well with lossless Flate, so it might be under a megabyte. A 5-page brochure full of high-resolution photographs can be tens of megabytes, because photographic pixels carry far more information than text. Scanned documents are the wild card: a scan is an image of text, so a 20-page scan can be larger than a 500-page born-digital book. When a PDF is unexpectedly large, the culprit is almost always images: too many, at too high a resolution, or stored with weak compression.
What is the difference between lossless and lossy compression?
Lossless compression rewrites data in a more compact form that can be perfectly reconstructed: nothing is discarded, so text stays crisp and vectors stay sharp. Flate/DEFLATE, used for text streams and PNG-style images, is lossless. Lossy compression throws away information the eye is least likely to miss, which is how JPEG shrinks photographs so dramatically. The trade-off is permanence: once you save a lossy image at low quality, the discarded detail is gone, and re-compressing only degrades it further. The art of PDF compression is applying lossless methods to text and vectors and reserving lossy methods for the photographic images that can spare the detail.
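The permanence of lossy compression can be seen in a toy model. The sketch below is not JPEG: it just snaps numbers to a grid, loosely analogous to the quantization JPEG applies to frequency coefficients. The sample values and step sizes are made up for illustration.

```python
def quantize(values, step):
    """Lossy: snap each value to the nearest multiple of `step`."""
    return [round(v / step) * step for v in values]

def total_error(a, b):
    """Sum of absolute differences between two equal-length sequences."""
    return sum(abs(x - y) for x, y in zip(a, b))

original = [3, 14, 27, 38, 52, 61, 77, 94]   # stand-in for pixel intensities

once = quantize(original, 10)    # first lossy save
twice = quantize(once, 16)       # re-compress the already-lossy copy, more coarsely

print("error after one pass: ", total_error(original, once))
print("error after two passes:", total_error(original, twice))
```

Nothing in `once` says what the original values were, so no later step can restore them. In this example the second, coarser pass pushes the data further from the original.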
How does compressing a scanned PDF work?
A scan is a raster image, so it uses image compression, not text compression, which is why OCR does not shrink a scan (OCR adds a text layer but the picture stays). Black-and-white scans use JBIG2 or CCITT Group 4, which are tuned for pages of dark marks on a light background and can achieve very high ratios. JBIG2 has a lossy mode that groups similar-looking symbols and reuses one image for all of them; it compresses brilliantly but has, in rare documented cases, swapped visually similar characters such as digits, so for documents where every character must be exact, prefer a lossless or conservative setting. Color scans fall back to JPEG.
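For intuition about why bitonal pages compress so well, here is a deliberately simplified run-length sketch. It is not CCITT Group 4 or JBIG2, both of which are considerably more sophisticated; it only shows how long runs of identical pixels collapse into a few numbers. The scanline is invented for the example.

```python
from itertools import groupby

def run_lengths(row):
    """Encode a row of 0/1 pixels as (pixel value, run length) pairs."""
    return [(bit, sum(1 for _ in run)) for bit, run in groupby(row)]

def expand(runs):
    """Rebuild the pixel row from its runs (lossless)."""
    return [bit for bit, n in runs for _ in range(n)]

# One 800-pixel scanline: white margin, black stroke, white gap, black stroke, white margin.
row = [0] * 300 + [1] * 12 + [0] * 40 + [1] * 9 + [0] * 439

runs = run_lengths(row)
print(len(row), "pixels ->", len(runs), "runs")
print("lossless:", expand(runs) == row)
```

A photograph has almost no long runs of identical pixels, which is one way to see why bitonal methods do not carry over to color images.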
Does downsampling images really help, and when does it hurt?
It helps a lot when an image has more resolution than its display or print use requires. A photo placed at postcard size but stored at 600 DPI carries roughly four times the pixels of the same image at 300 DPI, and far more than a screen at ~150 DPI can show. Downsampling to the resolution actually needed removes pixels no one will see, often the single biggest size win. It hurts when you downsample below the output resolution: print at 72 DPI looks blocky, and zoomable detail is lost. The rule of thumb: ~300 DPI for print, ~150 DPI for screen, and never upsample.
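The "four times the pixels" figure follows from resolution applying in both dimensions. A quick check, assuming a 6 x 4 inch postcard (the dimensions are an assumption for the example):

```python
def pixel_count(width_in: float, height_in: float, dpi: float) -> int:
    """Pixels needed to cover a placed image of the given size at the given DPI."""
    return round(width_in * dpi) * round(height_in * dpi)

postcard = (6.0, 4.0)   # inches; assumed size for illustration

p600 = pixel_count(*postcard, 600)
p300 = pixel_count(*postcard, 300)
p150 = pixel_count(*postcard, 150)

print(f"600 DPI: {p600:,} px, 300 DPI: {p300:,} px, 150 DPI: {p150:,} px")
print("600 vs 300:", p600 / p300, "x;", "600 vs 150:", p600 / p150, "x")
```

Halving the DPI quarters the pixel count, so going from 600 to 150 DPI leaves one sixteenth of the raw pixels before any JPEG encoding is applied.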
Why did my PDF barely shrink when I compressed it?
Usually one of three reasons. First, it is already efficiently compressed: re-running compression on an optimized text PDF or already-JPEG images yields little, because the redundancy is gone. Second, it is mostly text and vectors, which are already small; there is simply not much to remove without going lossy on things that should stay lossless. Third, the images are already at low resolution, so downsampling has nothing to trim. Compression is not magic: it removes redundancy and visually spare detail. If neither is present in quantity, the file is close to its floor and a tiny reduction is the honest result.
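The first reason is easy to reproduce. In this sketch, pseudo-random word salad stands in for a document's text (the word list and seed are arbitrary assumptions). The first DEFLATE pass removes the redundancy; a second pass over the already-compressed bytes has almost nothing left to find.

```python
import random
import zlib

rng = random.Random(0)   # fixed seed so the run is repeatable
words = [b"invoice", b"total", b"clause", b"party", b"shall",
         b"agreement", b"date", b"amount", b"section", b"notice"]
text = b" ".join(rng.choice(words) for _ in range(20000))

first = zlib.compress(text, level=9)     # removes the redundancy
second = zlib.compress(first, level=9)   # tries to compress the compressed bytes

print(f"original:    {len(text)} bytes")
print(f"first pass:  {len(first)} bytes")
print(f"second pass: {len(second)} bytes")
```

The second pass typically lands within a few percent of the first, and can even come out slightly larger because of format overhead. An 80 MB file that is mostly well-compressed images behaves the same way.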
Will compressing a PDF reduce its quality every time?
Only if it applies lossy methods. Pure lossless optimization (recompressing streams with Flate, subsetting fonts, removing unused objects and duplicate resources) reduces size with zero visible change. Quality loss enters only when images are re-encoded with lossy JPEG at a lower quality or downsampled. Good compressors let you choose: a lossless or high-quality pass for archival and legal documents, a more aggressive lossy pass for email and web. So the answer depends on the settings: compression and quality loss are related but not the same thing, and you control the balance.
Is it safe to compress a confidential PDF with an online tool?
Only if the tool processes the file locally in your browser. Many online compressors upload your document to a server to process it, which is a problem for contracts, medical records, or anything sensitive. ScoutMyTool compresses entirely client-side in your browser tab, so the file never leaves your machine while being optimized. For any document you would not post publicly, confirm the tool does not upload before using it, or use an offline desktop application.

Citations

  1. IETF RFC 1951, “DEFLATE Compressed Data Format Specification,” the lossless algorithm behind PDF’s Flate filter. datatracker.ietf.org/doc/html/rfc1951
  2. Wikipedia, “JPEG” (ISO/IEC 10918), the lossy DCT-based method used for photographic images in PDFs. en.wikipedia.org/wiki/JPEG
  3. Wikipedia, “JBIG2,” the bitonal scan-compression method, including the documented symbol-substitution risk in lossy mode. en.wikipedia.org/wiki/JBIG2

Put the science to work

ScoutMyTool’s compressor downsamples images, subsets fonts, and recompresses streams entirely in your browser tab, so your file never leaves your machine while it shrinks.
