PDF redaction guide — properly remove sensitive text (2026)
By ScoutMyTool Editorial Team · Last updated: 2026-05-20
Introduction
Every couple of years a major redaction failure makes the news — a court filing with black rectangles over the names that turned out to be copy-pasteable underneath; a government document where the redacted text could be revealed by selecting through the black overlay; a research paper where the patient identifiers had been "redacted" with a highlight rather than removed. Every one of those failures shares the same root cause: someone drew a rectangle and assumed the words underneath were gone. They were not. This article is the practical version of what redaction actually has to do, the seven failures that catch people the first time, the legal frameworks (HIPAA Safe Harbor, FOIA, NIST PII) that drive the requirement, and how to verify a redaction is genuinely permanent before shipping the file.
Seven ways "redaction" fails
Almost every redaction-gone-wrong falls into one of these seven patterns. Recognising them is the first step; the second is using a tool that does not enable them.
| Mistake | What it looks like | Why it fails |
|---|---|---|
| Drawing a black rectangle over the text | Using a draw / shape tool to paint a black box over the sensitive content. | The text underneath is still in the PDF content stream. Any reader can select the obscured text and copy-paste it out. The rectangle is a visual overlay, not a deletion. |
| Highlighting the text in black | Using the highlighter tool with a dark colour to "redact" the text. | Highlighter creates an annotation layer on top of the text — the text itself is unchanged. Removing or hiding annotations reveals the original content. |
| Changing the font colour to match the background | Setting the text colour to white on a white page so it appears invisible. | The text remains in the content stream and can be selected, copied, and extracted via pdftotext or any PDF parser regardless of colour. |
| Adding a sticker / image over the text | Pasting an image (a black rectangle, a "REDACTED" graphic) on top of the text. | Same as black rectangle — the text is still there underneath. Anyone who removes or repositions the image sees the original content. |
| Hiding text by moving it off-page | Repositioning text outside the visible page area but keeping it in the file. | Off-page text is still in the content stream. PDF readers and parsers can extract it regardless of its display coordinates. |
| Print-then-scan as redaction | Drawing on a printout with a marker, then scanning to PDF. | Marginally safer than a rectangle overlay because the result is a raster image with no text layer. But marker ink is often translucent: if the text shows through in the scan, a reader can still make it out, and running OCR on the scan may recover the words as selectable text. |
| Forgetting metadata | Redacting page content but leaving document metadata (author, title, keywords, custom properties) intact. | Metadata can contain author names, organisation names, software versions, edit history, and original file paths that leak the same information you redacted from the visible content. |
What real redaction actually does
A PDF page is, technically, two things: a content stream (the sequence of glyph positions, vector graphics, and image references that paints the page) and an annotation layer (highlights, comments, signatures, and other overlays drawn on top of the content stream).[1] A draw-a-rectangle "redaction" only adds a shape, either as an annotation or as extra drawing operators placed after the text, and leaves the operators that draw the text untouched. Any PDF parser, reader, or text-extraction tool walks the content stream and sees the original text exactly as it was.
A proper redaction tool does two things, in order:
1. Delete the targeted content from the page content stream. The glyphs are removed, not hidden. The PDF object structure no longer carries the sensitive text in any form.
2. Paint a solid-colour rectangle over the now-empty area. This is the visual indicator that a redaction occurred: a black bar where the text used to be.
The combination is what makes the redaction permanent. Without step 1, the visible bar is theatre. Without step 2, the page has unexplained blank spaces and the reader cannot tell what was removed. ScoutMyTool's Permanent Redact PDF tool implements both steps. Adobe Acrobat Pro's Redact tool does the same. "Black rectangle" tools in non-redaction-aware PDF editors typically do not.
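The difference between an overlay and a real redaction can be sketched on a toy content stream. This is a minimal illustration, not how any particular tool is implemented: the function names (`extract_text`, `paint_box`, `redact`) and the sample page are made up for this example, and the regex only understands the simplest form of the `Tj` text operator. Real PDFs also use hex strings, `TJ` arrays, custom font encodings, and compressed streams, so a production tool needs a real PDF parser.

```python
import re

# Matches a literal-string text-showing operator such as "(Jane Roe) Tj".
# Toy pattern for illustration only; see the caveats in the text above.
SHOW_TEXT = re.compile(rb"\((?P<text>(?:\\.|[^\\()])*)\)\s*Tj")


def extract_text(stream: bytes) -> list[str]:
    """Return every string drawn by a Tj operator, as a text extractor would."""
    return [m.group("text").decode("latin-1") for m in SHOW_TEXT.finditer(stream)]


def paint_box(stream: bytes, x: int, y: int, w: int, h: int) -> bytes:
    """Append operators that fill a black rectangle; existing operators stay."""
    return stream + f"\n0 0 0 rg {x} {y} {w} {h} re f".encode("ascii")


def redact(stream: bytes, secret: str, x: int, y: int, w: int, h: int) -> bytes:
    """Step 1: drop text operators containing `secret`. Step 2: paint the box."""
    needle = secret.encode("latin-1")
    cleaned = SHOW_TEXT.sub(
        lambda m: b"" if needle in m.group("text") else m.group(0), stream
    )
    return paint_box(cleaned, x, y, w, h)


# Hypothetical one-line page: select a font, move to a position, draw a string.
page = b"BT /F1 12 Tf 72 700 Td (Patient: Jane Roe) Tj ET"

overlaid = paint_box(page, 70, 695, 200, 16)
print(extract_text(overlaid))   # the name is still extractable

redacted = redact(page, "Jane Roe", 70, 695, 200, 16)
print(extract_text(redacted))   # no text operators remain
```

Both outputs would look identical on screen, since each ends with the same black rectangle. Only the second has lost the string that a parser could read.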
Metadata — the redaction blind spot
Even a perfect content-stream redaction leaks information if the document metadata still carries the original details. Five common metadata channels to scrub:
- Document properties — Title, Author, Subject, Keywords, Producer. Often carry a real name or organisation.
- XMP metadata — Extended properties (some PDFs carry edit history, application identifiers, custom fields). Less visible than document properties, sometimes more sensitive.
- Embedded files — Attached source documents that may carry the unredacted original. ZUGFeRD invoices, draft Word files, Excel exports.
- Form-field default values — AcroForm fields can carry default values that are not visible on the page but persist in the file.
- JavaScript and named destinations — Scripts and outline entries can reference original section titles or carry sensitive strings.
NIST Special Publication 800-122 lists metadata as a primary leakage vector for PII in document handling.[2] Any redaction workflow that does not include a metadata scrub is incomplete by definition.
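As a rough sketch of the first channel, the snippet below scans the text report that `pdfinfo` prints and flags document-property fields that still carry a value. The function name `leftover_metadata` and the sample report are hypothetical, and `Creator` is added here as an assumption alongside the fields listed above. The check covers document properties only; XMP, embedded files, form defaults, and scripts need their own inspection.

```python
SENSITIVE_FIELDS = ("Title", "Author", "Subject", "Keywords", "Creator", "Producer")


def leftover_metadata(pdfinfo_output: str) -> dict[str, str]:
    """Return the sensitive fields that still have a non-empty value."""
    found = {}
    for line in pdfinfo_output.splitlines():
        # Each report line has the shape "Key:   value"; split on the first colon.
        key, sep, value = line.partition(":")
        if sep and key.strip() in SENSITIVE_FIELDS and value.strip():
            found[key.strip()] = value.strip()
    return found


# Hypothetical report text, made up for this example.
sample = """\
Title:          Q3 settlement draft
Author:         j.roe
Producer:       ExampleWriter 1.0
Pages:          4
Encrypted:      no
"""

for field_name, value in leftover_metadata(sample).items():
    print(f"still present: {field_name} = {value}")
```

An empty result only means these specific fields are blank, not that the file is free of metadata.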
Six verification checks before shipping
Once a PDF is redacted, run these checks before releasing it. Catching a failure here costs you ten minutes; catching it after the file is public costs you considerably more.
| Check | Why | How |
|---|---|---|
| Try to select the redacted area | A proper redaction leaves no selectable text in the redacted region. | Open the PDF, drag-select across the redacted area. If text highlights and copies to clipboard, the redaction failed. |
| Run pdftotext on the file | pdftotext extracts the text from the page content streams regardless of visual styling. | Run pdftotext redacted.pdf - \| grep -i "<sensitive_term>" and confirm it returns zero matches. |
| Inspect the document metadata | Author, title, subject, keywords, and producer fields commonly leak information unrelated to the visible page. | Open File → Properties in any PDF viewer; or pdfinfo redacted.pdf from the command line. Verify all fields are scrubbed or sanitised. |
| Check for embedded files | PDFs can carry embedded files (CSVs, source documents) as attachments, which often contain the unredacted source. | In Acrobat Reader: View → Show/Hide → Navigation Panes → Attachments. Or: pdfdetach -list redacted.pdf to enumerate attachments. |
| Check for hidden layers and form fields | Optional content (layers) and AcroForm field default values can carry information that is not visible by default. | View → Show/Hide → Navigation Panes → Layers, and Forms → Edit. Strip or audit any unexpected entries. |
| Verify JavaScript and named destinations | JavaScript actions can reveal hidden content or carry sensitive strings; named destinations can hint at original section names. | Acrobat Pro: Tools → JavaScript → Edit All JavaScripts. For a free check, Poppler's pdfinfo reports whether the file contains JavaScript; pdfinfo -js prints the scripts and pdfinfo -dests lists the named destinations. Otherwise inspect manually with a PDF object viewer. |
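The pdftotext check can also be scripted. The sketch below is one possible approach, with hypothetical helper names (`pdftotext_dump`, `find_leaks`): it compares terms after stripping case, whitespace, and punctuation, so a name split across a line break or an SSN typed with spaces is still caught, which a plain grep would miss. It assumes Poppler's `pdftotext` is on the PATH, and very short terms may produce false positives.

```python
import re
import subprocess


def pdftotext_dump(path: str) -> str:
    """Run pdftotext and return the extracted text (needs pdftotext on PATH)."""
    result = subprocess.run(
        ["pdftotext", path, "-"], capture_output=True, text=True, check=True
    )
    return result.stdout


def _squash(s: str) -> str:
    """Case-fold and drop everything that is not a letter or digit."""
    return re.sub(r"[\W_]+", "", s.casefold())


def find_leaks(extracted: str, terms: list[str]) -> list[str]:
    """Return the terms that still appear, ignoring case, spacing, punctuation."""
    haystack = _squash(extracted)
    return [t for t in terms if _squash(t) and _squash(t) in haystack]


# Made-up extracted text; in practice this would be pdftotext_dump("redacted.pdf").
text = "Claim filed by JANE\nROE, SSN 123 45 6789."
print(find_leaks(text, ["Jane Roe", "123-45-6789", "Acme Corp"]))
```

A clean result here covers extractable page text only; the metadata, attachment, layer, and JavaScript checks in the table still apply.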
HIPAA Safe Harbor — the 18 identifiers
For medical records released for research, analytics, or aggregated reporting, the HIPAA Privacy Rule's "Safe Harbor" method (45 CFR § 164.514(b)(2)) lists 18 specific identifiers of the individual, and of the individual's relatives, employers, or household members, that must be removed before the document is considered de-identified:[3]
- Names
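- Geographic subdivisions smaller than a state (street address, city, county, ZIP code; the first three ZIP digits may be kept only where the area they cover has more than 20,000 residents)
- Dates more precise than year that relate directly to an individual (birth, admission, discharge, death), plus all ages over 89 and any date element, including the year, that reveals such an age (these may be grouped as "90 or older")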
- Telephone numbers
- Fax numbers
- Email addresses
- Social Security Numbers
- Medical Record Numbers
- Health plan beneficiary numbers
- Account numbers
- Certificate or license numbers
- Vehicle identifiers and serial numbers
- Device identifiers and serial numbers
- Web URLs
- IP addresses
- Biometric identifiers (fingerprints, voiceprints)
- Full-face photographs and comparable images
- Any other unique identifying number, characteristic, or code
ScoutMyTool's HIPAA Redact tool runs pattern-based redaction tuned for these 18 categories. The "comparable images" and "any other unique identifying" categories require human review — no automated tool catches every variation.
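To give a feel for what pattern-based detection involves, here is a minimal sketch covering a few of the machine-recognisable categories. The patterns, category names, and sample text are illustrative assumptions, not the rules any particular tool uses; they are tuned to common US formats and will both miss variants and flag harmless strings, so a human still reviews the hits.

```python
import re

# Illustrative patterns only; each covers one common format of its category.
PATTERNS = {
    "ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "phone_or_fax": re.compile(
        r"(?<!\d)(?:\(\d{3}\)\s?|\d{3}[-.\s])\d{3}[-.\s]\d{4}(?!\d)"
    ),
    "email": re.compile(r"\b[\w.+-]+@[\w-]+(?:\.[\w-]+)+\b"),
    "url": re.compile(r"\bhttps?://\S+"),
    "ipv4": re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b"),
    "full_date": re.compile(r"\b\d{1,2}/\d{1,2}/\d{4}\b|\b\d{4}-\d{2}-\d{2}\b"),
}


def scan(text: str) -> list[tuple[str, str]]:
    """Return (category, matched text) pairs in the order they appear."""
    hits = []
    for category, pattern in PATTERNS.items():
        hits.extend((m.start(), category, m.group()) for m in pattern.finditer(text))
    return [(category, value) for _, category, value in sorted(hits)]


# Made-up note for demonstration.
note = "Seen 03/14/2024. Call (555) 010-4477 or mail jroe@example.org. SSN 123-45-6789."
for category, value in scan(note):
    print(f"{category:13} {value}")
```

Names, free-text addresses, and photographs have no reliable pattern, which is why they fall to manual review.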
A safe redaction workflow — six steps
- Make a working copy. Never redact the original. The original is your last line of defence if the redaction process produces an error and you need to start over.
- Mark the sensitive content. Use ScoutMyTool's Redact PDF tool to identify the text or regions to redact. For pattern-driven workflows, use Redact by Pattern; for HIPAA-specific workflows, use HIPAA Redact.
- Apply permanent redaction. The tool removes the targeted glyphs from the content stream and paints solid rectangles over the now-empty areas. The output PDF is your candidate for release.
- Scrub metadata. Open the document properties; clear or sanitise Title, Author, Subject, Keywords, and Producer. Remove any embedded files, JavaScript, or hidden layers that you did not intentionally include.
- Flatten the output. Run PDF Form Flatten to collapse all annotation and form layers into the page content stream, which removes any residual annotation-layer artefacts.
- Run the verification checks. Try to drag-select across redacted areas, run pdftotext and grep for sensitive terms, inspect metadata, check for embedded files. Only release the file after all checks pass.
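The final step can be treated as a release gate. The sketch below is one hypothetical way to record the findings from each check and refuse release if any of them failed; the `Evidence` class, its field names, and the messages are invented for this example, and the findings themselves would come from the manual and command-line checks described earlier.

```python
from dataclasses import dataclass, field


@dataclass
class Evidence:
    """Findings for one candidate file, gathered by hand or by other tools."""
    selectable_text_under_redaction: bool = False
    leaked_terms: list[str] = field(default_factory=list)
    leftover_metadata: dict[str, str] = field(default_factory=dict)
    attachments: list[str] = field(default_factory=list)
    has_javascript: bool = False


def blocking_issues(e: Evidence) -> list[str]:
    """Return one message per failed check; empty means no recorded failure."""
    issues = []
    if e.selectable_text_under_redaction:
        issues.append("text is still selectable under a redaction")
    if e.leaked_terms:
        issues.append("sensitive terms found in extracted text: " + ", ".join(e.leaked_terms))
    if e.leftover_metadata:
        issues.append("metadata not scrubbed: " + ", ".join(sorted(e.leftover_metadata)))
    if e.attachments:
        issues.append("embedded files present: " + ", ".join(e.attachments))
    if e.has_javascript:
        issues.append("document contains JavaScript")
    return issues


# Made-up findings: the page text is clean, but two other channels are not.
candidate = Evidence(leftover_metadata={"Author": "j.roe"}, attachments=["draft.docx"])
for issue in blocking_issues(candidate):
    print("BLOCKED:", issue)
```

Keeping the findings in one record makes it harder to skip a check silently, because every field has to be filled in or left at its passing default on purpose.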
Related ScoutMyTool tools and articles
- Redact PDF — interactive redaction tool.
- Permanent Redact PDF — explicit permanent redaction with content-stream removal.
- Redact by Pattern — regex-based bulk redaction.
- HIPAA Redact — tuned for the 18 Safe Harbor identifiers.
- PDF Metadata — inspect and scrub document metadata.
- PDF Form Flatten — collapse annotations and forms into static content.
- Protect PDF — password-protect the redacted output before sharing.
- Bates numbering for legal PDFs — for productions where redaction precedes Bates stamping.
- PDF/A conversion — for archiving the redacted output.
Frequently asked questions
- Why is drawing a black rectangle over text NOT a real redaction?
- Because the rectangle and the text are separate objects in the file. When you draw a rectangle, you add a new shape on top of the text, either as an annotation or as extra drawing operators, but the operators that draw the text remain in the content stream. A reader can drag-select through the rectangle, the text highlights, and copy-paste extracts the original content. The same applies to highlight-as-redaction, image-over-text, and white-font-on-white-page. The rectangle is a curtain in front of the words; the words are still on the stage. Proper redaction permanently removes the underlying text from the content stream, then paints the now-empty area black.
- What does proper PDF redaction actually do?
- Two things, in order. First, the redaction tool removes the targeted content from the PDF content stream — the glyphs are deleted, not hidden. Second, the tool paints a solid-colour rectangle (typically black) over the now-empty area to indicate that a redaction occurred. The result: no selectable text, no extractable text, no recoverable content through any PDF parser. ScoutMyTool's Redact PDF and Permanent Redact PDF tools implement this two-step process; Adobe Acrobat Pro's Redact tool does the same. The key word for a real redaction is "permanent" — if the tool description says "draw" or "overlay" or "annotation", it is not a real redaction.
- What sensitive content commonly hides in PDF metadata?
- A surprising amount. The Title, Author, Subject, Keywords, and Producer fields can contain a real name, an organisation name, a software version, a project code, or a file path that gives away who created the document and on which machine. The Creation Date and Modification Date timestamps can place a document at a specific time. Custom properties (set by the original authoring tool) can carry anything the author embedded. PDFs from Word can carry the Word "comments" and "tracked changes" history; PDFs from InDesign can carry layer names and link paths. Every redaction workflow must include a metadata scrub.
- How does pattern-based redaction work?
- You provide a regular expression or a pattern mask, and the tool finds every match across the document and redacts each occurrence. For example, the Social Security Number mask "###-##-####" corresponds to the regex "\d{3}-\d{2}-\d{4}", a credit card mask might be "#### #### #### ####", and a simple email regex is "[a-z0-9._%+-]+@[a-z0-9.-]+\.[a-z]{2,}" matched case-insensitively. This is useful for documents with predictable PII formats: a discovery production with thousands of pages of customer records, an HR document with embedded employee IDs, a medical record with dates and patient identifiers. ScoutMyTool's Redact by Pattern tool accepts regex patterns and produces a redacted output plus a report of every match it removed.
- What is HIPAA Safe Harbor de-identification, and how does it relate to PDF redaction?
- HIPAA's Privacy Rule (45 CFR § 164.514(b)(2)) lists 18 specific identifiers that must be removed for a record to be considered de-identified under the Safe Harbor method — including names, geographic subdivisions smaller than a state, dates more precise than year, telephone and fax numbers, email addresses, SSNs, MRNs, account numbers, biometric identifiers, photographs, and several others. For medical records distributed for research or analytics, every page of every record must have all 18 identifiers properly redacted (not just visually hidden) for the file to be safely shareable under Safe Harbor. ScoutMyTool's HIPAA Redact tool runs pattern-based redaction tuned for these 18 categories.
- How can I verify a redaction is actually permanent?
- Start with three core checks. First, try to drag-select across the redacted region in a PDF viewer; if no text highlights, the visible page is clean. Second, run pdftotext on the file (pdftotext redacted.pdf - | grep -i "<sensitive_term>"); if grep returns no matches, the content stream is clean. Third, inspect the document metadata with pdfinfo and check for any author, keyword, or producer fields that might still leak information. Then work through the rest of the six verification checks: embedded files, hidden layers and form fields, and JavaScript and named destinations. Fail any one of these and the redaction is incomplete.
- Should I "flatten" the PDF after redacting?
- Yes, as a safety belt. Flattening collapses annotations, form fields, and layers into the underlying page content, which greatly reduces the chance that a hidden layer or annotation carries the redacted information. A redaction tool that does its job correctly already removes the underlying content, but flattening after the fact is cheap insurance and a common requirement in legal-production protocols. ScoutMyTool's PDF Form Flatten tool handles this step; the result is static page content with no separate annotation or form objects left to toggle, edit, or remove.
Redact your PDF properly, free
Browser-based permanent redaction — content removed from the page content stream, not just overlaid. Nothing is uploaded.