Redact certain words from PDF-file in Python

Question

As the title suggests, I am looking for a way to read a PDF, redact certain words (black them out), and save the PDF file. I think it's possible, I just don't know how. Any help or tips are highly appreciated!

score 0 · Answer 1 · answered Sep 17 '22 at 13:12

Disclaimer: I am the author of borb, the library used in this answer.

The general idea is to first determine which rectangular areas need to be redacted and then, in a second step, apply those redactions.

The first step can be achieved using RegularExpressionTextExtraction. This class looks through an entire PDF, one page at a time, matching a regular expression. It then spits out a list of matches (containing the rectangular area they matched).

Here's an example of that particular code.

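# imports (module paths as of borb 2.x; they may differ in other versions)
import typing

from borb.pdf import Document
from borb.pdf import PDF
from borb.toolkit.text.regular_expression_text_extraction import (
    RegularExpressionTextExtraction,
)

# read the Document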
# fmt: off
doc: typing.Optional[Document] = None
l: RegularExpressionTextExtraction = RegularExpressionTextExtraction("[lL]orem .* [dD]olor")
with open("input.pdf", "rb") as in_file_handle:
    doc = PDF.loads(in_file_handle, [l])
# fmt: on

# check whether we have read a Document
assert doc is not None

# print matching groups
for i, m in enumerate(l.get_matches_for_page(0)):
    print("%d %s" % (i, m.group(0)))
    for r in m.get_bounding_boxes():
        print(
            "\t%f %f %f %f" % (r.get_x(), r.get_y(), r.get_width(), r.get_height())
        )
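One thing to watch in the pattern above: `.*` is greedy, so a single match can stretch from the first "lorem" to the last "dolor" in the text and redact more than intended. The sketch below illustrates this with Python's built-in `re` module on a made-up sample string. It assumes the pattern behaves the same way inside borb, which I have not verified for every version.

```python
import re

# made-up sample text, only for illustration
text = "Lorem ipsum dolor sit amet, lorem ipsum dolor."

# greedy: ".*" grabs as much as possible before the final " dolor"
greedy = [m.group(0) for m in re.finditer(r"[lL]orem .* [dD]olor", text)]

# lazy: ".*?" stops at the first " dolor" it can reach
lazy = [m.group(0) for m in re.finditer(r"[lL]orem .*? [dD]olor", text)]

print(greedy)  # one long match
print(lazy)    # two short matches
```

If only specific words should be blacked out, a tighter pattern (for example an alternation of the exact words) is usually the safer choice.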

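Next up is adding a RedactAnnotation to each Page. The rectangle below is hard-coded for illustration; in practice you would use the bounding boxes found in the first step.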

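from decimal import Decimal

from borb.pdf.canvas.geometry.rectangle import Rectangle
from borb.pdf.canvas.layout.annotation.redact_annotation import RedactAnnotation

# annotate the first page of the Document read earlier
page = doc.get_page(0)
page.add_annotation(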
    RedactAnnotation(
        Rectangle(Decimal(405), Decimal(721), Decimal(40), Decimal(8)).grow(
            Decimal(2)
        )
    )
)

# store
with open("output.pdf", "wb") as out_file_handle:
    PDF.dumps(out_file_handle, doc)

A RedactAnnotation only marks an area as intended for redaction. If you want to actually remove the contents, you need to apply the annotations before storing the Document.
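Putting the steps together, a minimal sketch might look like the following. The helper name `redact_pdf` is hypothetical, it only handles the first page, and the call `apply_redact_annotations()` as well as the import paths are assumptions based on borb 2.x, so check them against the version you have installed.

```python
from decimal import Decimal


def redact_pdf(in_path: str, out_path: str, pattern: str) -> int:
    """Redact every match of `pattern` on the first page; return the match count."""
    # borb is imported here; module paths are assumed from borb 2.x
    from borb.pdf import PDF
    from borb.pdf.canvas.layout.annotation.redact_annotation import RedactAnnotation
    from borb.toolkit.text.regular_expression_text_extraction import (
        RegularExpressionTextExtraction,
    )

    # step 1: find the rectangles that need to be redacted
    extractor = RegularExpressionTextExtraction(pattern)
    with open(in_path, "rb") as in_file_handle:
        doc = PDF.loads(in_file_handle, [extractor])
    assert doc is not None

    # step 2: mark each rectangle with a RedactAnnotation
    page = doc.get_page(0)
    count = 0
    for match in extractor.get_matches_for_page(0):
        count += 1
        for box in match.get_bounding_boxes():
            page.add_annotation(RedactAnnotation(box.grow(Decimal(2))))

    # step 3 (assumed API): apply the annotations so the content is removed
    page.apply_redact_annotations()

    # store
    with open(out_path, "wb") as out_file_handle:
        PDF.dumps(out_file_handle, doc)
    return count
```

Usage would be something like `redact_pdf("input.pdf", "output.pdf", "[lL]orem")`. To cover a whole document you would repeat steps 2 and 3 for every page.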

I do get an Output.pdf, but I don't see any black highlighted rectangles. Is there some code that has to be added? — Dion, Oct 09 '22 at 19:08
