As the title suggests, I am looking for a way to read a PDF, redact certain words (black them out), and save the PDF file. I think it's possible, I just don't know how. Any help or tips are highly appreciated!
1 Answer
Disclaimer: I am the author of borb, the library used in this answer.
The general idea for this solution is to first determine which rectangular areas need to be redacted and then, in a second step, apply those redactions.
The first step can be achieved using RegularExpressionTextExtraction. This class looks through an entire PDF, one page at a time, matching a regular expression. It then spits out a list of matches (each containing the rectangular areas it matched).
Here's an example of that particular code.
import typing

# borb import paths as of borb 2.x; they may differ in other versions
from borb.pdf import Document
from borb.pdf import PDF
from borb.toolkit import RegularExpressionTextExtraction

# read the Document
# fmt: off
doc: typing.Optional[Document] = None
l: RegularExpressionTextExtraction = RegularExpressionTextExtraction("[lL]orem .* [dD]olor")
with open("input.pdf", "rb") as in_file_handle:
    doc = PDF.loads(in_file_handle, [l])
# fmt: on

# check whether we have read a Document
assert doc is not None

# print matching groups (page 0 only), with the bounding boxes of each match
for i, m in enumerate(l.get_matches_for_page(0)):
    print("%d %s" % (i, m.group(0)))
    for r in m.get_bounding_boxes():
        print(
            "\t%f %f %f %f" % (r.get_x(), r.get_y(), r.get_width(), r.get_height())
        )
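One detail worth checking before redacting is how much text the pattern actually covers. The sketch below uses only Python's standard re module on a made-up sentence, assuming borb interprets the pattern the same way (an assumption about its internals, not something the answer states). It shows that the greedy `.*` stretches to the last "dolor", while a lazy `.*?` variant yields one match per phrase.

```python
import re

# Made-up sample text, purely for illustration.
text = "Lorem ipsum dolor sit amet, lorem ipsum dolor."

# The pattern from the answer: ".*" is greedy, so it runs to the LAST " dolor".
greedy = [m.group(0) for m in re.finditer("[lL]orem .* [dD]olor", text)]

# A lazy variant: ".*?" stops at the FIRST " dolor" after each "lorem".
lazy = [m.group(0) for m in re.finditer("[lL]orem .*? [dD]olor", text)]

print(greedy)
print(lazy)
```

With the greedy form the whole span between the first "Lorem" and the last "dolor" would be redacted, so the lazy form may be closer to what you want when a phrase can occur more than once on a line.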
Next up is adding a RedactAnnotation to each Page.
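from decimal import Decimal

# borb import paths as of borb 2.x; they may differ in other versions
from borb.pdf.canvas.geometry.rectangle import Rectangle
from borb.pdf.canvas.layout.annotation.redact_annotation import RedactAnnotation

# take the first Page of the Document read earlier
page = doc.get_page(0)

# mark a (hard-coded) area for redaction, grown by 2 units as padding
page.add_annotation(
    RedactAnnotation(
        Rectangle(Decimal(405), Decimal(721), Decimal(40), Decimal(8)).grow(
            Decimal(2)
        )
    )
)

# store
with open("output.pdf", "wb") as out_file_handle:
    PDF.dumps(out_file_handle, doc)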
Now if you want to actually remove the contents, you can simply apply the annotations.
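Putting the two steps together, a minimal sketch might look like the following. It feeds every bounding box found on the first page into a RedactAnnotation instead of the hard-coded Rectangle. The function name `redact_matches`, its parameters, the import paths (taken from borb 2.x) and the call to `Page.apply_redact_annotations()` are assumptions that do not appear in the answer, so check them against your installed version.

```python
from decimal import Decimal


def redact_matches(in_path: str, out_path: str, pattern: str, apply: bool = False) -> int:
    """Add a redaction annotation for every regex match on page 0; return how many."""
    # borb is imported here so the module loads even without borb installed.
    # The module paths are assumptions based on borb 2.x.
    from borb.pdf import PDF
    from borb.pdf.canvas.layout.annotation.redact_annotation import RedactAnnotation
    from borb.toolkit import RegularExpressionTextExtraction

    # Step 1: read the document while the extractor listens for matches.
    extractor = RegularExpressionTextExtraction(pattern)
    with open(in_path, "rb") as in_file_handle:
        doc = PDF.loads(in_file_handle, [extractor])
    assert doc is not None

    # Step 2: one annotation per bounding box (only page 0 in this sketch).
    page = doc.get_page(0)
    count = 0
    for match in extractor.get_matches_for_page(0):
        for box in match.get_bounding_boxes():
            # pad each box slightly, like the hard-coded example does
            page.add_annotation(RedactAnnotation(box.grow(Decimal(2))))
            count += 1

    if apply:
        # Assumption: this call removes the content under the annotations.
        page.apply_redact_annotations()

    # Store the result.
    with open(out_path, "wb") as out_file_handle:
        PDF.dumps(out_file_handle, doc)
    return count
```

Calling `redact_matches("input.pdf", "output.pdf", "[lL]orem .* [dD]olor")` would only mark the areas; passing `apply=True` is meant to also remove the underlying content.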

Joris Schellekens
I do get an output.pdf, but I don't see any black highlighted rectangles. Is there some code that has to be added? – Dion Oct 09 '22 at 19:08