How can I determine whether a PDF page contains redacted material?

Asked Aug 08 '19 at 18:12

I have a set of PDFs in which some pages have had part of their content redacted using Adobe Acrobat. I would like to iterate through each page programmatically and determine whether it contains redacted content, preferably using Python. (Iterating through the pages is not the problem; detecting the presence of redacted content is.)

I've used PyMuPDF's getText() function to check the PDF's text layer for any "ghost" indicators of redacted space, but there don't seem to be any clues. Is there any other data hidden in the PDF that I could extract and that would point to a redaction layer?

crkm

If the PDFs were redacted with redaction codes placed where the content was removed, you could look for those codes in the text. Otherwise, you won't be able to locate any text that indicates a redaction, since removing the text is the entire point of redaction. – joelgeraci Aug 08 '19 at 21:04
I'm trying to do something similar: I want to extract text from PDFs with redactions and note REDACTED wherever a redaction exists. Did you have any luck finding a way to do this in Python? – chia berry Jul 24 '20 at 19:25
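
The suggestion in the comments above (searching a page's extracted text for redaction codes) could be sketched roughly as follows. This is only an illustration: the pattern assumes FOIA-style exemption codes such as "(b)(6)", which is a guess and not something the question states, and the names `REDACTION_CODE`, `find_redaction_codes` and `has_redaction_codes` are hypothetical. It only helps if the redaction tool actually wrote codes into the text layer; it cannot detect redactions that left no text behind.

```python
import re

# Hypothetical pattern: FOIA-style exemption codes such as "(b)(6)" or "(b)(7)(C)".
# Swap in whatever codes your documents actually use.
REDACTION_CODE = re.compile(r"\(b\)\s*\(\d\)(?:\s*\([A-F]\))?")


def find_redaction_codes(page_text):
    """Return every redaction code found in one page's extracted text."""
    return [m.group(0) for m in REDACTION_CODE.finditer(page_text)]


def has_redaction_codes(page_text):
    """Return True if the page text contains at least one redaction code."""
    return REDACTION_CODE.search(page_text) is not None


# In practice page_text would come from the extraction step already in use
# (for example the getText() call mentioned in the question).
sample = "Interview with (b)(6) conducted on 3 May. Source: (b)(7)(C)."
print(find_redaction_codes(sample))
print(has_redaction_codes("No codes on this page."))
```

Calling `has_redaction_codes` on each page's text while looping over the document would flag the pages worth marking as REDACTED, under the assumption that codes are present at all.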

0 Answers