I am struggling to remove text from a PDF file. I know this can be done manually with PDF editors, but I have a few PDF files to modify. The code I have so far is able to recognise all the text in a PDF file, but it does not remove the text when the output file is written.

Here is the script I tried. EDIT: I do not want to redact a PDF; I want to remove the text from the PDF.

import PyPDF2

# Open the PDF file in read-binary mode
with open('C:/inputput.pdf', 'rb') as pdf_file:
    # Create a PdfFileReader object to read the PDF file
    pdf_reader = PyPDF2.PdfFileReader(pdf_file)

    # Get the first page of the PDF file
    page = pdf_reader.getPage(0)

    # Get the page's content as a string
    page_content = page.extractText()

    # Replace the text to be removed with an empty string
    modified_content = page_content.replace('words to replace', '')

    # If the entire text box was removed, remove the text box itself
    if not modified_content.strip():
        page.getContents().getObject().update({PyPDF2.utils.b_("Filter"): PyPDF2.utils.b_("FlateDecode"), PyPDF2.utils.b_("Length"): 0})
    
    # Replace the page's content with the modified content

    # Create a PdfFileWriter object to write the modified PDF to a new file
    pdf_writer = PyPDF2.PdfFileWriter()
    pdf_writer.addPage(page)

    # Save the modified PDF to a new file
    with open('output_file22.pdf', 'wb') as output_file:
        pdf_writer.write(output_file)

1 Answer

Disclaimer: I am the author of borb, the library used in this answer.

Removing text from a PDF is called redaction. This should help you along when you're googling the problem.

If you're open to using another library, you can do this using borb with the following code:

#!chapter_005/src/snippet_006.py
import typing
from borb.pdf import Document
from borb.pdf import PDF
from borb.toolkit import RegularExpressionTextExtraction
from borb.pdf.canvas.layout.annotation.redact_annotation import RedactAnnotation

def main():

    # read the Document
    # extract all occurrences of a regular expression
    doc: typing.Optional[Document] = None
    l: RegularExpressionTextExtraction = RegularExpressionTextExtraction("[lL]orem .* [dD]olor")
    with open("output.pdf", "rb") as in_file_handle:
        doc = PDF.loads(in_file_handle, [l])

    # add a redaction annotation for each matching group
    # this code only does so on page 0, but you can of course modify that
    assert doc is not None
    page = doc.get_page(0)
    for m in l.get_matches()[0]:
        for r in m.get_bounding_boxes():
            page.add_annotation(RedactAnnotation(r))

    # apply redaction annotations
    # Redaction is a two-step process according to the PDF specification.
    # This allows multiple users to collaborate on redacting a document;
    # you can still see where other users want to remove text all
    # the way up until you apply the redaction annotations.
    doc.get_page(0).apply_redact_annotations()

    # store the modified PDF
    with open("output.pdf", "wb") as out_file_handle:
        PDF.dumps(out_file_handle, doc)

if __name__ == "__main__":
    main()

To learn more about redaction and borb, check out the examples repository here.
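
For some intuition about why replacing a string returned by extractText() changes nothing in the saved file: the visible text lives in the page's content stream as operators such as `(some text) Tj`, and the extracted string is only a copy. The snippet below is a minimal sketch of that idea using only the standard library. It is not borb or PyPDF2 code, and the function name `blank_text_operations` and the hand-written stream are made up for illustration. It assumes an uncompressed stream that only uses literal strings with the `Tj` operator; real PDFs usually compress their streams and also use `TJ` arrays, hex strings, and font-specific encodings, which is why a library is the practical route.

```python
import re

# Matches a literal string followed by the Tj (show text) operator,
# e.g. b"(Hello) Tj". Backslash escapes inside the string are allowed;
# nested unescaped parentheses, hex strings and TJ arrays are NOT handled.
SHOW_TEXT = re.compile(rb"\((?:[^()\\]|\\.)*\)\s*Tj")


def blank_text_operations(stream: bytes, target: bytes) -> bytes:
    """Return a copy of `stream` where every Tj operation whose string
    contains `target` shows an empty string instead (hypothetical helper)."""

    def blank(match):
        # Keep the operator itself so the surrounding stream stays valid.
        return b"() Tj" if target in match.group(0) else match.group(0)

    return SHOW_TEXT.sub(blank, stream)


# A hand-written, uncompressed content stream with two text objects
# (illustrative only, not taken from a real file).
stream = (
    b"BT /F1 12 Tf 72 720 Td (words to replace) Tj ET\n"
    b"BT /F1 12 Tf 72 700 Td (keep this line) Tj ET\n"
)

cleaned = blank_text_operations(stream, b"words to replace")
print(cleaned.decode("latin-1"))
```

The point of the sketch is that the bytes of the stream have to be rewritten, and the rewritten stream has to be stored back in the page, before any text disappears from the output file. Applying redaction annotations does this at the library level.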

Joris Schellekens
  • Thank you for your feedback. The code you supplied does not work; it gives the following error: File "C:\ProgramData\Anaconda3\lib\site-packages\borb\pdf\canvas\geometry\rectangle.py", line 25, in __init__ assert width >= 0, "A Rectangle must have a non-negative width." AssertionError: A Rectangle must have a non-negative width. I do not want to redact text from a PDF; I want to remove it completely. – Ryno Smith Mar 08 '23 at 15:34
  • Adding redaction annotations, applying those annotations, and then saving the modified file will remove the text. – Joris Schellekens Mar 08 '23 at 19:01
  • Can you share the PDF you're using? – Joris Schellekens Mar 22 '23 at 10:30