I am trying to replace some text in a PDF with Python. The following code updates the content correctly while it is in memory, but the original content reappears once I write it to a file:

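# Imports were omitted in the original post; pypdf is assumed here
import os

from pypdf import PdfReader, PdfWriter
from pypdf.generic import DecodedStreamObject, EncodedStreamObject, NameObject


def replace_text(content, replacements=None):
    # Apply every placeholder -> value pair rather than one hard-coded value
    for placeholder, value in (replacements or {}).items():
        content = content.replace(placeholder, value)
    return content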


def process_data(content,text, replacements):
    data = content.get_data()
    decoded_data = data.decode('utf-8')
    replaced_data = replace_text(text, replacements)
    

    encoded_data = replaced_data.encode('utf-8')
    # print(encoded_data)
    
    if content.decoded_self is not None:
        content.decoded_self.set_data(encoded_data)
    else:
        content.set_data(encoded_data)  # set_data expects bytes, not str

def get_pdf_encoding(pdf_reader):
    for page_num in range(len(pdf_reader.pages)):
        page = pdf_reader.pages[page_num]
        font_objects = page['/Resources']['/Font']

        for font_key in font_objects.keys():
            font_obj = font_objects[font_key]
            encoding = font_obj.get('/Encoding')
            if encoding:
                return encoding

    return None

if __name__ == "__main__":
    
    in_file ="Certificate.pdf"
    filename_base = in_file.replace(os.path.splitext(in_file)[1], "")

    # Provide replacements list that you need here
    replacements = {
        "<NAME>": "John Doe",
    # Add other placeholders and their replacements here
    }

    pdf = PdfReader(in_file)
    
    encoding=get_pdf_encoding(pdf)
    print(encoding)
    for page_number in range(0, len(pdf.pages)):

        page = pdf.pages[page_number]
        text = page.extract_text()
        contents=page.get_contents()        

        # process_data(contents,text, replacements,encoding)

        if isinstance(contents, DecodedStreamObject) or isinstance(contents, EncodedStreamObject):
            process_data(contents,text, replacements)
        elif len(contents) > 0:
            for obj in contents:
                if isinstance(obj, DecodedStreamObject) or isinstance(obj, EncodedStreamObject):
                    streamObj = obj.get_object()
                    process_data(streamObj,text, replacements)

        page[NameObject("/Contents")] = contents
        print(page.get_contents().get_data())
        writer = PdfWriter()
        writer.add_page(page)

        with open("updatedcertificate5.pdf", 'wb') as out_file:
            writer.write(out_file)

When I print the page data at the end of this code, I get the updated value. After writing to the PDF, however, it reverts to the original content. Can someone point out what the issue might be, or what I could do differently to get this working? Here is a screenshot of my file directory:

[screenshot of file directory]

user2586942

1 Answer

There are many ways to accomplish this, but here is a simple solution that shows how to replace text in a PDF you read in and then write the result to a new file.

I have 2 examples here. One uses PyMuPDF and the other uses pypdf and fpdf.

First, in whichever directory you are running your code from, place this PDF and name the file test.pdf:

Test Replacement
Name: {{ name }}

Now, here is the code to change {{ name }} to whatever you want.

Using PyMuPDF

https://pypi.org/project/PyMuPDF/

I was testing out this package earlier, and altering text in a PDF with it is very easy (likely easier than most other options I've seen), so I figured I'd show a quick 10-line solution using PyMuPDF. It takes three steps: 1) find the search term, 2) redact the text, and 3) insert new text at the same location. The rest of the page keeps its formatting. As far as I can tell from the documentation, the package seems pretty robust.

import fitz  # import PyMuPDF...for whatever reason it is called fitz

doc = fitz.open("test.pdf")  # the file with the text you want to change
search_term = "{{ name }}"
for page in doc:
    found = page.search_for(search_term)  # list of rectangles where to replace
    for item in found:
        page.add_redact_annot(item, '')  # create redaction for text
        page.apply_redactions()  # apply the redaction now
        page.insert_text(item.bl - (0, 3), "We Changed The Text Using PyMuPDF!")

doc.save("text_change.pdf")

Using PyPDF and FPDF

from fpdf import FPDF
import pypdf

# Open sample pdf to replace text in pdf
pdf_file = pypdf.PdfReader("test.pdf")  # PdfReader accepts a path, so no file handle is left open

# get the text needed
page = pdf_file.pages[0]
text = page.extract_text()

# replace data needed from the file
text = text.replace("{{ name }}", "Whatever Name You Want")

# Create PDF object
pdf = FPDF()
# Write text pages to PDF
pdf.set_font("Arial", size=12)
pdf.add_page()
pdf.multi_cell(200, 10, txt=text, align="L")
# Save PDF in same directory, but now it is filled
pdf.output("output_file.pdf", "F")

Now in the same directory you have a file named output_file.pdf and it looks like this:

Test Replacement
Name: Whatever Name You Want
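
Both examples above replace a single placeholder. For several placeholders, as in the question's replacements dictionary, a helper along these lines might be applied to the extracted text before writing it out. This is only a minimal sketch: replace_placeholders is a hypothetical name (not part of pypdf, fpdf, or PyMuPDF), and it assumes each placeholder appears in the extracted text exactly as written, which may not hold if text extraction changes the whitespace.

```python
import re


def replace_placeholders(text, replacements):
    """Return `text` with every key of `replacements` swapped for its value."""
    if not replacements:
        return text
    # Try longer placeholders first so a short key cannot match inside a longer one.
    keys = sorted(replacements, key=len, reverse=True)
    pattern = re.compile("|".join(re.escape(key) for key in keys))
    # A single pass over the text: replaced values are never re-scanned.
    return pattern.sub(lambda match: replacements[match.group(0)], text)


sample = "Test Replacement\nName: {{ name }}"
print(replace_placeholders(sample, {"{{ name }}": "John Doe"}))
```

Because the substitution happens in one pass, a replacement value that happens to contain another placeholder is left alone, unlike chained str.replace calls.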
ViaTech
  • Thanks @ViaTech for posting the solution. But will this not override the formatting of the original PDF? – user2586942 Aug 07 '23 at 16:21
  • @user2586942 I have added a new solution with PyMuPDF that keeps the formatting. It is a newer package that I just messed around with today – ViaTech Aug 08 '23 at 17:07