How to Replace text in PDF using Python

Question

I am trying to replace some text in a PDF using Python. The following code updates the PDF content correctly while it is in memory, but the original content is what ends up in the file when I write it out:

def replace_text(content, replacements = dict()):
    content=content.replace("<NAME>","Test")
    return content


def process_data(content,text, replacements):
    data = content.get_data()
    decoded_data = data.decode('utf-8')
    replaced_data = replace_text(text, replacements)
    

    encoded_data = replaced_data.encode('utf-8')
    # print(encoded_data)
    
    if content.decoded_self is not None:
        content.decoded_self.set_data(encoded_data)
    else:
        content.set_data(replaced_data)

def get_pdf_encoding(pdf_reader):
    for page_num in range(len(pdf_reader.pages)):
        page = pdf_reader.pages[page_num]
        font_objects = page['/Resources']['/Font']

        for font_key in font_objects.keys():
            font_obj = font_objects[font_key]
            encoding = font_obj.get('/Encoding')
            if encoding:
                return encoding

    return None

if __name__ == "__main__":
    
    in_file ="Certificate.pdf"
    filename_base = in_file.replace(os.path.splitext(in_file)[1], "")

    # Provide replacements list that you need here
    replacements = {
        "<NAME>": "John Doe",
    # Add other placeholders and their replacements here
    }

    pdf = PdfReader(in_file)
    
    encoding=get_pdf_encoding(pdf)
    print(encoding)
    for page_number in range(0, len(pdf.pages)):

        page = pdf.pages[page_number]
        text = page.extract_text()
        contents=page.get_contents()        

        # process_data(contents,text, replacements,encoding)

        if isinstance(contents, DecodedStreamObject) or isinstance(contents, EncodedStreamObject):
            process_data(contents,text, replacements)
        elif len(contents) > 0:
            for obj in contents:
                if isinstance(obj, DecodedStreamObject) or isinstance(obj, EncodedStreamObject):
                    streamObj = obj.getObject()
                    process_data(streamObj,text, replacements)

        page[NameObject("/Contents")] = contents
        print(page.get_contents().get_data())
        writer = PdfWriter()
        writer.add_page(page)

        with open("updatedcertificate5.pdf", 'wb') as out_file:
            writer.write(out_file)

When I print the page data at the end of this code, I get the updated value. However, after writing the data to a PDF, it reverts to the original content. Can someone point out what the issue might be, or what I can do differently to get this working? Here is a screenshot of my file directory:

Can we see your directory structure? With the new files and old files and all? Are you sure you are opening the right file? — ViaTech, Jul 30 '23 at 14:23
Yes, I deleted and recreated the file multiple times. It is not getting updated. I updated the question with a snippet of my file directory — user2586942, Jul 30 '23 at 14:46
Okay, if you have not figured out a solution, I will post one. — ViaTech, Aug 04 '23 at 18:48

ViaTech · Answer 1 · 2023-08-08T21:25:45.560

There are many ways to accomplish this, but here is a simple solution that shows you how to replace text from a PDF you read in, and then write the data out to a new file with the replacements added.

I have 2 examples here. One uses PyMuPDF and the other uses pypdf and fpdf.

First, in the directory you run your code from, place a PDF with the following content and name the file test.pdf:

Test Replacement
Name: {{ name }}

Now, here is the code to change {{ name }} to whatever you want.

Using PyMuPDF

https://pypi.org/project/PyMuPDF/

I was testing out this package earlier, and altering text in a PDF with it is very easy (likely easier than most other options I've seen), so I figured I'd show a quick 10-line solution using PyMuPDF. It takes 3 steps: 1) find the search term, 2) redact the text, 3) insert new text in the same location. This keeps the rest of the formatting as it was too. As far as I can tell from the documentation, the package seems pretty robust.

import fitz  # import PyMuPDF...for whatever reason it is called fitz

doc = fitz.open("test.pdf")  # the file with the text you want to change
search_term = "{{ name }}"
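for page in doc:
    found = page.search_for(search_term)  # list of rectangles covering each match
    for item in found:
        page.add_redact_annot(item, '')  # mark the matched text for redaction
    # apply once per page, before inserting, so new text is never redacted again
    page.apply_redactions()
    for item in found:
        # draw the new text slightly above the bottom-left corner of the old text
        page.insert_text(item.bl - (0, 3), "We Changed The Text Using PyMuPDF!")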

doc.save("text_change.pdf")

Using PyPDF and FPDF

from fpdf import FPDF
import pypdf

# Open the sample PDF whose text will be replaced
pdf_file = pypdf.PdfReader("test.pdf")

# get the text needed
page = pdf_file.pages[0]
text = page.extract_text()

# replace data needed from the file
text = text.replace("{{ name }}", "Whatever Name You Want")

# Create PDF object
pdf = FPDF()
# Write text pages to PDF
pdf.set_font("Arial", size=12)
pdf.add_page()
pdf.multi_cell(0, 10, txt=text, align="L")  # width 0 = extend to the right margin
# Save PDF in same directory, but now it is filled
pdf.output("output_file.pdf", "F")

Now in the same directory you have a file named output_file.pdf and it looks like this:

Test Replacement
Name: Whatever Name You Want
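The snippet above replaces one hard-coded placeholder, while the question passes a dictionary of them. A minimal sketch of applying a whole dictionary to the extracted text before handing it to FPDF follows. Here apply_replacements is a hypothetical helper name, and the {{ date }} placeholder is made up for illustration; neither comes from pypdf or fpdf.

```python
def apply_replacements(text, replacements):
    """Return `text` with every placeholder key replaced by its value."""
    # Replace longer placeholders first, so a short key that happens to be
    # a substring of a longer key cannot break the longer one.
    for placeholder in sorted(replacements, key=len, reverse=True):
        text = text.replace(placeholder, replacements[placeholder])
    return text


# Stand-in for the string returned by page.extract_text()
extracted = "Test Replacement\nName: {{ name }}\nDate: {{ date }}"
filled = apply_replacements(
    extracted,
    {"{{ name }}": "John Doe", "{{ date }}": "2023-08-08"},
)
print(filled)
```

The returned string can be passed to pdf.multi_cell in place of text. This still rebuilds the PDF from plain text, so the original layout is not preserved.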

Thanks @ViaTech for posting the solution. But will this not override the formatting of the original PDF? — user2586942, Aug 07 '23 at 16:21
@user2586942 I have added a new solution with PyMuPDF that keeps the formatting. It is a newer package that I just messed around with today — ViaTech, Aug 08 '23 at 17:07
