How to match placement, font, and size of replaced text with search text in PDF files using Python?

Question

I'm using Python and the PyMuPDF library to search for and replace text in PDF files. My code finds and replaces the text successfully, but the font and size of the replacement text are different from those of the search text. I want the replacement text to have the same font and size as the search text. Can someone please help me modify my code to achieve this?

Here's the code I'm currently using:


import os
import fitz

# Prompt user for input of file name
file_name_input = input("Enter a start or text of the file name: ")

# Get a list of PDF files in the current directory matching the file name input
pdf_files = [f for f in os.listdir() if f.lower().endswith('.pdf') and file_name_input.lower() in f.lower()]

if not pdf_files:
    print("No PDF files found matching the file name input")
else:
    # Prompt user for input of search and replace text
    search_replace_list = []
    while True:
        search_text = input("Enter the search text (leave blank to exit): ")
        if not search_text:
            break
        replace_text = input("Enter the replace text: ")
        search_replace_list.append((search_text, replace_text))

    for file_name in pdf_files:
        pdf_file = fitz.open(file_name)
        found = False
        for page in pdf_file:
            for search_text, replace_text in search_replace_list:
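                # quads=True returns one quad per hit; hit_max and quads_tol were dropped
                # because current PyMuPDF's search_for() does not document them
                draft = page.search_for(search_text.strip(), quads=True)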
                if draft:
                    found = True
                    for rect in draft:
                        annot = page.add_redact_annot(rect, text=replace_text)
                    # Apply once: a second call would find no redaction annotations left
                    page.apply_redactions(images=fitz.PDF_REDACT_IMAGE_NONE)

        if found:
            output_file_name = file_name[:-4] + '_modified.pdf'
            pdf_file.save(output_file_name, garbage=False, deflate=True, encryption=False)
            print(f"Changes saved to {output_file_name}")
        else:
            print(f"No search text found in {file_name}")

        pdf_file.close()

Blank replace text works fine: it whites out or removes the search text. But if I enter any replace text, it appears in a different font and does not match the exact placement, font, and size of the search text.

There is one important thing to bear in mind here. Other than the 3 fundamental Postscript fonts (Helvetica, Times, Courier), all of the font files needed for a document are embedded in the PDF, and in most cases they only embed the characters that are needed for the document. Thus, you might not HAVE the characters you need for arbitrary text. — Tim Roberts, Apr 29 '23 at 18:43
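To see whether a page's fonts are subsets, one option is to look at the base font names: in PDF, a subset font name starts with a tag of six uppercase letters followed by `+` (for example `ABCDEF+Calibri`). The sketch below is only an illustration of that check. `is_subset_font` and `report_fonts` are hypothetical helper names, and it assumes that the fourth item of each `page.get_fonts()` entry is the base font name, as the PyMuPDF documentation describes.

```python
import re

# Subset fonts carry a tag: six uppercase letters, then "+", then the real name.
_SUBSET_TAG = re.compile(r"^[A-Z]{6}\+")


def is_subset_font(basefont):
    """Return True if the base font name carries a subset tag."""
    return bool(_SUBSET_TAG.match(basefont))


def report_fonts(pdf_path):
    """Print every font referenced by each page and whether it looks like a subset."""
    import fitz  # PyMuPDF; imported here so is_subset_font() works without it

    with fitz.open(pdf_path) as doc:
        for page in doc:
            for entry in page.get_fonts():
                xref, basefont = entry[0], entry[3]  # assumed tuple layout
                kind = "subset" if is_subset_font(basefont) else "not tagged as subset"
                print(f"page {page.number}: xref {xref}: {basefont} ({kind})")
```

A font without a subset tag is not guaranteed to contain every character either, so this only flags the most likely problem cases.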
As @TimRoberts said, one problem is getting hold of the same, complete font, because the embedded one is very probably a subset that is unlikely to contain all the characters in the replacement text. If you do have a suitable font, you must then find the correct start point for the insertion: this is not the bottom of the search hit rect, because of the font's descender value. The hit rect height is `fontsize * (font_ascender - font_descender)`, so your insertion point is `fontsize*font_descender` higher than `rect.bl.y`. Do the insertion after the redaction has been applied. — Jorj McKie, Apr 29 '23 at 20:41
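The arithmetic in the comment above could be sketched roughly as follows. This is not a tested solution. The helper names are hypothetical, the replacement font defaults to the built-in Helvetica (`"helv"`), and the sketch assumes that the original font's ascender and descender are close to those of the replacement font, which will not hold for every document. It also assumes the descender is negative, as PyMuPDF's `Font.descender` usually is, so adding `fontsize * descender` moves the point up from the rectangle bottom.

```python
def fontsize_from_hit(hit_height, ascender, descender):
    """Font size implied by a hit height, from height = size * (asc - desc)."""
    return hit_height / (ascender - descender)


def baseline_y(hit_bottom_y, fontsize, descender):
    """Baseline y of the insertion point.

    y grows downward on a PyMuPDF page and the descender is negative,
    so this lands above the bottom edge of the hit rectangle.
    """
    return hit_bottom_y + fontsize * descender


def replace_keeping_size(page, search_text, replace_text, fontname="helv"):
    """Redact every hit, then write replace_text on the old baseline.

    Returns the number of hits replaced.
    """
    import fitz  # PyMuPDF; imported here so the helpers above work without it

    font = fitz.Font(fontname)  # metrics of the replacement font (assumption)
    hits = page.search_for(search_text)  # list of Rect objects
    if not hits:
        return 0

    # Remove the old text first; no replacement text in the annotation itself.
    for rect in hits:
        page.add_redact_annot(rect)
    page.apply_redactions(images=fitz.PDF_REDACT_IMAGE_NONE)

    # Insert only after the redactions have been applied.
    for rect in hits:
        size = fontsize_from_hit(rect.height, font.ascender, font.descender)
        y = baseline_y(rect.bl.y, size, font.descender)
        page.insert_text(
            fitz.Point(rect.x0, y), replace_text, fontname=fontname, fontsize=size
        )
    return len(hits)
```

If the size derived from the rectangle height looks wrong, the `"size"` value of the matching span in `page.get_text("dict")` is probably a more reliable source. Matching the original typeface still requires a complete copy of that font, as both comments point out.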
