
I'm using Python and the PyMuPDF library to search for and replace text in PDF files. It works properly, except that when the original text is colored, the replacement text does not keep that color or style. How can I fix this?

Here's the code I'm currently using:


import os
import fitz

# Prompt user for input of file name
file_name_input = input("Enter the start or any part of the file name: ")

# Get a list of PDF files in the current directory matching the file name input
pdf_files = [f for f in os.listdir() if f.lower().endswith('.pdf') and file_name_input.lower() in f.lower()]

if not pdf_files:
    print("No PDF files found matching the file name input")
else:
    # Prompt user for input of search and replace text
    search_replace_list = []
    while True:
        search_text = input("Enter the search text (leave blank to exit): ")
        if not search_text:
            break
        replace_text = input("Enter the replace text: ")
        search_replace_list.append((search_text, replace_text))

    for file_name in pdf_files:
        pdf_file = fitz.open(file_name)
        found = False
        for page in pdf_file:
            for search_text, replace_text in search_replace_list:
                # quads=True returns one Quad per hit
                hits = page.search_for(search_text.strip(), quads=True)
                if hits:
                    found = True
                    for quad in hits:
                        page.add_redact_annot(quad, text=replace_text)
                    # Apply the redactions once, leaving overlapping images untouched
                    page.apply_redactions(images=fitz.PDF_REDACT_IMAGE_NONE)

        if found:
            output_file_name = file_name[:-4] + '_modified.pdf'
            pdf_file.save(output_file_name, garbage=False, deflate=True, encryption=False)
            print(f"Changes saved to {output_file_name}")
        else:
            print(f"No search text found in {file_name}")

        pdf_file.close()
Tim Roberts
Hetul
    You are right: If you leave the replacement text insertion to the redaction logic, then this comes with a number of drawbacks - among them missing color support, font support and imprecise insertion point. The only way out is using a 3-step strategy: (1) extract metadata of to-be-removed text (font, color, insertion point / bbox, etc.) (2) remove it via redaction, (3) insert new text at old position using previous metadata as desired. – Jorj McKie Jun 27 '23 at 13:37
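
The 3-step strategy from the comment above could look roughly like the following. This is only a minimal sketch, not a tested solution: the function names `srgb_int_to_rgb` and `replace_keeping_color` are made up for illustration, it only matches search text that sits inside a single text span, it rewrites the whole span, and it falls back to the built-in Helvetica font (`"helv"`) instead of recovering the original font. It assumes the span dictionaries returned by `page.get_text("dict")` carry `bbox`, `origin`, `size`, `color` and `text`.

```python
def srgb_int_to_rgb(color):
    """Convert an sRGB integer (as found in a span's "color") to (r, g, b) floats in 0..1."""
    return ((color >> 16) & 255) / 255, ((color >> 8) & 255) / 255, (color & 255) / 255


def replace_keeping_color(page, search_text, replace_text, fontname="helv"):
    """Replace search_text on one page, reusing each span's color, size and origin.

    Hypothetical helper: returns the number of spans that were rewritten.
    """
    import fitz  # PyMuPDF; imported here so the color helper works without it

    # Step 1: collect the metadata of every span that contains the search text
    spans = []
    for block in page.get_text("dict")["blocks"]:
        for line in block.get("lines", []):  # image blocks have no "lines"
            for span in line["spans"]:
                if search_text in span["text"]:
                    spans.append(span)
    if not spans:
        return 0

    # Step 2: remove the old spans via redaction, without any replacement text
    for span in spans:
        page.add_redact_annot(fitz.Rect(span["bbox"]))
    page.apply_redactions(images=fitz.PDF_REDACT_IMAGE_NONE)

    # Step 3: insert the new text at the old origin with the old size and color
    for span in spans:
        page.insert_text(
            fitz.Point(span["origin"]),
            span["text"].replace(search_text, replace_text),
            fontsize=span["size"],
            fontname=fontname,  # assumption: a Base-14 font stands in for the original
            color=srgb_int_to_rgb(span["color"]),
        )
    return len(spans)
```

In the question's code this would be called as `replace_keeping_color(page, search_text.strip(), replace_text)` inside the page loop, in place of the `add_redact_annot(..., text=replace_text)` and `apply_redactions` calls. Note that redacting the span's bounding box may also remove neighboring characters that overlap it, and that the redaction paints a white fill by default.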

1 Answer


Disclaimer: I am the author of borb, the library used in this answer.

borb comes with a tool called SimpleFindReplace.

It does exactly what you'd expect, and I think the latest version matches font, color, etc. (but I may be wrong).

#!chapter_007/src/snippet_013.py
from borb.pdf import Document
from borb.pdf import PDF
from borb.toolkit import SimpleFindReplace

import typing


def main():

    # attempt to read a PDF
    doc: typing.Optional[Document] = None
    with open("output.pdf", "rb") as pdf_file_handle:
        doc = PDF.loads(pdf_file_handle)

    # check whether we actually read a PDF
    assert doc is not None

    # find/replace
    doc = SimpleFindReplace.sub("Jots", "Joris", doc)

    # store
    with open("output2.pdf", "wb") as pdf_file_handle:
        PDF.dumps(pdf_file_handle, doc)


if __name__ == "__main__":
    main()

For the input document:

[image: input document]

It produces the following output document:

[image: output document]

Joris Schellekens