Convert Rect location from PyMuPDF to a page number

I search for certain text, such as "exam", and get the rectangle locations of the matches. I then highlight the text in the PDFs at those locations. I now want to delete all pages that do not contain this text, so I use the doc.select() function to select the pages I want to keep before saving a new PDF that contains only the pages with highlighted text.

The Issue

You have to pass doc.select() a list of the page numbers you want to keep. What I tried to do was pass the list of rectangle coordinates to this function, but I got the following error:

ValueError: bad page number(s)

I now understand that I must be able to convert the coordinates of the rectangles to page numbers. But I do not know how to do this, and it is not mentioned anywhere in the docs (correct me if I am wrong).

Current code

from pathlib import Path
import fitz
directory = "pdfs"
 
# iterate over files in
# that directory
files = Path(directory).glob('*')
for file in files:

    doc = fitz.open(file)

    for page in doc:
        ### SEARCH
        text = "Exam"
        text_instances = page.search_for(text)
        ### HIGHLIGHT
        for inst in text_instances:
            highlight = page.add_highlight_annot(inst)
            highlight.update()


### OUTPUT
doc.select(text_instances)
doc.save("output.pdf", garbage=4, deflate=True, clean=True)


GCIreland

1 Answer

I now understand that I must be able to convert the coordinates of the rectangles to page numbers. But I do not know how to do this, and it is not mentioned anywhere in the docs (correct me if I am wrong).

That is completely wrong! The rectangles returned by text searches are locations on the current page and have nothing to do with page numbers.

You are already iterating over the pages. If your search text is found on a page, put that page's number in a list, then do your highlights. When finished with a document, call select() with the memorized pages, close the document, empty the page selection, then continue with the next document.

Something like this:

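import fitz  # PyMuPDF

text = "Exam"  # search term from the question

for filename in filenamelist:  # filenamelist: your own iterable of PDF paths
    select_pages = []  # reset the selection for each document
    doc = fitz.open(filename)
    for page in doc:
        hits = page.search_for(text)
        if not hits:  # no match on this page, so skip it
            continue
        select_pages.append(page.number)  # remember this page (0-based)
        for rect in hits:
            page.add_highlight_annot(rect)
    doc.select(select_pages)  # keep only the pages that had hits
    doc.save(...)
    doc.close()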
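
For a version that can be run against a whole directory, here is a minimal sketch of the same idea. The function names (pages_with_hits, keep_matching_pages, process_directory) and the "_hits.pdf" output suffix are made up for illustration and are not part of PyMuPDF. Writing one output file per input file, and skipping documents without any match instead of selecting zero pages, are assumptions about what is wanted. The save options are the ones from the question.

```python
from pathlib import Path


def pages_with_hits(doc, text):
    """Return the 0-based numbers of the pages on which `text` occurs.

    `doc` only needs to be iterable and yield pages that offer
    `search_for(text)` and a `number` attribute, as PyMuPDF pages do.
    """
    return [page.number for page in doc if page.search_for(text)]


def keep_matching_pages(src, dst, text):
    """Highlight `text` in `src` and save only the matching pages to `dst`.

    Returns True if a file was written, False if nothing matched.
    """
    import fitz  # PyMuPDF; imported here so pages_with_hits has no dependency

    doc = fitz.open(str(src))
    try:
        keep = pages_with_hits(doc, text)
        if not keep:  # nothing found: skip instead of selecting zero pages
            return False
        for pno in keep:
            page = doc[pno]
            for rect in page.search_for(text):
                page.add_highlight_annot(rect)
        doc.select(keep)  # page numbers, not rectangles
        doc.save(str(dst), garbage=4, deflate=True, clean=True)
        return True
    finally:
        doc.close()


def process_directory(directory, out_dir, text):
    """Run keep_matching_pages on every PDF in `directory` (hypothetical layout)."""
    out = Path(out_dir)
    out.mkdir(exist_ok=True)
    for src in sorted(Path(directory).glob("*.pdf")):
        keep_matching_pages(src, out / f"{src.stem}_hits.pdf", text)
```

Calling process_directory("pdfs", "out", "Exam") would then mirror the loop in the question. Each page is searched twice here, which keeps the page-number step separate from the highlighting at the cost of a little extra work.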
Jorj McKie