Case-sensitive PDF highlighting using PyMuPDF and re

Question

The goal is a program that takes a PDF of a script and the name of a character, and outputs a script with only that character's lines (or at least their name) highlighted. In the way these scripts are typically formatted, I would want just "MISHA" highlighted, but not "Misha" in the italic stage directions.

I was able to get a version of this working with PyMuPDF, but it highlights every instance of the character's name.

The case-insensitive version was:

import fitz  # PyMuPDF

doc = fitz.open("HighlightTest.pdf")

character = input("Character name? ").upper()

for page in doc:
    ### SEARCH
    text_instances = page.searchFor(character)

    ### HIGHLIGHT
    for inst in text_instances:
        highlight = page.addHighlightAnnot(inst)
        highlight.update()

This spits out a PDF with every instance of "character" highlighted, as expected.

I found the following bit about case-sensitive searching from the PyMuPDF documentation:

"Note A feature repeatedly asked for is supporting regular expressions when specifying the "needle" string: There is no way to do this. If you need something in that direction, first extract text in the desired format and then subselect the result by matching with some regex pattern. Here is an example for matching words:"

pattern = re.compile(r"...")  # the regex pattern
words = page.get_text("words")  # extract words on page
matches = [w for w in words if pattern.search(w[4])]

So I'm trying to figure out how to implement this as follows:

doc = fitz.open("HighlightTest.pdf")
    
character = input("Character name? ").upper()

for page in doc:
    text = character
    words = page.get_text(character)  # extract words on page
    matches = [w for w in words if pattern.search(w[4])]

    for inst in matches:
        highlight = page.addHighlightAnnot(inst)
        highlight.update()

where pattern = re.compile("^"+character).

This gives the following error:

File "C:\Users\me\Desktop\Python Projects\highlighter.py", line 45, in <module>
    matches = [w for w in words if pattern.search(w[4])]
IndexError: string index out of range

Unsure how to proceed from here and would welcome any advice! I'm certain that what I have above is jank in many ways, so no proposed solution is too basic. Thanks!

Pannag · Accepted Answer · 2022-03-19T11:25:56.807

I came across this exact issue and solved it with another form of PyMuPDF's get_text call: get_text("words", sort=False).

See the documentation for more information: https://pymupdf.readthedocs.io/en/latest/textpage.html#TextPage.extractWORDS

This call returns a list of items, one per word. Each item holds the four rectangle coordinates, followed by the exact text of the word and three index numbers: (x0, y0, x1, y1, "word", block_no, line_no, word_no)
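
The IndexError in the question follows from this shape. Its message, "string index out of range", says that each w was a string rather than an 8-item tuple, which is what one would see if get_text was not called with "words" and handed back plain text instead. A minimal sketch of the difference, using hand-written sample data in place of a real page (the coordinates and text below are invented for illustration):

```python
# Hand-written stand-ins for what a page might return; all values are invented.
words = [
    (72.0, 100.0, 110.0, 112.0, "MISHA", 0, 0, 0),
    (72.0, 120.0, 105.0, 132.0, "Misha", 1, 0, 0),
]
plain_text = "MISHA\nMisha enters.\n"

# Iterating over word tuples: index 4 is the word itself.
texts_from_words = [w[4] for w in words]

# Iterating over a string yields single characters, so w[4] cannot exist.
try:
    [w[4] for w in plain_text]
    error_message = None
except IndexError as exc:
    error_message = str(exc)

print(texts_from_words)
print(error_message)
```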

You can then go through the returned items one by one and compare each word for an exact (case-sensitive) match. If you have already formed sentences to match against the PDF content, setting the sort argument to False keeps the words in their original order in the PDF, so you can check the words of your sentence sequentially against the word list.
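
The sequential check described above could be sketched roughly like this. The helper name find_phrase and the sample tuples are hypothetical, and the sketch assumes the word list is already in reading order and that the phrase splits on whitespace the same way the PDF words do:

```python
def find_phrase(words, phrase):
    """Return one list of word tuples per case-sensitive occurrence of phrase.

    `words` is assumed to be in reading order, with each item shaped like
    (x0, y0, x1, y1, "word", block_no, line_no, word_no).
    """
    target = phrase.split()
    texts = [w[4] for w in words]
    hits = []
    if not target:
        return hits
    # Slide a window of len(target) words over the page and compare exactly.
    for start in range(len(texts) - len(target) + 1):
        if texts[start:start + len(target)] == target:
            hits.append(words[start:start + len(target)])
    return hits


# Invented sample data standing in for one page of a script.
sample = [
    (72, 100, 100, 112, "LADY", 0, 0, 0),
    (104, 100, 160, 112, "MACBETH", 0, 0, 1),
    (72, 120, 100, 132, "Lady", 1, 0, 0),
    (104, 120, 160, 132, "Macbeth", 1, 0, 1),
    (164, 120, 200, 132, "enters.", 1, 0, 2),
]
hits = find_phrase(sample, "LADY MACBETH")
print(len(hits))
```

Each tuple inside a hit still carries its own coordinates, so every word of a matched phrase can be highlighted individually.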

For each match found, pass its rectangle coordinates to the highlighter with the following steps:

Convert the collected coordinates (the first four elements) to a Rect object with fitz.Rect(x0, y0, x1, y1).
Pass this object to page_obj.add_highlight_annot.

    import fitz  # PyMuPDF library

    pdf_file = fitz.open("<file_name>.pdf")  # create the PDF file object (placeholder name)
    pdf_page_count = pdf_file.page_count  # number of pages
    match_word = "MONTANA"  # exact, case-sensitive word to highlight
    for page in range(pdf_page_count):  # page indices start at 0
        page_obj = pdf_file[page]  # create the page object
        # Each item is (x0, y0, x1, y1, "word", block_no, line_no, word_no)
        content_of_page = page_obj.get_text("words", sort=False)
        for word in content_of_page:
            if word[4] == match_word:
                rect_comp = fitz.Rect(word[0], word[1], word[2], word[3])
                highlight = page_obj.add_highlight_annot(rect_comp)
                highlight.set_colors(stroke=[0, 1, 0.8])
                highlight.update()
    pdf_file.save("<output_name>.pdf")  # save the highlighted copy (placeholder name)
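
A possible variant, in case the name appears with punctuation attached (for example "MISHA:"), is to filter the word tuples with a case-sensitive, whole-word regular expression instead of ==. This is only a sketch: the function names, paths, and sample tuples are hypothetical. It also assumes the name starts and ends with a letter or digit, since \b would not behave as intended for a name such as "MR.".

```python
import re


def name_pattern(name):
    """Build a case-sensitive pattern that matches `name` as a whole word."""
    return re.compile(r"\b" + re.escape(name) + r"\b")


def matching_words(words, name):
    """Keep word tuples whose text (index 4) contains `name` as a whole word."""
    pattern = name_pattern(name)
    return [w for w in words if pattern.search(w[4])]


def highlight_name(in_path, out_path, name):
    """Highlight every case-sensitive whole-word match of `name` in a PDF."""
    import fitz  # PyMuPDF; imported here so the helpers above work without it

    doc = fitz.open(in_path)
    for page in doc:
        for w in matching_words(page.get_text("words"), name):
            annot = page.add_highlight_annot(fitz.Rect(w[0], w[1], w[2], w[3]))
            annot.update()
    doc.save(out_path)
    doc.close()


# Invented sample data: only the first entry should survive the filter.
sample = [
    (72, 100, 110, 112, "MISHA:", 0, 0, 0),
    (72, 120, 105, 132, "Misha", 1, 0, 0),
    (110, 120, 160, 132, "MISHANDLED", 1, 0, 1),
]
print([w[4] for w in matching_words(sample, "MISHA")])
```

With a real file this would be called as highlight_name("HighlightTest.pdf", "HighlightTest_out.pdf", "MISHA"), where the output name is made up. Note that the highlight covers the whole extracted word, so any attached punctuation is highlighted too.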

Thanks for this! Can you help me understand where the code you shared fits into a larger program? How should I open the PDF and separate the pages before the above function? Apologies if this is a basic question! — deep_node, Mar 19 '22 at 05:08
If you are using PyMuPDF, it has functions for each task, from opening the file all the way to highlighting. The best place for your highlighter code is right after you have extracted the text from the PDF, for example when you want to highlight specific case-sensitive words. I have updated the code in my answer to make this clearer; please refer to it. I would also suggest going through the PyMuPDF website, as I am sure you will want more features later. — Pannag, Mar 19 '22 at 11:11
You can also use a regular expression for an exact match. A word boundary is probably more suitable if you want to extract the word alone. — Pannag, Mar 20 '22 at 12:23
I'm back, one year on, to say a hearty THANK YOU. Picked this project back up and got it working tonight. Modified it to be able to browse for a file and input a list of character/actor pairs to highlight each of their scripts and spit out a pdf for each. Really really appreciate your help! — deep_node, Apr 29 '23 at 04:31
