The goal is a program that takes a PDF of a script and the name of a character, and outputs the script with only that character's lines (or at least their name) highlighted. In the way these scripts are typically formatted, I would want just "MISHA" highlighted, but not "Misha" in the italic stage directions.
I was able to get a version of this working with PyMuPDF, but it would highlight every instance of the character's name.
The case-insensitive version was:
import fitz  # PyMuPDF

doc = fitz.open("HighlightTest.pdf")
character = input("Character name? ").upper()
for page in doc:
    ### SEARCH
    text = character
    text_instances = page.searchFor(character)
    ### HIGHLIGHT
    for inst in text_instances:
        highlight = page.addHighlightAnnot(inst)
        highlight.update()
This produced a PDF with every instance of the character's name highlighted, regardless of case, as expected.
I found the following bit about case-sensitive searching from the PyMuPDF documentation:
"Note A feature repeatedly asked for is supporting regular expressions when specifying the "needle" string: There is no way to do this. If you need something in that direction, first extract text in the desired format and then subselect the result by matching with some regex pattern. Here is an example for matching words:"
pattern = re.compile(r"...") # the regex pattern
words = page.get_text("words") # extract words on page
matches = [w for w in words if pattern.search(w[4])]
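For context on why the documentation example tests w[4]: as far as I understand it, each entry returned by page.get_text("words") is a tuple of the form (x0, y0, x1, y1, word, block_no, line_no, word_no), so index 4 is the word text and the first four items are its bounding box. Here is a minimal sketch of that filtering step using hand-made tuples instead of a real page. The coordinates, the sample words, and the helper name select_words are made up for illustration.

```python
import re

# Hypothetical sample in the shape of page.get_text("words") output:
# (x0, y0, x1, y1, word, block_no, line_no, word_no)
sample_words = [
    (72.0, 100.0, 110.0, 112.0, "MISHA", 0, 0, 0),
    (72.0, 115.0, 95.0, 127.0, "Hello", 0, 1, 0),
    (72.0, 140.0, 105.0, 152.0, "Misha", 1, 0, 0),
    (110.0, 140.0, 140.0, 152.0, "exits.", 1, 0, 1),
]


def select_words(words, pattern):
    """Return the word tuples whose text (index 4) matches the pattern."""
    return [w for w in words if pattern.search(w[4])]


# Regex matching is case-sensitive unless re.IGNORECASE is passed
pattern = re.compile(r"^MISHA\b")
matches = select_words(sample_words, pattern)
print([w[4] for w in matches])
```

Only the all-caps "MISHA" entry should survive the filter, and its first four items are the rectangle that could later be highlighted.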
So I'm trying to figure out how to implement this as follows:
doc = fitz.open("HighlightTest.pdf")
character = input("Character name? ").upper()
for page in doc:
    text = character
    words = page.get_text(character) # extract words on page
    matches = [w for w in words if pattern.search(w[4])]
    for inst in matches:
        highlight = page.addHighlightAnnot(inst)
        highlight.update()
where pattern = re.compile("^"+character).
This gives the following error:
  File "C:\Users\me\Desktop\Python Projects\highlighter.py", line 45, in <module>
    matches = [w for w in words if pattern.search(w[4])]
  File "C:\Users\me\Desktop\Python Projects\highlighter.py", line 45, in <listcomp>
    matches = [w for w in words if pattern.search(w[4])]
IndexError: string index out of range
Unsure how to proceed from here and would welcome any advice! I'm certain that what I have above is jank in many ways, so no proposed solution is too basic. Thanks!
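One likely reading of the error, going by the message alone: page.get_text(character) passes the character name where the extraction format ("words") is expected, so the call appears to return a plain string rather than a list of word tuples. Iterating over a string yields single characters, and indexing a one-character string with [4] raises exactly this IndexError. Below is a rough sketch of a possible fix, not a tested solution. It assumes a recent PyMuPDF where the methods use snake_case names (get_text, add_highlight_annot) rather than the older camelCase ones used above, and that the character name is a single word. The helper names build_pattern and highlight_character are made up for illustration.

```python
import re


def build_pattern(character):
    """Case-sensitive pattern for a word that starts with the upper-cased name."""
    return re.compile("^" + re.escape(character.upper()) + r"\b")


def highlight_character(pdf_in, pdf_out, character):
    """Highlight upper-case occurrences of a one-word character name."""
    import fitz  # PyMuPDF; imported here so build_pattern works without it

    pattern = build_pattern(character)
    doc = fitz.open(pdf_in)
    for page in doc:
        # "words" is the extraction format; the name goes into the regex instead
        words = page.get_text("words")
        for w in words:
            if pattern.search(w[4]):
                # The first four items of each tuple are the word's bounding box
                page.add_highlight_annot(fitz.Rect(w[:4]))
    doc.save(pdf_out)


# Reproducing the error without a PDF: a plain string yields single characters
as_string = "MISHA\nHello.\n"
try:
    [c for c in as_string if build_pattern("misha").search(c[4])]
except IndexError as exc:
    print("IndexError:", exc)
```

A call might look like highlight_character("HighlightTest.pdf", "HighlightTest_marked.pdf", "Misha"), where the output filename is just a placeholder. Because the pattern is case-sensitive, "Misha" in stage directions would be skipped, though an all-caps "MISHA" appearing inside dialogue would still be highlighted.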