I have a PDF file that I am reading with the PyMuPDF package.
I extract the text and split it into chunks. For one passage on a page of the original PDF, the text I get in Python is:
text_variable = "cancer. Your team should include the following \nboard-certified experts:\n \n� A pulmonologist is a doctor who’s an \nexpert of lung diseases.\n \n� A thoracic radiologist is a doctor who’s \nan expert of imaging of the chest"
As you can see, some Unicode characters (apparently the list bullets) are not read correctly and come through as �.
Now I need to find this text on the PDF page and highlight those lines with a PyMuPDF highlight annotation. This is what I tried:
import fitz  # PyMuPDF

doc = fitz.open("/Users/abc.pdf")  # open the document
page = doc.load_page(13)  # page index is 0-based
# print(page.get_text())
text_variable = "cancer. Your team should include the following \nboard-certified experts:\n \n� A pulmonologist is a doctor who’s an \nexpert of lung diseases.\n \n� A thoracic radiologist is a doctor who’s \nan expert of imaging of the chest"
quads = page.search_for(text_variable, quads=True)
# Add a highlight annotation covering the returned quads
page.add_highlight_annot(quads)
As you would expect, search_for cannot find the text on the PDF page: the search string is not exactly the same as the page text, because of the � characters and the \n escape sequences.
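One direction I am considering is to stop searching for the whole chunk at once. Instead, split it on the newlines, drop the � characters, and search for each cleaned line separately, then highlight all the hits together. Below is a minimal sketch of that idea, not a confirmed fix. The names split_into_fragments, highlight_chunk and min_chars are hypothetical helpers of my own. It assumes that page is a PyMuPDF page object and that each cleaned line appears on the page as written.

```python
import re

# U+FFFD is the replacement character that shows up as "�" in the chunk.
_DEBRIS = re.compile("\ufffd")


def split_into_fragments(chunk, min_chars=3):
    """Split an extracted chunk into cleaned, line-sized search strings."""
    fragments = []
    for line in chunk.split("\n"):
        cleaned = _DEBRIS.sub(" ", line)      # drop unreadable characters
        cleaned = " ".join(cleaned.split())   # collapse runs of whitespace
        if len(cleaned) >= min_chars:         # skip empty or tiny leftovers
            fragments.append(cleaned)
    return fragments


def highlight_chunk(page, chunk):
    """Search each fragment on the page and highlight all hits.

    Returns the number of quads that were highlighted.
    """
    quads = []
    for fragment in split_into_fragments(chunk):
        # search_for returns a (possibly empty) list of quads per fragment
        quads.extend(page.search_for(fragment, quads=True))
    if quads:
        page.add_highlight_annot(quads)
    return len(quads)
```

With the page object from my attempt, this would be called as highlight_chunk(page, text_variable), followed by doc.save(...) to write the annotated file. One limitation I can already see: a short fragment may also match elsewhere on the same page, so unrelated lines could get highlighted too.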
Does anyone know how to make it work?
Thanks