
I have a PDF file that I am reading with the PyMuPDF package.

I extract the text and split it into chunks. For the passage shown in the screenshot below, taken from one page of the original PDF, the extracted text is as follows:

[Screenshot: the passage as it appears on the PDF page]

The text I have in Python:

text_variable = "cancer. Your team should include the following \nboard-certified experts:\n \n� A pulmonologist is a doctor who’s an \nexpert of lung diseases.\n \n� A thoracic radiologist is a doctor who’s \nan expert of imaging of the chest"

As you can see, it has trouble reading Unicode characters: the bullet points come out as �.

Now I need to find the above text on the PDF page and highlight those lines with an annotation in PyMuPDF. I tried the following:

import fitz  # PyMuPDF

doc = fitz.open("/Users/abc.pdf")  # open the document
page = doc.load_page(13)
# print(page.get_text())

text_variable = "cancer. Your team should include the following \nboard-certified experts:\n \n� A pulmonologist is a doctor who’s an \nexpert of lung diseases.\n \n� A thoracic radiologist is a doctor who’s \nan expert of imaging of the chest"

# Search the page for the whole chunk, asking for quads instead of rectangles
quads = page.search_for(text_variable, quads=True)

# Add a highlight annotation covering the returned quads
page.add_highlight_annot(quads)

As you would expect, the search finds nothing on the PDF page, because the string is not exactly the same as the page text due to the Unicode and escape-sequence issues.

Does anyone know how to make it work?

Thanks

Ajeet Verma
Baktaawar
  • You don't have an issue "reading unicode characters". Your PDF rather uses a character for the bullet point, for which there exists **_no backtranslation in the font._** In all those cases the "invalid unicode" `0xFFFD` is generated - that question mark in a black diamond. PyMuPDF supports highlighting multiple lines by specifying a start point and a stop point. Please consult the documentation. In this way, non-translatable characters are simply highlighted together with the rest. Look at examples here: https://github.com/pymupdf/PyMuPDF-Utilities/tree/master/word%26line-marking. – Jorj McKie Jun 20 '23 at 13:12
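
A minimal sketch of the start/stop approach described in the comment above, not a tested solution for this particular PDF. It assumes that the first and last clean lines of the chunk can each be found on the page with `page.search_for`, and that `page.add_highlight_annot` accepts `start` and `stop` points as the PyMuPDF documentation describes. The helper names `anchor_fragments` and `highlight_span` are made up for illustration.

```python
REPLACEMENT = "\ufffd"  # the "invalid unicode" character produced for the bullets


def anchor_fragments(chunk, min_len=8):
    """Return the first and last clean single-line fragments of a chunk.

    The chunk is split on newlines and on U+FFFD, so the returned
    fragments contain neither and can be searched for on the page.
    min_len is an arbitrary threshold to skip very short pieces.
    """
    pieces = []
    for line in chunk.split("\n"):
        for part in line.split(REPLACEMENT):
            part = " ".join(part.split())  # collapse runs of whitespace
            if len(part) >= min_len:
                pieces.append(part)
    if not pieces:
        raise ValueError("no searchable fragment found in chunk")
    return pieces[0], pieces[-1]


def highlight_span(page, chunk):
    """Highlight everything between the first and last fragment of chunk.

    Returns the new annotation, or None if either fragment is not found.
    Assumption: the first hit for the opening fragment is the right one.
    """
    first, last = anchor_fragments(chunk)
    start_hits = page.search_for(first)
    stop_hits = page.search_for(last)
    if not start_hits or not stop_hits:
        return None
    start_rect = start_hits[0]
    # Take the first closing hit that does not lie above the opening hit.
    stop_rect = next((r for r in stop_hits if r.y1 >= start_rect.y0), None)
    if stop_rect is None:
        return None
    # Text between the two points is highlighted, untranslatable bullets included.
    return page.add_highlight_annot(start=start_rect.tl, stop=stop_rect.br)


# Usage, with the file and page number from the question:
#   import fitz
#   doc = fitz.open("/Users/abc.pdf")
#   page = doc.load_page(13)
#   highlight_span(page, text_variable)
#   doc.save("highlighted.pdf")  # hypothetical output name
```

With this approach the chunk never has to match the page text character for character. Only the two anchor fragments do, and the bullets are simply highlighted along with everything between them. If the passage runs across columns, passing a `clip` rectangle to `add_highlight_annot` may be needed to keep the highlight inside one column.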
