I'm currently using PyMuPDF to extract text blocks from a file in Python:
import fitz  # PyMuPDF
doc = fitz.open(filename)
for page in doc:
    # Each block is a tuple; index 4 holds the block's text
    text = page.get_text("blocks")
    for item in text:
        print(item[4])
The problem is that drop caps are not recognized correctly. For example, a drop-cap "N" comes out split across multiple lines as:
£ £ "1L
^ L I
JL
^1
I thought it might be an encoding problem, so I tried UTF-8 encoding as follows:
text = page.get_text().encode("utf8")
However, the problem is still the same. How can I solve this? Thanks in advance!
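In case it helps narrow this down, here is a minimal sketch of how the spans on a page could be inspected to see whether the drop cap is stored as text with its own font and size. It assumes the nested blocks/lines/spans layout returned by page.get_text("dict"). The helper names oversized_spans and report, and the ratio threshold of 2.0, are made up for illustration and are not part of PyMuPDF.

```python
from statistics import median


def oversized_spans(page_dict, ratio=2.0):
    """Return (text, font, size) for spans much larger than the median span size.

    page_dict is assumed to have the shape returned by page.get_text("dict").
    ratio is an arbitrary threshold chosen for illustration.
    """
    spans = [
        span
        for block in page_dict.get("blocks", [])
        if block.get("type") == 0  # 0 = text block; image blocks carry no "lines"
        for line in block.get("lines", [])
        for span in line.get("spans", [])
        if span.get("text", "").strip()  # skip whitespace-only spans
    ]
    if not spans:
        return []
    typical = median(span["size"] for span in spans)
    return [
        (span["text"], span["font"], span["size"])
        for span in spans
        if span["size"] >= ratio * typical
    ]


def report(filename):
    """Print the oversized spans found on each page of the given file."""
    import fitz  # PyMuPDF; imported here so the helper above works without it

    doc = fitz.open(filename)
    for page in doc:
        for text, font, size in oversized_spans(page.get_text("dict")):
            print(page.number, repr(text), font, size)
```

Calling report(filename) would list any unusually large spans per page. If the drop cap appears there with its own font, the odd characters probably come from how that font maps glyphs to text. If nothing shows up, the drop cap may be stored as an image or a drawing instead of text, though that is only a guess on my part.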