
I'm currently using PyMuPDF to extract text blocks from a file in Python.

import fitz  # PyMuPDF

doc = fitz.open(filename)

for page in doc:
    # "blocks" returns tuples: (x0, y0, x1, y1, text, block_no, block_type)
    blocks = page.get_text("blocks")

    for block in blocks:
        print(block[4])  # the text content of the block

The problem is that drop caps are recognized weirdly. For example, the letter "N" is recognized across multiple lines as:

£ £ "1L
^ L I
JL 
^1

I thought it might be an encoding problem, so I tried UTF-8 encoding as follows:

text = page.get_text().encode("utf8") 

However, the problem is still the same. How can I solve this? Thanks in advance!

Esraa Abdelmaksoud

1 Answer

That is "perfect" output, in the sense that it is exactly how the OCR ("One Character Replace at a time") engine wrote the PDF. The only way to correct it is to do your own character replacement.
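In Python, that replacement could be a simple lookup table applied to each extracted block. A minimal sketch follows; the garbled-to-correct mappings below are hypothetical examples based on the output shown in the question, and a real table would have to be built by inspecting the bad output of each document:

```python
# Hypothetical mapping of garbled drop-cap fragments to their real characters.
# Real entries must be collected by inspecting each document's bad output.
GARBLED_MAP = {
    '£ £ "1L': "N",
    "^ L I": "",
    "JL": "",
    "^1": "",
}

def clean_block(text: str) -> str:
    """Apply the replacement table to one extracted text block."""
    for bad, good in GARBLED_MAP.items():
        text = text.replace(bad, good)
    return text

print(clean_block('£ £ "1L ow the story begins'))  # -> N ow the story begins
```

The leftover stray space after the replaced drop cap would still need normalizing (e.g. with a regex), since the drop cap and the rest of its word are usually separate spans in the PDF.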

How to correct the text depends on the means at your disposal. In a web browser, the PDF can be rendered as HTML and edited to delete the unwanted characters, so that could be scripted from Python using a Puppeteer-style approach (very unwieldy).

A simpler alternative is to use Python to export the base text, then edit it in a word processor to add styling such as re-injecting the missing drop cap. Simpler still: import the PDF directly into MS Office or OpenOffice and use their native styling and spell checking, without Python at all.


K J
  • Thanks for your reply! Actually, I'm trying to tackle the problem in python, not a certain software. Do you know a package that can give the same output? – Esraa Abdelmaksoud Mar 11 '23 at 13:34