
I'm currently using PyMuPDF to extract text blocks from a file in Python.

import fitz  # PyMuPDF

doc = fitz.open(filename)

for page in doc:
    # "blocks" returns tuples: (x0, y0, x1, y1, text, block_no, block_type)
    blocks = page.get_text("blocks")

    for block in blocks:
        print(block[4])  # the text content of the block

The problem is that drop caps are recognized weirdly. For example, the letter "N" is recognized across multiple lines as:

£ £ "1L
^ L I
JL 
^1

I thought it might be an encoding problem, so I tried UTF-8 encoding as follows:

text = page.get_text().encode("utf8") 

However, the problem is still the same. How can I solve this? Thanks in advance!

Esraa Abdelmaksoud

1 Answer

That is "perfect" output, in the sense that it is exactly how the OCR ("One Character Replace at a time") engine wrote the PDF. The only way to correct it is to do your own character replacement.
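In Python, that replacement could be a simple lookup table applied to each extracted block. A minimal sketch follows; the garbled-to-correct mappings below are hypothetical examples based on the output shown in the question, and a real table would have to be built by inspecting the bad output of each document:

```python
# Hypothetical mapping of garbled drop-cap fragments to their real characters.
# Real entries must be collected by inspecting each document's bad output.
GARBLED_MAP = {
    '£ £ "1L': "N",
    "^ L I": "",
    "JL": "",
    "^1": "",
}

def clean_block(text: str) -> str:
    """Apply the replacement table to one extracted text block."""
    for bad, good in GARBLED_MAP.items():
        text = text.replace(bad, good)
    return text

print(clean_block('£ £ "1L ow the story begins'))  # -> N ow the story begins
```

The leftover stray space after the replaced drop cap would still need normalizing (e.g. with a regex), since the drop cap and the rest of its word are usually separate spans in the PDF.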

How to correct the text depends on the means at your disposal. In a web browser, the PDF can be rendered as HTML and edited to delete the unwanted characters, so that could be scripted from Python using a Puppeteer-style approach (very unwieldy).

A simpler alternative is to use Python to export the base text, then edit it in a word processor to add styling such as re-injecting the missing drop cap. Simpler still: import the PDF directly into MS Office or OpenOffice and use their native styling and spell checking, without Python at all.


K J
  • Thanks for your reply! Actually, I'm trying to tackle the problem in python, not a certain software. Do you know a package that can give the same output? – Esraa Abdelmaksoud Mar 11 '23 at 13:34