How to handle ligature issues when extracting text from a PDF

Question

I need to capture some text from some PDFs, and I use PyMuPDF to do this. However, I run into a ligature issue when writing the selected text to a text file.

I use the following code snippet to read the PDF:

import fitz  # PyMuPDF

pdf = fitz.open("file_path")
full_text = ""
for page_n in range(pdf.page_count):
    page = pdf.load_page(page_n)
    full_text += page.get_text()
pdf.close()

# do some operations to get the desired text
desire_text = ...

And I use the following code snippet to write it to a text file:

with open('output.txt', 'w') as f:
    f.write(desire_text)

but got the error:

---------------------------------------------------------------------------
UnicodeEncodeError                        Traceback (most recent call last)
     33 with open('output.txt', 'w') as f:
---> 34     f.write(desire_text)

File c:\Python311\Lib\encodings\cp1252.py:19, in IncrementalEncoder.encode(self, input, final)
     18 def encode(self, input, final=False):
---> 19     return codecs.charmap_encode(input,self.errors,encoding_table)[0]

UnicodeEncodeError: 'charmap' codec can't encode character '\ufb02' in position 2: character maps to <undefined>

I know that the PDF contains some ligatures like ﬄ, which cause the issue. I can replace them manually using string replacement, but I don't think handling this by hand is efficient for large PDFs.
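
The traceback shows that `open('output.txt', 'w')` used cp1252 as its codec, and cp1252 has no mapping for the ligature U+FB02. A minimal sketch of one way around this, assuming UTF-8 output is acceptable, is to pass `encoding` explicitly. The sample string and the temporary output path below are made up for illustration and are not taken from the PDF in question.

```python
import os
import tempfile

# Hypothetical sample standing in for text extracted from a PDF.
desire_text = "in\ufb02uence"  # contains the ligature U+FB02

# cp1252 cannot represent U+FB02, which reproduces the error above.
try:
    desire_text.encode("cp1252")
    cp1252_failed = False
except UnicodeEncodeError:
    cp1252_failed = True

# Writing with an explicit UTF-8 encoding keeps the ligature intact.
out_path = os.path.join(tempfile.mkdtemp(), "output.txt")
with open(out_path, "w", encoding="utf-8") as f:
    f.write(desire_text)

# Read the file back with the same encoding to check the round trip.
with open(out_path, encoding="utf-8") as f:
    round_trip = f.read()
```

This keeps `desire_text` a `str` throughout, so regex filtering or tokenizing before the write still works. The ligature characters remain in the output file.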

You either have to request that ligatures be decomposed by **_switching off_** the bit TEXT_PRESERVE_LIGATURES in the flags parameter of the extraction, **_or_** save the text output as a binary file, e.g. like so (using pathlib): `pathlib.Path("output.txt").write_bytes(desire_text.encode())`. — Jorj McKie, Aug 18 '23 at 15:20
The default for `.encode()` is "UTF-8", which is correct here too. But go ahead and add the parameter "utf8" or similar; maybe that is more "explicit" (Zen of Python). — Jorj McKie, Aug 18 '23 at 15:33
Another issue I face: before saving the text, I apply some regexes and other operations to filter it. Now I get `TypeError: a bytes-like object is required, not 'str'`. @JorjMcKie it seems the best way to solve my problem is to handle those ligatures first, then do the other operations (e.g. tokenizing the text), and then save it. — WhyMeasureTheory, Aug 18 '23 at 15:35
Well, then you have to swallow the toad and use a text extraction flag which decomposes ligatures: `page.get_text(flags=fitz.TEXTFLAGS_TEXT & ~fitz.TEXT_PRESERVE_LIGATURES)`. But please look it up in the documentation in case I made a spelling error. — Jorj McKie, Aug 18 '23 at 15:40
"Why measure theory ?" Because the measure of the Riemann integral is for beginners only (Dieudonné). — Jorj McKie, Aug 18 '23 at 15:43
It daunted me for 1 year @JorjMcKie, but things have been sorted out now. Your solution works for my case. Thanks, buddy. — WhyMeasureTheory, Aug 18 '23 at 15:47
Let us [continue this discussion in chat](https://chat.stackoverflow.com/rooms/254962/discussion-between-jorj-mckie-and-whymeasuretheory). — Jorj McKie, Aug 18 '23 at 21:52
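
Putting the flag suggested in the comments together with the reading loop from the question might look like the sketch below. The function name `extract_text_without_ligatures` is hypothetical, and the flag expression is the one quoted in the comment thread; it is worth checking against the PyMuPDF documentation for the installed version.

```python
def extract_text_without_ligatures(pdf_path):
    """Return the text of all pages with ligatures decomposed."""
    # Imported here so the sketch can be loaded without PyMuPDF installed.
    import fitz  # PyMuPDF

    # Default text-extraction flags with the "preserve ligatures" bit cleared.
    flags = fitz.TEXTFLAGS_TEXT & ~fitz.TEXT_PRESERVE_LIGATURES

    pdf = fitz.open(pdf_path)
    try:
        # Concatenate the plain text of every page.
        return "".join(page.get_text("text", flags=flags) for page in pdf)
    finally:
        pdf.close()
```

A call such as `full_text = extract_text_without_ligatures("file_path")` returns a `str`, so the regex and tokenizing steps can run on it before the result is written out.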
