RTL (Arabic) ligatures problem when extracting text from PDF

Question

When extracting Arabic text from a PDF file using libraries like PyMuPDF or PDFMiner, the text comes back in reverse order, which is normal behavior for RTL languages, and you need to apply the bidi algorithm to display it correctly in a UI/GUI.

The problem arises with ligature characters that are composed of two characters: these ligatures are not reversed, which makes the extracted text inaccurate.

Here's an example:

Let's say we have a font with a ligature glyph "لا" that maps to "uni0644 uni0627". The PDF is rendered like this:

When you extract the PDF text using the library's text extraction method, you get this:

كارتــــــشلاا

Notice how all chars are in reverse order except "لا".

And here's the final result after applying the bidi algorithm:

االشــــــتراك
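The mismatch can be reproduced without any PDF library. The sketch below is only an illustration under two assumptions: plain string reversal stands in for the bidi step (reasonable only for a run made entirely of RTL characters), and `expected` is the word assumed to be intended, with "لا" in its logical order. The strings are written with escapes so that an editor's own bidi rendering cannot hide the stored order.

```python
TATWEEL = "\u0640" * 6  # the six elongation characters in the example word

# Extracted text exactly as shown above: kaf, alef, reh, teh, tatweels,
# sheen, lam, alef, alef (the ligature was expanded as lam + alef, unreversed).
extracted = "\u0643\u0627\u0631\u062A" + TATWEEL + "\u0634\u0644\u0627\u0627"

# Assumption: for a purely RTL run, bidi reordering amounts to a reversal.
after_bidi = extracted[::-1]

# Assumed intended word in logical order: alef, lam, alef, sheen, ...
expected = "\u0627\u0644\u0627\u0634" + TATWEEL + "\u062A\u0631\u0627\u0643"

# Positions where the reordered text differs from the assumed intended word.
mismatches = [i for i, (got, want) in enumerate(zip(after_bidi, expected)) if got != want]
print(after_bidi == expected, mismatches)
```

Only the two positions covered by the ligature differ, which is what makes the error easy to miss in longer text.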

Am I missing something? Is there a workaround that fixes this without hitting false positives and breaking them, or should I write my own implementation that correctly handles ligature decomposition in bidirectional text?

score 1 · Answer 1 · answered Jan 30 '23 at 05:11

Most likely, the actual text on the PDF page isn't Unicode but font CIDs (each identifying the glyph used), and the program converting the CIDs to Unicode doesn't take RTL into account.

Here's an example using RTL with English (sorry). Suppose the word "fire" is rendered RTL as "erif" with 3 glyphs: e, r, and fi (through arbitrary CIDs, perhaps \001\002\003). If the CIDs are used to look up the Unicode information, and the "fi" ligature is de-ligatured, you'll get "erfi" as the data.

In this case, there's no way of knowing that the 'f' and 'i' characters should actually compose a ligature and be flipped around. I'm assuming that's the case for these Arabic characters.
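The "fire" example can be written out as a small simulation. This is a sketch, not what any particular library does: `TO_UNICODE` is a hypothetical stand-in for a CID-to-Unicode mapping, and the CID values are the arbitrary ones from the example.

```python
# Hypothetical CID -> Unicode table; CID 3 is the "fi" ligature glyph.
TO_UNICODE = {1: "e", 2: "r", 3: "fi"}

# "fire" drawn right to left: the glyphs appear in visual order e, r, fi.
visual_cids = [1, 2, 3]


def flatten(cids):
    """Expand each glyph in visual order, as an RTL-unaware extractor would."""
    return "".join(TO_UNICODE[cid] for cid in cids)


def recompose(cids):
    """Reverse the glyph order but keep each glyph's expansion intact."""
    return "".join(TO_UNICODE[cid] for cid in reversed(cids))


flattened = flatten(visual_cids)    # the ligature is expanded but not flipped
reversed_text = flattened[::-1]     # reversing characters splits the ligature
recomposed = recompose(visual_cids)

print(flattened, reversed_text, recomposed)
```

Once the text has been flattened to a plain string, the glyph boundaries are gone, which is why reversing it character by character cannot recover the word, while working from the CIDs can.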

It's unlikely that the tools you're using know anything about RTL or are going to be much help here. You'll need different tools, or an approach that gets you the CIDs directly so you can recompose the Unicode in the correct order.

dirck

This is exactly what's happening. The example you provided in English is still relevant. However, I think it should be possible to consult the cmap table of the embedded font, or the ToUnicode object stored in the PDF, to detect ligatures or other glyphs that map to more than one Unicode code point, and then reverse them when the text or the language is RTL. – Naourass Derouichi Jan 30 '23 at 12:21
You still have the logical problem of "are there legal cases of the 'if' sequence that aren't the ligature 'fi'?". This would be true in the pseudo-English example. You may need the CIDs used in the rendering stream (text drawing instructions) to know the difference. – dirck Jan 30 '23 at 17:33
I've found a workaround and posted it [here](https://github.com/py-pdf/pypdf/issues/1589). Same logic could be applied in other libs to fix this issue. – Naourass Derouichi Jan 31 '23 at 05:39
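
The idea discussed in these comments, flipping any glyph that maps to more than one code point before the bidi step, might look roughly like the sketch below. It is not taken from the linked issue and makes two assumptions: the extractor can hand back one string per glyph (hypothetical `glyph_texts`, in visual order), and plain reversal stands in for bidi reordering of a purely RTL run. Working per glyph also sidesteps the ambiguity raised above, since a lam followed by an alef that came from two separate glyphs is left alone.

```python
import unicodedata


def is_rtl(text):
    """True if any character has a right-to-left bidi class (R or AL)."""
    return any(unicodedata.bidirectional(ch) in ("R", "AL") for ch in text)


def flip_multichar_glyph(mapped):
    """Reverse a glyph's expansion only if it is multi-character and RTL."""
    if len(mapped) > 1 and is_rtl(mapped):
        return mapped[::-1]
    return mapped


# Hypothetical per-glyph expansions in visual order for the word in the
# question; the ligature glyph expands to lam + alef ("\u0644\u0627").
glyph_texts = (
    ["\u0643", "\u0627", "\u0631", "\u062A"]
    + ["\u0640"] * 6
    + ["\u0634", "\u0644\u0627", "\u0627"]
)

visual = "".join(flip_multichar_glyph(g) for g in glyph_texts)

# Assumption: reversal stands in for bidi reordering of a purely RTL run.
logical = visual[::-1]
print(logical)
```

With the ligature flipped first, the reordered string has alef, lam, alef, sheen at the start instead of the doubled alef.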
