When extracting Arabic text from a PDF file using librairies like PyMuPDF or PDFMiner, the words are returned in backward order which is a normal behavior for RTL languages, and you need to use bidi algorithm to be able to display it correctly across UI/GUIs.
The problem is when you have ligatures chars that are composed of two chars, these ligatures chars are not reversed which makes the extracted text inaccurate.
Here's an example :
Let's say we have a font with a ligature glyph "لا" that maps to "uni0644 uni0627". The pdf is rendered like this:
When you extract the pdf text using the library text extraction method, you get this:
كارتــــــشلاا
Notice how all chars are in reverse order except "لا".
And here's the final result after applying bidi algorithm:
االشــــــتراك
Am I missing something? Is there any workaround to fix this without detecting false positives and breaking them, or should I write my own implementation that correctly handles ligatures decomposition in bidirectional text?