Possible mixed-encoding string: how to find the missing character?

Question

I have a PDF document that is mostly Farsi, but also includes some Latin characters with diacritics above or below, taken from the DMG's DIN 31635 standard for the transliteration of the Arabic alphabet. The PDF displays both the LTR and the RTL text mostly correctly, and both can mostly be extracted correctly with tools such as TET (commercial) or Ghostscript's gs.

Unfortunately, individual characters are broken: in some (but not all) places where there should be a ḍ¹ (E1 B8 8D), Pango instead renders a white rectangle containing 00 1F ("information separator one"). I'm trying to use that character as an example for figuring out a way to correct all the other broken ones.

The other bytes of the word, before and after the one in question, are 20 (space), 77 (w) and 75 (u), then the 1F itself, then C5 (Å), AB («), E2 (â), 80 (€) and 99 (™), then finally 20 (space) before the next word begins. The word itself, if it had been properly entered and were being properly displayed, would be wuḍūʼ (the DMG representation of وضوء, meaning the ritual washing before prayer in Islam).

Gedit shows it like so: (screenshot not preserved)

Going through character by character, 77 and 75 are valid ASCII as well as valid Unicode Basic Latin. Then there is what certainly looks to be "latin small letter u with macron" (01 6B), but I see no 0, 1, or 6 in the hex representation. The same goes for the last character, which looks to be "right single quotation mark" (20 19), if only there were a 1 anywhere after C5. None of 1F C5 (???), AB E2 (�) or 80 99 (�) make any sense, either.

Two questions:

Where am I failing in my understanding of how this string is composed?
Based on the context, is there any way to find out what character encoding the "missing" character might originally (on the PDF's author's computer, perhaps) have been produced from?

—

¹ Stack Exchange seems to filter the character out, so I uploaded a file containing only the word discussed.

From what you say, I would think there is a bug either in pango, or in the font, or in the file when it was encoded. — Giacomo Catenazzi, Oct 24 '19 at 12:26
I don't care about the rendering, perhaps my phrasing can be misunderstood here. I just want to know what character(s) from what encoding(s) can be expressed by `1FC5ABE28099`, i.e.: between which bits of that sequence are the character boundaries and which character could have come from which encoding (and yes, I'm suspecting multiple encodings because of a problem when the file was saved, like you suggest). — Sixtyfive, Oct 24 '19 at 16:03
`C5AB` is `ū` in UTF-8, `1F` is a control character (which Pango displays as `[00|1F]`), and `E28099` is `’` (U+2019) in UTF-8, so only the `1F` is unexplained. If you want to understand it better, you should go a step earlier. What are the encoded values of `wu`? Maybe there are escapes (e.g. `1B`), as in ISO 2022. Could you also check how the Arabic characters are encoded? UTF-8? ISO-2022? UTF-16? Such problems need a lot of detective work. — Giacomo Catenazzi, Oct 25 '19 at 07:26
It'll take a few days until I can look into it again. Thank you for your comment so far, Giacomo! — Sixtyfive, Oct 28 '19 at 13:12
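
One way to check the boundary question from the comments is to let a strict UTF-8 decoder split the bytes, and then ask which codecs could turn ḍ into a lone `1F`. The following is only a minimal sketch under stated assumptions: the helper names `utf8_boundaries` and `codecs_encoding_as` are made up for illustration, the byte string is the one quoted in the question, and the codec search only covers codecs bundled with Python, so an empty result would not rule out a font-specific or custom PDF encoding.

```python
import unicodedata
from encodings.aliases import aliases


def utf8_boundaries(raw: bytes):
    """Split raw bytes into characters with a strict UTF-8 decoder.

    Returns (utf8_bytes_hex, code_point, name) per character; raises
    UnicodeDecodeError if the input is not valid UTF-8.
    """
    text = raw.decode("utf-8")
    return [
        (
            ch.encode("utf-8").hex(" ").upper(),
            f"U+{ord(ch):04X}",
            unicodedata.name(ch, "<unnamed>"),  # control characters have no name
        )
        for ch in text
    ]


def codecs_encoding_as(char: str, target: bytes):
    """List bundled Python codecs that encode `char` to exactly `target`."""
    hits = []
    for codec in sorted(set(aliases.values())):
        try:
            if char.encode(codec) == target:
                hits.append(codec)
        except (LookupError, ValueError):
            # Codec unavailable on this platform, not a text encoding,
            # or unable to represent the character at all.
            continue
    return hits


# The bytes of the word as given in the question (without the spaces).
word = bytes.fromhex("77 75 1F C5 AB E2 80 99")
for row in utf8_boundaries(word):
    print(*row)

# Reading the same trailing bytes as Windows-1252 instead of UTF-8.
print(word[3:].decode("cp1252"))

# Hypothetical search: does any bundled codec map ḍ (U+1E0D) to 1F?
print(codecs_encoding_as("\u1e0d", b"\x1f"))
```

Decoded this way, the boundaries fall at `77 | 75 | 1F | C5 AB | E2 80 99`: the UTF-8 byte values differ from the code point digits, which is why no 0, 1 or 6 shows up for U+016B. The Windows-1252 line reproduces the Å, «, â, €, ™ readings listed in the question, so those labels are the same UTF-8 bytes viewed through a single-byte code page.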
