I have a PDF document that is mostly Farsi, but also includes some Latin characters with diacritics above or below, following the DMG's DIN 31635 standard for the transliteration of the Arabic alphabet. The PDF mostly displays both the LTR and the RTL text correctly, and both can mostly be extracted correctly by tools such as TET (commercial) or Ghostscript's gs.
Unfortunately, individual characters are broken: in some (but not all) places where there should be a ḍ¹ (E1 B8 8D), Pango instead renders a white rectangle containing 00 1F ("information separator one"). I'm trying to use that one as an example for figuring out a way to correct all the other broken ones.
The other bytes of the word, before and after the one in question, are 20 (space), 77 (w) and 75 (u), then the 1F itself, then C5 (Å), AB («), E2 (â), 80 (€) and 99 (™), then finally 20 (space) before the next word begins. The word itself, if it had been properly entered and were being properly displayed, would be wuḍūʼ (the DMG representation of وضوء, meaning the ritual washing before prayer in Islam).
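For reference, here is a minimal Python sketch of one way to inspect such a byte sequence. It assumes the bytes are UTF-8, which is only a guess at this point. The helper name `describe` is made up for illustration. It decodes the bytes of the word (without the surrounding spaces) and lists the resulting code points with their Unicode names.

```python
import unicodedata

# Bytes of the word as listed above, without the surrounding 20 (space) bytes.
raw = bytes.fromhex("77 75 1F C5 AB E2 80 99")

# Assumption: the data is UTF-8. errors="replace" substitutes U+FFFD for any
# invalid input instead of raising an exception.
text = raw.decode("utf-8", errors="replace")


def describe(s):
    """Return a list of (code point, Unicode name) pairs, one per character."""
    return [
        (f"U+{ord(ch):04X}", unicodedata.name(ch, "<unnamed control>"))
        for ch in s
    ]


for code_point, name in describe(text):
    print(code_point, name)
```

Under that assumption the eight bytes decode to five characters, with 1F surviving as a single control character between the u and the ū.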
Gedit shows it like so:
Going through character by character, 75 and 77 are valid ASCII as well as valid Unicode Basic Latin. Then there is what certainly looks to be "latin small letter u with macron" (01 6B), but I see no 0, 1, or 6 in the hex representation. The same goes for the last character, which looks to be "right single quotation mark" (20 19), if only that byte pair appeared anywhere after C5. None of 1F C5 (???), AB E2 (�) or 80 99 (�) make any sense, either.
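If the file is UTF-8 (again an assumption), the hex digits of a code point are not expected to appear literally in the bytes: UTF-8 spreads the code point's bits over several bytes, each with fixed marker bits in front. A rough sketch of that arithmetic for two- and three-byte sequences follows. The function name `decode_utf8_sequence` is hypothetical, and no validation of the marker bits is attempted.

```python
def decode_utf8_sequence(seq: bytes) -> int:
    """Decode one 2- or 3-byte UTF-8 sequence into its code point.

    Sketch only: the lead and continuation marker bits are not validated.
    """
    if len(seq) == 2:
        # Layout: 110xxxxx 10xxxxxx -> 5 + 6 payload bits
        return ((seq[0] & 0x1F) << 6) | (seq[1] & 0x3F)
    if len(seq) == 3:
        # Layout: 1110xxxx 10xxxxxx 10xxxxxx -> 4 + 6 + 6 payload bits
        return ((seq[0] & 0x0F) << 12) | ((seq[1] & 0x3F) << 6) | (seq[2] & 0x3F)
    raise ValueError("only 2- and 3-byte sequences are handled in this sketch")


print(f"{decode_utf8_sequence(bytes([0xC5, 0xAB])):04X}")        # 016B
print(f"{decode_utf8_sequence(bytes([0xE2, 0x80, 0x99])):04X}")  # 2019
print(f"{decode_utf8_sequence(bytes([0xE1, 0xB8, 0x8D])):04X}")  # 1E0D
```

Read this way, C5 AB and E2 80 99 are complete sequences on their own, so pairing the bytes as 1F C5, AB E2 and 80 99 splits them at the wrong boundaries.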
Two questions:
- Where am I failing in my understanding of how this string is composed?
- Based on the context, is there any way to find out what character encoding the "missing" character might originally (on the PDF's author's computer, perhaps) have been produced from?
—
¹ StackExchange seems to filter the funny character out; I uploaded a file containing only the word discussed.