Thai character not rendered correctly in PDF

Question

My app should be able to output a PDF file containing the user guide in several supported languages. (I'm using pdfkit)

I had some trouble finding a suitable font for Thai: some fonts that supposedly support Thai (including Noto Thai from Google) would output squares, question marks or, even worse, unreadable garbage.

After a bit of research, I found one that seemed to work reasonably well, until our Thai guy noted that the characters

ต่ำ

were rendered as in the picture below, basically with the two marks above the first character collapsed, one covering the other.

I'm using the Nimbus Sans Thai family downloaded from myfonts.com, which, by the way, does seem able to render those characters correctly, as you can see by copy-pasting ต่ำ into the preview input.

Any hints?

Be Brave Be Like Ukraine · Answer 1 · 2020-09-10T16:31:40.347

Your font is incomplete in a certain way. It lacks some glyphs that usually reside in the Private Use Area (PUA) of Unicode.
Some applications (Microsoft Word is one I'm aware of) can work around this problem themselves, but your rendering app (and Adobe Acrobat Viewer) does not.
You should either find a font in which these glyphs are present, or find an application that repositions the existing glyphs itself.

Many fonts, despite claiming to support Thai (and they do indeed contain the "regular" Thai glyphs), can be incomplete.

Besides the canonical glyphs, a well-formed font should contain a "Private Use Area" (PUA) subrange holding glyphs in non-canonical forms. Those glyphs include:

Tone marks shifted to the upper position for use in combination with upper vowels (SARA_I, SARA_UE, etc.), and shifted to a lower position for Consonant + Tone Mark with no upper vowel;
Tone marks and upper vowels shifted slightly to the left for use in combination with PO_PLA, FO_FAN, etc. (otherwise they would overlap the consonant's upper tail);
Both effects combined, e.g. a tone mark shifted down and to the left at the same time;
Special glyphs for YO_YING and THO_THAN (with no tail) for use in combination with below-vowels;
Several more.

Normally, when a rendering app finds the symbol combinations mentioned above, it looks for substitute glyphs in the PUA. If none is found, it simply falls back to the default glyph, which is what happens in your case.
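As a rough illustration of the context rules listed above, here is a minimal Python sketch that decides which variant of a tone mark a renderer would want at a given position. The function name `tone_mark_variant` and the labels it returns ("high", "low", "high-left", "low-left") are made up for this example; it does not look up real PUA code points, and it covers only the consonants and vowels named in the list. Treating a following SARA AM as an upper vowel is my assumption, based on its circle sitting above the consonant.

```python
# Standard Unicode Thai code points (not PUA values).
ASCENDER_CONSONANTS = {"\u0e1b", "\u0e1d", "\u0e1f"}  # PO PLA, FO FA, FO FAN
UPPER_VOWELS = {"\u0e31", "\u0e34", "\u0e35", "\u0e36", "\u0e37"}  # MAI HAN-AKAT, SARA I/II/UE/UEE
LOWER_VOWELS = {"\u0e38", "\u0e39", "\u0e3a"}  # SARA U, SARA UU, PHINTHU
TONE_MARKS = {"\u0e48", "\u0e49", "\u0e4a", "\u0e4b"}  # MAI EK, THO, TRI, CHATTAWA
SARA_AM = "\u0e33"


def tone_mark_variant(text: str, i: int) -> str:
    """Return a hypothetical label for the glyph variant needed at text[i]."""
    if text[i] not in TONE_MARKS:
        raise ValueError("text[i] is not a Thai tone mark")

    # Walk back over any vowel signs to find the base consonant.
    j = i - 1
    has_upper = False
    while j >= 0 and text[j] in UPPER_VOWELS | LOWER_VOWELS:
        has_upper = has_upper or text[j] in UPPER_VOWELS
        j -= 1

    # Assumption: SARA AM right after the tone mark counts as an upper vowel.
    if i + 1 < len(text) and text[i + 1] == SARA_AM:
        has_upper = True

    base = text[j] if j >= 0 else ""
    vertical = "high" if has_upper else "low"
    return vertical + ("-left" if base in ASCENDER_CONSONANTS else "")
```

For the string from the question, `tone_mark_variant("\u0e15\u0e48\u0e33", 1)` gives "high", i.e. the tone mark has to sit above the circle of SARA AM rather than on top of it.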

Here are two screenshots of the PUA of Arial Unicode and FreeSerif, which are self-explanatory: FreeSerif's PUA is empty. I think the same problem occurs with your Nimbus font.

And a final observation. Incorrect fonts can be incorrect in different ways. Above I have described the more canonical case, where the standard positions of the tone marks are the upper positions, while the non-standard positions are shifted down (or are absent, which constitutes an incomplete font).
There are, however, fonts that behave the opposite way; they (only) contain tone marks in the lower positions. This is what you seem to observe.

score 0 · Answer 2 · answered Aug 03 '17 at 07:09

The problem is that PDFKit does not perform complex script rendering.
Several scripts, such as Arabic and Thai, require glyph substitution and repositioning depending on context (position in the string, neighboring characters), and PDFKit does not seem to do it.
PDF viewer applications display exactly what is defined in the PDF file. The Nimbus Sans Thai font probably includes all the required glyphs, but what bytebuster explains in his answer needs to be performed by PDFKit and not by the viewer application.

score 0 · Answer 3 · answered Apr 29 '23 at 08:29

Old thread, but I'll offer an explanation anyway... I was having similar problems doing a copy and paste from a Thai vocabulary list in PDF form.

I've discovered that the problem lies in the character set embedded in the PDF.

Copying the second line of the PDF to a UTF-16 converter, I get the following sequence:

รู%

\u0e23\u0e39\u0025\u000a

The same word, copied correctly from G-translate:

รู้

\u0e23\u0e49\u0e39\u000a

So in the PDF's font, the problem diacritic seems to be encoded as \u0025, but in translate as \u0e49, which is the official standard Unicode code point for that particular character, according to Wikipedia's page about the Thai script.

In another example, a character that looks like a pipe (actually Latin capital I, \u0049) shows up instead of the vowel sign above the consonant:
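If you don't have a converter at hand, the same dump can be produced with a few lines of Python. This is only a sketch; the helper name `escapes` is made up here, and it assumes all characters are in the Basic Multilingual Plane (true for Thai).

```python
def escapes(s: str) -> str:
    """Render each character of s as a \\uXXXX escape (BMP characters only)."""
    return "".join(f"\\u{ord(c):04x}" for c in s)


# Text as copied from the PDF vs. the correctly encoded word.
print(escapes("\u0e23\u0e39%\n"))
print(escapes("\u0e23\u0e39\u0e49\n"))
```

Comparing the two dumps side by side makes the substituted code point easy to spot.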

ฟIง

\u0e1f\u0049\u0e07\u000a

ฟัง

\u0e1f\u0e31\u0e07\u000a

\u0049 instead of \u0e31

It would be feasible to write a Python script that converts the characters in a text to their Unicode code points, replaces the problem codes with their correct counterparts, and converts the result back.

The problem here is that the position of the problem character in the first example actually differs (?!?!?!). Also, there are possibly up to 10 of these incorrectly encoded characters to find.
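A minimal sketch of such a script, assuming only the two substitutions observed above (`%` for \u0e49 and `I` for \u0e31). The names `BROKEN_TO_THAI` and `fix_thai` are made up for this example, the table would have to be extended by hand as further bad characters turn up, and the script does nothing about the ordering difference. To avoid touching real percent signs or Latin text, it only replaces a problem character that directly follows a Thai character.

```python
import re

# Observed in this PDF only; other PDFs or fonts may use different stand-ins.
BROKEN_TO_THAI = {
    "%": "\u0e49",  # MAI THO
    "I": "\u0e31",  # MAI HAN-AKAT
}

# A problem character counts only when it comes right after a Thai character.
_PATTERN = re.compile(
    "(?<=[\u0e00-\u0e7f])[" + re.escape("".join(BROKEN_TO_THAI)) + "]"
)


def fix_thai(text: str) -> str:
    """Replace known stand-in characters with the Thai marks they represent."""
    return _PATTERN.sub(lambda m: BROKEN_TO_THAI[m.group()], text)


print(fix_thai("\u0e23\u0e39%"))   # first example from the PDF
print(fix_thai("\u0e1fI\u0e07"))   # second example from the PDF
```

A string such as "100% I" is left alone because neither character follows a Thai one. A percent sign placed directly after a Thai word would still be converted, so the output needs a quick check by eye.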
