Right justify string containing Thai characters

Question

I would like to right-justify strings containing Thai characters. (Thai is not rendered strictly left to right; some characters are stacked above or below others.)

For example, for the strings ไป (two characters, length 2) and ซื้อ (four characters, length 2) I want to have the following output (length 5):

...ไป

...ซื้อ

The naive

print 'ไป'.decode('utf-8').rjust(5)

print 'ซื้อ'.decode('utf-8').rjust(5)

however, respectively produce

...ไป

.ซื้อ

Any ideas how to get to the desired formatting?

EDIT: Given a string of Thai characters tc, I want to determine how many [places/fields/positions/whatever you want to call it] the string uses. This is not the same as len(tc); len(tc) is usually larger than the number of places used. The second word gives len(tc) = 4, but has length 2 / uses 2 places / uses 2 positions.

not clear what software/language/environment this question is about. — owagh, Nov 29 '12 at 20:26
Language, environment? I'm on a MacBook Air, Python 2.7... is that the environment? — user1864353, Nov 29 '12 at 21:07

score 1 · Answer 1 · answered Feb 13 '16 at 00:52

Cause

Thai script contains normal characters (positive advance width) and non-spacing marks as well (zero advance width).

For example, in the word ซื้อ:

the first character is the initial consonant "SO SO",
then it has the vowel mark SARA UEE,
then tone mark MAI THO,
and then the final pseudo-consonant O ANG

The problem is that characters 2 and 3 in the list above are zero-width.
In other words, they do not make the string "wider".
Put differently, ซื้อ ("to buy") and ซอ ("fiddle") both have a width of two character places (but string lengths of 4 and 2, respectively).
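
As a quick check of the breakdown above, here is a minimal Python 3 sketch that lists each code point of ซื้อ with its Unicode name and General Category. It uses only the standard unicodedata module; the variable names are illustrative, and the word is written with escapes so the code points are unambiguous.

```python
import unicodedata

# The word "ซื้อ" spelled out as code points: SO SO, SARA UEE, MAI THO, O ANG.
word = "\u0e0b\u0e37\u0e49\u0e2d"

# Collect (code point, Unicode name, General Category) for every character.
breakdown = [
    (f"U+{ord(c):04X}", unicodedata.name(c), unicodedata.category(c))
    for c in word
]

for code_point, name, category in breakdown:
    print(code_point, name, category)
```

The two middle entries should come out as category Mn (non-spacing mark), while the first and last are Lo.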

Solution

In order to calculate the "real" string length, one must skip zero-width characters.

Python-specific

The unicodedata module provides access to the Unicode Character Database (UCD) which defines character properties for all Unicode characters. The data contained in this database is compiled from the UCD version 8.0.0.

The unicodedata.category(unichr) function returns the General Category of a character. Two values matter here:

"Lo" for normal (letter) characters;
"Mn" for zero-width non-spacing marks.

The rest is straightforward: filter out the latter.
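
One possible way to turn that into right justification is sketched below for Python 3. The names display_width and rjust_display are hypothetical helpers, not part of any library. The sketch assumes that every character outside category Mn takes exactly one cell, which holds for the Thai words in the question but not for all scripts, fonts or terminals.

```python
import unicodedata


def display_width(text):
    """Count characters that occupy a cell, skipping non-spacing marks (Mn)."""
    return sum(1 for c in text if unicodedata.category(c) != "Mn")


def rjust_display(text, width, fillchar=" "):
    """Right-justify text to `width` display cells rather than code points."""
    padding = max(0, width - display_width(text))
    return fillchar * padding + text


# Dots are used as the fill character so the padding is visible.
print(rjust_display("\u0e44\u0e1b", 5, "."))              # ไป
print(rjust_display("\u0e0b\u0e37\u0e49\u0e2d", 5, "."))  # ซื้อ
```

Both calls pad with three fill characters, because each word occupies two cells even though the second one has four code points.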

Further info:

Unicode data for Thai script (scroll till the first occurrence of "THAI CHARACTER")

score 0 · Answer 2 · answered Nov 29 '12 at 22:25

I think what you mean to ask is how to determine the 'true' number of characters in เรือ, ไป, ซื้อ etc. (which are 3, 2 and 2, respectively).

Unfortunately, here's how Python interprets these characters:

ไป

>>> 'ไป'
'\xe0\xb9\x84\xe0\xb8\x9b'
>>> len('ไป')
6
>>> len('ไป'.decode('utf-8'))
2

ซื้อ

>>> 'ซื้อ'
'\xe0\xb8\x8b\xe0\xb8\xb7\xe0\xb9\x89\xe0\xb8\xad'
>>> len('ซื้อ')
12
>>> len('ซื้อ'.decode('utf-8'))
4

เรือ

>>> 'เรือ'
'\xe0\xb9\x80\xe0\xb8\xa3\xe0\xb8\xb7\xe0\xb8\xad'

>>> len('เรือ')
12
>>> len('เรือ'.decode('utf-8'))
4

There's no real correlation between the # of characters displayed and the # of actual (from Python's perspective) characters that make up the string.
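
The session above is Python 2, where a plain string literal is a byte string. For readers on Python 3, where str is already Unicode text, the same comparison could be sketched roughly as follows (the variable names are illustrative):

```python
# "ซื้อ" written with escapes so the code points are unambiguous.
word = "\u0e0b\u0e37\u0e49\u0e2d"

# Encoding to UTF-8 gives the byte string that Python 2 showed directly.
utf8_bytes = word.encode("utf-8")

print(len(utf8_bytes))  # 12 bytes: each Thai code point takes 3 bytes in UTF-8
print(len(word))        # 4 code points
```

Neither number is the count of displayed positions, which is 2 for this word.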

I can't think of an obvious way to do this. However, I've found this library which might be of help to you. (You will also need to install some prerequisites.)

Thanks, Anuj Gupta. Reading through the suggested library functions, it is not clear to me that they will work for Thai; their focus is on East Asian languages. I think I will just implement such a true-length function myself by classifying the corresponding Unicode code points. — user1864353, Nov 30 '12 at 07:09

score 0 · Answer 3 · answered Feb 12 '16 at 22:38

It looks like the rjust() function will not work for you, so you will need to count the number of cells in the string yourself. You can then insert the required number of spaces before the string to achieve justification.

You seem to know the Thai language. Sum the number of consonants, preceding vowels, following vowels and Thai punctuation. Don't count diacritics or above and below vowels.

Something like the following, where string is assumed to hold decoded Unicode text rather than UTF-8 bytes:

cells = 0

for ch in string:
    code = ord(ch)
    # Skip MAI HAN-AKAT (U+0E31), the above/below vowels U+0E34..U+0E3A
    # and the diacritics U+0E47..U+0E4E; none of these take up a cell.
    if code == 0x0E31 or 0x0E34 <= code <= 0x0E3A or 0x0E47 <= code <= 0x0E4E:
        continue
    # Consonant, preceding or following vowel, or punctuation.
    cells += 1

score 0 · Answer 4 · Bruno Degomme · answered Oct 17 '19 at 13:54 (edited Oct 17 '19 at 14:04)

Here's a function to compute the length of a Thai string (the number of characters arranged horizontally), based on bytebuster's answer:

import unicodedata


def get_thai_string_length(string):
    length = 0
    for c in string:
        if unicodedata.category(c) != 'Mn':
            length += 1
    return length

print(len('บอินทัช'))  # number of code points
print(get_thai_string_length('บอินทัช'))  # number of horizontal character positions

