1

I would like to right-justify strings containing Thai characters (Thai is not rendered strictly left to right; some characters are stacked above or below others).

For example, for the strings ไป (two characters, length 2) and ซื้อ (four characters, length 2) I want to have the following output (length 5):

...ไป

...ซื้อ

The naive calls

print 'ไป'.decode('utf-8').rjust(5)

print 'ซื้อ'.decode('utf-8').rjust(5)

however, produce, respectively,

...ไป

.ซื้อ

Any ideas how to get to the desired formatting?

EDIT: Given a string of Thai characters tc, I want to determine how many [places/fields/positions/whatever you want to call it] the string uses. This is not the same as len(tc); len(tc) is usually larger than the number of places used. The second word gives len(tc) = 4, but has length 2 / uses 2 places / uses 2 positions.

hippietrail

4 Answers

1

Cause

Thai script contains normal characters (positive advance width) and non-spacing marks as well (zero advance width).

For example, in the word ซื้อ:

  1. the first character is the initial consonant "SO SO",
  2. then the vowel mark SARA UEE,
  3. then tone mark MAI THO,
  4. and then the final pseudo-consonant O ANG

The problem is that characters 2 and 3 in the list above are zero-width.
In other words, they do not make the string "wider".
Put differently, ซื้อ ("to buy") and ซอ ("fiddle") both have a width of two character places (but string lengths of 4 and 2, respectively).
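
This can be checked from Python with the standard unicodedata module. The following is only a sketch in Python 3; the helper name describe is made up for illustration. It lists the code point, Unicode name and general category of each character in the word:

```python
import unicodedata


def describe(text):
    """Return (code point, Unicode name, general category) per character."""
    return [
        ("U+%04X" % ord(ch), unicodedata.name(ch), unicodedata.category(ch))
        for ch in text
    ]


# One row per code point of the word discussed above.
for row in describe("ซื้อ"):
    print(row)
```

The two middle rows should come out with category Mn, the first and last with Lo.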

Solution

In order to calculate the "real" string length, one must skip zero-width characters.

Python-specific

The unicodedata module provides access to the Unicode Character Database (UCD), which defines character properties for all Unicode characters. The UCD version the module is compiled from depends on the Python release (the documentation quoted here cites version 8.0.0).

The unicodedata.category(unichr) function returns one of the General Category values. The two that matter here are:

  • "Lo" for normal letters;
  • "Mn" for zero-width non-spacing marks.

The rest is straightforward: simply filter out the latter.
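
A minimal sketch of that filtering in Python 3, applied to the right-justification from the question. The names display_width and thai_rjust are hypothetical, not part of any library, and the code assumes a monospaced rendering in which every character outside category Mn takes exactly one column (it ignores wide East Asian characters and other zero-width categories):

```python
import unicodedata


def display_width(text):
    """Count characters that are not non-spacing marks (category Mn)."""
    return sum(1 for ch in text if unicodedata.category(ch) != "Mn")


def thai_rjust(text, width, fillchar=" "):
    """Right-justify text to `width` columns, not counting Mn characters."""
    padding = max(0, width - display_width(text))
    return fillchar * padding + text


# Dots are used as the fill character so the padding is visible.
print(thai_rjust("ไป", 5, "."))
print(thai_rjust("ซื้อ", 5, "."))
```

Both words have a display width of 2 under this count, so both lines should get three fill characters.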



Be Brave Be Like Ukraine
0

I think what you mean to ask is how to determine the 'true' number of characters in เรือ, ไป, ซื้อ, etc. (which are 3, 2 and 2, respectively).

Unfortunately, here's how Python interprets these characters:

ไป

>>> 'ไป'
'\xe0\xb9\x84\xe0\xb8\x9b'
>>> len('ไป')
6
>>> len('ไป'.decode('utf-8'))
2

ซื้อ

>>> 'ซื้อ'
'\xe0\xb8\x8b\xe0\xb8\xb7\xe0\xb9\x89\xe0\xb8\xad'
>>> len('ซื้อ')
12
>>> len('ซื้อ'.decode('utf-8'))
4

เรือ

>>> 'เรือ'
'\xe0\xb9\x80\xe0\xb8\xa3\xe0\xb8\xb7\xe0\xb8\xad'

>>> len('เรือ')
12
>>> len('เรือ'.decode('utf-8'))
4

There's no real correlation between the # of characters displayed and the # of actual (from Python's perspective) characters that make up the string.
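
The sessions above are Python 2, where a plain string literal is a byte string. In Python 3, where str holds code points, the same comparison might look like the sketch below (the variable names are only illustrative):

```python
words = ["ไป", "ซื้อ", "เรือ"]

for word in words:
    utf8_bytes = len(word.encode("utf-8"))  # length of the UTF-8 encoding in bytes
    code_points = len(word)                 # number of Unicode code points
    print(word, utf8_bytes, code_points)
```

Neither number is the count of displayed character places, which is what the question asks about.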

I can't think of an obvious way to do this. However, I've found this library, which might be of help to you. (You will also need to install some prerequisites.)

Anuj Gupta
  • Thanks, Anuj Gupta. Reading through the suggested library functions, it is not clear to me that they will work for Thai; their focus is on East Asian languages. I think I will just implement such a true-length function myself by classifying the corresponding Unicode code points. – user1864353 Nov 30 '12 at 07:09
0

It looks like the rjust() function will not work for you, and you will need to count the number of cells in the string yourself. You can then insert the required number of spaces before the string to achieve justification.

You seem to know the Thai language. Sum the number of consonants, preceding vowels, following vowels and Thai punctuation. Don't count diacritics or above and below vowels.

Something like this (the function name is arbitrary; string must be a Unicode string, not UTF-8 bytes):

def count_cells(string):
    cells = 0
    for ch in string:
        code = ord(ch)
        if code == 0x0E31 or 0x0E34 <= code <= 0x0E3A or 0x0E47 <= code <= 0x0E4E:
            continue  # above or below vowel, or diacritic: occupies no cell
        # consonant, preceding or following vowel, or punctuation
        cells += 1
    return cells
koan
0

Here's a function to compute the length of a Thai string (the number of characters arranged horizontally), based on bytebuster's answer:

import unicodedata


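def get_thai_string_length(string):
    """Count the characters that are not non-spacing marks (category 'Mn')."""
    length = 0
    for c in string:
        # 'Mn' marks are drawn above or below a base character and add no width
        if unicodedata.category(c) != 'Mn':
            length += 1
    return length

print(len('บอินทัช'))  # 7 code points
print(get_thai_string_length('บอินทัช'))  # 5 character places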
Bruno Degomme