1

I'm processing Thai keyboard input. Some of the keys are vowel signs and only allowed when combined with certain preceding characters.

Here 0x0E33 is the vowel sign.

For example, 0x0E1C + 0x0E33 is valid,
but 0x0E44 + 0x0E33 is not valid, and the 0x0E33 should be ignored.

I'm looking for a way to tell when the vowel sign does not combine with the previous character, so that I know when to ignore it.

Any ideas please?

Be Brave Be Like Ukraine

2 Answers

0

Many Thai vowels (and tone marks, by the way) belong to the Nonspacing_Mark (Mn) general category. Use a library that tells you which category each character belongs to. You can then decide whether to "ignore" it, whatever "ignoring" means in your application.

See the Unicode General Category Values table.

Your two points of interest are:

  • Lo | Other_Letter for normal characters;
  • Mn | Nonspacing_Mark for zero-width non-spacing marks.
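
As a rough illustration of what such a category lookup returns for Thai, here is a minimal sketch in C. The helper `thai_general_category` and its enum are hypothetical names made up for this example, and the ranges are hand-copied from the Unicode Character Database for part of the Thai block only. In a real program a library call such as ICU's `u_charType` would replace the table.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical result type for this sketch: only the two categories
   discussed above, plus a catch-all for everything not in the table. */
enum thai_gc { THAI_GC_LO, THAI_GC_MN, THAI_GC_NOT_COVERED };

/* Hypothetical helper: General Category for part of the Thai block.
   The ranges are hand-copied assumptions, not a library lookup. */
enum thai_gc thai_general_category(uint32_t cp)
{
    assert(cp <= 0x10FFFF);                               /* valid code point */
    if (cp >= 0x0E01 && cp <= 0x0E30) return THAI_GC_LO;  /* consonants .. SARA A */
    if (cp == 0x0E31)                 return THAI_GC_MN;  /* MAI HAN-AKAT */
    if (cp == 0x0E32 || cp == 0x0E33) return THAI_GC_LO;  /* SARA AA, SARA AM */
    if (cp >= 0x0E34 && cp <= 0x0E3A) return THAI_GC_MN;  /* SARA I .. PHINTHU */
    if (cp >= 0x0E40 && cp <= 0x0E45) return THAI_GC_LO;  /* SARA E .. LAKKHANGYAO */
    if (cp >= 0x0E47 && cp <= 0x0E4E) return THAI_GC_MN;  /* MAITAIKHU, tone marks, signs */
    return THAI_GC_NOT_COVERED;                           /* outside this sketch */
}
```

With this table, 0x0E1C, 0x0E33 and 0x0E44 all come back as Lo, while a sign such as 0x0E34 comes back as Mn, so the general category alone separates the zero-width marks from the rest but not 0x0E33 from an ordinary letter.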

Be Brave Be Like Ukraine
  • Yes, I can use the ICU library to get that information. However, the Thai vowel can only be combined with certain preceding letters. Otherwise it takes up space (and is an incorrect combination). I'm trying to figure out a generic way to determine whether it's a valid combination. –  Feb 12 '17 at 08:38
  • Looking at the link you provided (Unicode data for Thai script), I'm not sure how your comment helps. The category value of all three characters in my example is the same (Lo). –  Feb 12 '17 at 11:33
0

I know this thread is from a few years ago, but this is what I have come up with using the ICU library. I suspect it can be improved ...

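#include <unicode/uchar.h>

UChar32 newChar;   // the character that was just typed
UChar32 prevChar;  // the character immediately before it

// Characters with no special grapheme-cluster behaviour need no check.
int32_t gcb = u_getIntPropertyValue(newChar, UCHAR_GRAPHEME_CLUSTER_BREAK);
if (gcb != U_GCB_OTHER)
{
    // Dependent vowels and tone marks must attach to a consonant.
    int32_t insc = u_getIntPropertyValue(newChar, UCHAR_INDIC_SYLLABIC_CATEGORY);
    if (insc == U_INSC_VOWEL_DEPENDENT || insc == U_INSC_TONE_MARK)
    {
        if (u_getIntPropertyValue(prevChar, UCHAR_INDIC_SYLLABIC_CATEGORY) != U_INSC_CONSONANT)
        {
            // invalid combination, ignore
        }
    }
}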
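
For trying the same rule without ICU installed, here is a minimal self-contained sketch. The names `thai_classify` and `thai_should_ignore` are hypothetical, and the three ranges are a coarse hand-written table assumed for this example rather than the Unicode property data the snippet above reads through ICU.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Coarse, hand-written classes for this sketch only (an assumption,
   not the Unicode Indic_Syllabic_Category data). */
enum thai_class {
    THAI_OTHER,
    THAI_CONSONANT,
    THAI_FOLLOWING_VOWEL,
    THAI_TONE_MARK
};

enum thai_class thai_classify(uint32_t cp)
{
    assert(cp <= 0x10FFFF);                                        /* valid code point */
    if (cp >= 0x0E01 && cp <= 0x0E2E) return THAI_CONSONANT;       /* KO KAI .. HO NOKHUK */
    if (cp >= 0x0E30 && cp <= 0x0E39) return THAI_FOLLOWING_VOWEL; /* SARA A .. SARA UU, incl. SARA AM */
    if (cp >= 0x0E48 && cp <= 0x0E4B) return THAI_TONE_MARK;       /* MAI EK .. MAI CHATTAWA */
    return THAI_OTHER;
}

/* Returns true when next_char is a following vowel or tone mark
   and prev_char is not a consonant, i.e. the key should be dropped. */
bool thai_should_ignore(uint32_t prev_char, uint32_t next_char)
{
    enum thai_class next = thai_classify(next_char);
    if (next != THAI_FOLLOWING_VOWEL && next != THAI_TONE_MARK)
        return false;  /* not a dependent sign: nothing to check */
    return thai_classify(prev_char) != THAI_CONSONANT;
}
```

For the pairs in the question, `thai_should_ignore(0x0E1C, 0x0E33)` is false and `thai_should_ignore(0x0E44, 0x0E33)` is true. Like the snippet above, this also rejects a tone mark typed after a vowel sign (for example 0x0E35 followed by 0x0E48), so a real keyboard handler would probably need a looser rule for tone marks.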
Michael T