TextElement Enumerator Class Bug or (Tamil) Unicode Bug

Question

Why is TextElementEnumerator not parsing Tamil Unicode characters properly?

using System;
using System.Collections.Generic;
using System.Globalization;

namespace Glyphtest
{
    internal class Program
    {
        private static void Main()
        {
            const string unicodetxt1 = "ஊரவர் கெளவை";
            List<string> output = Syllabify(unicodetxt1);
            Console.WriteLine(output.Count);
            const string unicodetxt2 = "கௌவை";
            output = Syllabify(unicodetxt2);
            Console.WriteLine(output.Count);
        }

        public static List<string> Syllabify(string unicodetext)
        {
            if (string.IsNullOrEmpty(unicodetext)) return null;
            TextElementEnumerator enumerator = StringInfo.GetTextElementEnumerator(unicodetext);
            var data = new List<string>();
            while (enumerator.MoveNext())
                data.Add(enumerator.Current.ToString());
            return data;
        }
    }
}

The code sample above deals with the following Unicode character sequences:

'கௌ' -> 0x0b95 (க) + 0x0bcc (ௌ) (correct form)

'கெள' -> 0x0b95 (க) + 0x0bc6 (ெ) + 0x0bb3 (ள) (incorrect form)
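To check which code points a string really contains, it can help to dump them instead of relying on how the text renders. The following is a minimal sketch; the class name CodePointDump and the method Describe are hypothetical helpers, not part of the original program. It assumes all characters are in the Basic Multilingual Plane, which holds for the Tamil block.

```csharp
using System;
using System.Linq;

internal static class CodePointDump
{
    // Hypothetical helper: formats each UTF-16 code unit as U+XXXX.
    // Tamil characters are all in the BMP, so one code unit is one code point here.
    public static string Describe(string text)
    {
        return string.Join(" ", text.Select(c => "U+" + ((int)c).ToString("X4")));
    }

    private static void Main()
    {
        // Two-code-point spelling: consonant KA + vowel sign AU.
        Console.WriteLine(Describe("\u0b95\u0bcc"));       // U+0B95 U+0BCC
        // Three-code-point look-alike: KA + vowel sign E + letter LLA.
        Console.WriteLine(Describe("\u0b95\u0bc6\u0bb3")); // U+0B95 U+0BC6 U+0BB3
    }
}
```

Both strings render almost identically, so comparing the dumps is a quick way to see which spelling a given input uses.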

Is this a bug in the TextElementEnumerator class? Why does it not enumerate the string properly?

That is, கெளவை => 'கெள' + 'வை' is how it should be enumerated (correct form),

and கெளவை => 'கெ' + 'ள' + 'வை' is how it should not be enumerated (incorrect form).

If it is a bug, how can I work around it?

Run the code and inspect the contents of the output list in the debugger to see how the characters are enumerated in the incorrect form. — Arunkumar Chandrasekaran, Sep 24 '13 at 12:34
The first one gives 8 whereas the second one gives 2. What is your question about that? Which one is correct? The first? — Sriram Sakthivel, Sep 24 '13 at 12:40
Oh god, 'கௌ' is a single visual glyph; it is not the two visual glyphs 'கெ' and 'ள'. Please use charmap on Windows with the Latha font and see the difference. — Arunkumar Chandrasekaran, Sep 24 '13 at 12:43
Oh god, I am Tamil too, man. But please let me know what the problem is. I have asked twice already. At least tell me what the expected output is. — Sriram Sakthivel, Sep 24 '13 at 12:48

score 1 · Accepted Answer · answered Oct 11 '13 at 07:06

This is not a bug in Unicode or in the TextElementEnumerator class. It is specific to the language (Tamil).

A Tamil letter can be typed as a consonant followed by separate glyphs that only look like a vowel sign. For example,

க (\u0b95) + ெ (\u0bc6) + ள (\u0bb3)

forms 'கெள', which looks the same as the glyph formed by

க (\u0b95) + ௌ (\u0bcc)

and the latter is the right form. Hence, before enumerating Tamil text, we have to replace the irregular formation with the regular one.

According to the Tamil grammar rule ஔகாரக் குறுக்கம், the vowel sign (ௌ) occurs on the starting letter of a word, which is why the replacement is anchored to the start of each word.

So the code above should be changed as follows:

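using System;
using System.Collections.Generic;
using System.Globalization;
using System.Text.RegularExpressions;

namespace Glyphtest
{
    internal class Program
    {
        // Word-initial consonant (U+0B95..U+0BB9) + U+0BC6 (ெ) + U+0BB3 (ள).
        // The lookahead is a heuristic: it skips a ள that carries its own vowel sign
        // or pulli (as in வெள்ளை or தெளிவு), where ள is a genuine letter.
        private static readonly Regex VisualAuPattern =
            new Regex("^([\u0b95-\u0bb9])\u0bc6\u0bb3(?![\u0bbe-\u0bcd\u0bd7])");

        private static void Main()
        {
            const string unicodetxt1 = "ஊரவர் கெளவை";
            List<string> output = Syllabify(unicodetxt1);
            Console.WriteLine(output.Count);
            const string unicodetxt2 = "கௌவை";
            output = Syllabify(unicodetxt2);
            Console.WriteLine(output.Count);
        }

        // Rewrites the look-alike sequence to consonant + U+0BCC (ௌ), word by word.
        // Note: runs of whitespace between words collapse to a single space.
        public static string CheckVisualGlyphPattern(string txt)
        {
            string[] words = txt.Split(new[] { ' ', '\r', '\n' }, StringSplitOptions.RemoveEmptyEntries);
            var fixedWords = new List<string>();
            foreach (string word in words)
                fixedWords.Add(VisualAuPattern.Replace(word, "$1\u0bcc"));
            return string.Join(" ", fixedWords);
        }

        public static List<string> Syllabify(string unicodetext)
        {
            // Check for null before normalizing; Split would throw on a null string.
            if (string.IsNullOrEmpty(unicodetext)) return null;
            string processdata = CheckVisualGlyphPattern(unicodetext);
            if (string.IsNullOrEmpty(processdata)) return null;
            TextElementEnumerator enumerator = StringInfo.GetTextElementEnumerator(processdata);
            var data = new List<string>();
            while (enumerator.MoveNext())
                data.Add(enumerator.Current.ToString());
            return data;
        }
    }
}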

This produces the appropriate visual glyph while enumerating.

score 0 · Answer 2 · answered Sep 24 '13 at 14:27

U+0BB3 ᴛᴀᴍɪʟ ʟᴇᴛᴛᴇʀ ʟʟᴀ has Grapheme_Cluster_Break=XX (Other). This makes <U+0B95 U+0BC6><U+0BB3> the correct grapheme clusters, since there is always a grapheme cluster break before characters with Grapheme_Cluster_Break equal to Other.

<U+0B95 U+0BCC> has no internal grapheme cluster breaks because U+0BCC has Grapheme_Cluster_Break=SpacingMark and there are usually no breaks before such characters (exceptions are at the start of text or when preceded by a control character).

Well, at least this is what the Unicode standard has to say (http://www.unicode.org/reports/tr29/#Grapheme_Cluster_Boundaries).
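As far as I know, .NET does not expose the Grapheme_Cluster_Break property directly, but the general category from CharUnicodeInfo.GetUnicodeCategory is a rough proxy for the distinction described above: a letter starts a new text element, while a spacing combining mark attaches to the preceding base. A minimal sketch, where the class name CategoryDump and the method CategoryOf are hypothetical helpers:

```csharp
using System;
using System.Globalization;

internal static class CategoryDump
{
    // Hypothetical helper: thin wrapper over the framework's general-category lookup.
    public static UnicodeCategory CategoryOf(char c)
    {
        return CharUnicodeInfo.GetUnicodeCategory(c);
    }

    private static void Main()
    {
        // U+0BB3 is a letter (OtherLetter); U+0BC6 and U+0BCC are
        // spacing combining marks (SpacingCombiningMark).
        foreach (char c in new[] { '\u0bb3', '\u0bc6', '\u0bcc' })
            Console.WriteLine("U+{0:X4} {1}", (int)c, CategoryOf(c));
    }
}
```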

Now, I have no idea of how Tamil works, so take what follows with a pinch of salt.

U+0BCC decomposes into <U+0BC6 U+0BD7>, meaning the two sequences (<U+0B95 U+0BC6 U+0BB3> and <U+0B95 U+0BCC>) are not canonically equivalent, so there is no requirement for grapheme cluster segmentation to yield the same results.
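The decomposition claim can be checked with string.Normalize. This is a small sketch rather than a full equivalence test; the class name NormalizationCheck and the method CanonicallyEquivalent are hypothetical, and the comparison simply checks whether both strings have the same NFD form.

```csharp
using System;
using System.Text;

internal static class NormalizationCheck
{
    // Hypothetical helper: two strings are canonically equivalent
    // when their NFD forms are identical code point by code point.
    public static bool CanonicallyEquivalent(string a, string b)
    {
        return string.Equals(
            a.Normalize(NormalizationForm.FormD),
            b.Normalize(NormalizationForm.FormD),
            StringComparison.Ordinal);
    }

    private static void Main()
    {
        // NFD splits U+0BCC into U+0BC6 followed by U+0BD7 (not U+0BB3).
        foreach (char c in "\u0bcc".Normalize(NormalizationForm.FormD))
            Console.Write("U+{0:X4} ", (int)c);
        Console.WriteLine();

        // The look-alike spelling with U+0BB3 is therefore a different string.
        Console.WriteLine(CanonicallyEquivalent("\u0b95\u0bc6\u0bb3", "\u0b95\u0bcc")); // False
    }
}
```

Because normalization never maps U+0BB3 to U+0BD7, calling Normalize alone will not repair the look-alike spelling; an explicit replacement is still needed.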

When I look at it with my Tamil-ignorant eyes, it seems the right-hand part of U+0BCC ᴛᴀᴍɪʟ ᴠᴏᴡᴇʟ sɪɢɴ ᴀᴜ (the part encoded separately as U+0BD7 ᴛᴀᴍɪʟ ᴀᴜ ʟᴇɴɢᴛʜ ᴍᴀʀᴋ) and U+0BB3 ᴛᴀᴍɪʟ ʟᴇᴛᴛᴇʀ ʟʟᴀ look exactly the same. However, U+0BCC is a spacing mark, but U+0BB3 isn't. If you use U+0BCC in the input instead of <U+0BC6 U+0BB3>, the result is what you expected.

Going out on a limb, I will say that you are using the wrong character but, again, I don't know Tamil at all so I can't be sure.
