I ran the same code below on API level 16 and on API level 21. On API level 16 the dictionary-based break iterator (tokenizer) does not seem to work, while on API level 21 it works properly.

BreakIterator it = BreakIterator.getWordInstance();
String txt = "我们一起";
it.setText(txt);
int start = it.first();
int end = it.next();

StringBuffer buf = new StringBuffer();

while (end != BreakIterator.DONE) {
    String word = txt.substring(start,end).trim();
    if (!word.isEmpty()) {
        buf.append(word);
        buf.append("+");
    }
    start = end;
    end = it.next();
}

vw.setText(buf);

On API level 21, the text view shows the following ("我们" is one word and "一起" is another):

   我们+一起+

However, on API level 16 it shows the following (each Chinese character is treated as a separate word):

   我+们+一+起+

So I suspect that API level 21 has the dictionary-based iterator enabled, while earlier API levels do not.

However, after searching the Android C++ source code, I found that the key function RuleBasedBreakIterator::checkDictionary is present in rbbi.cpp at both API levels. This suggests that both API levels should support the dictionary-based iterator. I also suspect that the difference comes from the category values assigned to each character set, which may differ between the two API levels. However, I have not been able to trace how these values are set or whether they actually differ.

My question is: how can I confirm whether the implementation was actually enhanced in API level 21?
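
A minimal sketch of one way to check the behavior from the Java side, so the same probe can be run on each API level. It only observes the boundaries that `BreakIterator` reports; it does not inspect ICU internals. The class and method names (`SegmentationProbe`, `segments`, `hasMultiCharWord`) are hypothetical, and treating "any segment longer than one code point" as a sign of dictionary-based breaking is an assumption that is only meaningful for input made up entirely of Han characters.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;

class SegmentationProbe {

    // Splits text at the boundaries reported by the iterator and
    // drops segments that are empty after trimming (e.g. spaces).
    public static List<String> segments(BreakIterator it, String text) {
        List<String> out = new ArrayList<>();
        it.setText(text);
        int start = it.first();
        for (int end = it.next(); end != BreakIterator.DONE; start = end, end = it.next()) {
            String word = text.substring(start, end).trim();
            if (!word.isEmpty()) {
                out.add(word);
            }
        }
        return out;
    }

    // Heuristic (assumption of this sketch): for all-Han input, a segment
    // longer than one code point suggests dictionary-based word breaking.
    public static boolean hasMultiCharWord(List<String> words) {
        for (String w : words) {
            if (w.codePointCount(0, w.length()) > 1) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        List<String> words = segments(BreakIterator.getWordInstance(), "我们一起");
        System.out.println(words + " multiCharWord=" + hasMultiCharWord(words));
    }
}
```

On Android the result could be logged together with `Build.VERSION.SDK_INT` to compare devices; the output depends on the platform's break iterator implementation, so none is stated here.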

Gordon Liang
  • did you try: `BreakIterator.getWordInstance(Locale);` ? – pskink Nov 05 '15 at 20:38
  • Yes I tried, no matter what Locale I pass, the result is the same. – Gordon Liang Nov 05 '15 at 21:08
  • i assume you tried some Chinese locales? i have no idea about Chinese language but there are several Chinese locales, right? – pskink Nov 05 '15 at 21:15
  • I tried `Locale.CHINA`, `Locale.CHINESE`, `Locale.SIMPLIFIED_CHINESE` as well as `Locale.TRADITIONAL_CHINESE`, and all of them behave the same. Actually, from the code I have read, I think it works by determining the character set of each character and applying the rules accordingly. I don't know how passing a Locale to getWordInstance() affects the result. – Gordon Liang Nov 06 '15 at 02:58
  • btw did you use the chinese `Locale` that is on the list returned by `public static Locale[] BreakIterator#getAvailableLocales ()` ? or you just used `Locale.SIMPLIFIED_CHINESE` etc ? i don't know if they are the same locales, also there is a nice info on this topic [here](http://userguide.icu-project.org/boundaryanalysis#TOC-Dictionary-Based-BreakIterator) – pskink Nov 06 '15 at 11:23
  • @GordonLiang Have you found out what was wrong? – Henry Dec 22 '16 at 06:44
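
The locale suggestion from the comments can be sketched as follows: take the Chinese locales that `BreakIterator.getAvailableLocales()` actually reports, instead of the `Locale` constants, and count the segments each one produces. The names `ChineseLocaleProbe`, `chineseBreakLocales` and `countSegments` are hypothetical, and filtering on the language code `"zh"` is an assumption about how those locales are tagged.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

class ChineseLocaleProbe {

    // Locales reported by BreakIterator whose language code is "zh".
    public static List<Locale> chineseBreakLocales() {
        List<Locale> out = new ArrayList<>();
        for (Locale loc : BreakIterator.getAvailableLocales()) {
            if ("zh".equals(loc.getLanguage())) {
                out.add(loc);
            }
        }
        return out;
    }

    // Number of segments (including whitespace segments) that the word
    // iterator for the given locale reports for the text.
    public static int countSegments(Locale loc, String text) {
        BreakIterator it = BreakIterator.getWordInstance(loc);
        it.setText(text);
        int count = 0;
        it.first();
        while (it.next() != BreakIterator.DONE) {
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        for (Locale loc : chineseBreakLocales()) {
            System.out.println(loc + " -> " + countSegments(loc, "我们一起") + " segments");
        }
    }
}
```

If every reported locale yields the same count on a given API level, the locale argument is unlikely to be what changes the segmentation.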
