I'm writing a (supposedly simple) JFlex tokenizer that takes a string and separates the parts written in Chinese (or rather, in the Han script) from the parts written in a Latin script. The tokenizer is applied to brand names, and in my use case a brand name may contain both its Latin and its Chinese form, e.g. "Lenovo 联想".
Brand names can further contain numbers (7up), hyphens (Hewlett-Packard), ampersands (P&G), etc. My tokenizer mostly works, except for cases where the names in Chinese and non-Chinese are written together without any space or separation. Specifically, these are examples of successful and unsuccessful parses:
"Calvin Klein卡尔文.克莱" - successfully split into "Calvin Klein" and "卡尔文.克莱", and they get tagged as having the expected script (Latin and Han)
"圣威廉SAINT WILLIAM" - wrongly split into "圣威廉SAINT" (marked as Han chars) and "WILLIAM" (marked as Latin).
"史努比SNOOPY" - wrongly considered a single Han token.
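To pin down the behaviour I'm after, here is a rough reference sketch that uses java.util.regex instead of JFlex. It is only an approximation under a few assumptions: Unicode script properties (\p{IsLatin}, \p{IsHan}) stand in for my hand-written ranges, the class name ScriptSplitSketch, the method split and the "Han:"/"Latin:" prefixes are made up for illustration, and java.util.regex takes the first alternative that matches rather than the longest match as JFlex does, so the Han alternative is tried first.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class ScriptSplitSketch {
    // Latin letters and digits; script properties replace explicit ranges.
    private static final String LATIN_TOK = "[\\p{IsLatin}0-9]+";
    // Han characters and digits, with at least one Han character.
    private static final String HAN_TOK = "[\\p{IsHan}0-9]*\\p{IsHan}[\\p{IsHan}0-9]*";

    // Han tokens may be joined by '.'; Latin tokens by whitespace or by one
    // of the "middle" punctuation marks, with an optional trailing '.'.
    private static final Pattern TOKEN = Pattern.compile(
        "(?<han>" + HAN_TOK + "(?:\\." + HAN_TOK + ")*)"
        + "|(?<latin>" + LATIN_TOK + "(?:(?:\\s+|[&.\\-'`\u2018])" + LATIN_TOK + ")*\\.?)");

    // Returns the tokens in order, each prefixed with its script tag.
    static List<String> split(String brand) {
        List<String> tokens = new ArrayList<>();
        Matcher m = TOKEN.matcher(brand);
        while (m.find()) {
            String script = m.group("han") != null ? "Han" : "Latin";
            tokens.add(script + ":" + m.group());
        }
        return tokens;
    }

    public static void main(String[] args) {
        // Each command-line argument is treated as one brand name.
        for (String brand : args) {
            System.out.println(split(brand));
        }
    }
}
```

With this sketch, a name such as "圣威廉SAINT WILLIAM" should come out as one Han token followed by one Latin token, which is what I expected from the JFlex rules as well.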
I thought my rules were pretty unambiguous, but the results seem to indicate otherwise. Here's my rule set:
digit = [0-9]
whitespace = [ \t\r\n] | \r\n
latin = [\u0041-\u005a\u0061-\u007a\u00c0-\u00d6\u00d8-\u00f6\u00f8-\u01bf\u01c4-\u024f]
han = [\u3400-\u9fff\uf900-\ufaff\u2f800-\u2fa1f]
// Punctuation in the middle or end of string sequences in a particular script
latin_middle = [&.\-'`‘]
latin_end = [.]
han_middle = [.]
// A basic Latin token contains a mixture of Latin characters and possibly digits.
basic_latin_tok = ({latin} | {digit})+
compound_latin_tok = {basic_latin_tok} (({whitespace}+ | {latin_middle}) {basic_latin_tok})*{latin_end}?
basic_han_tok = {han}({han} | {digit})*
| ({han} | {digit})*{han}
compound_han_tok = {basic_han_tok}({han_middle}{basic_han_tok})*
%%
{compound_latin_tok} { return "Latin"; }
{compound_han_tok} { return "Han"; }
. { /* skip everything else */ }
What am I doing wrong?
Thanks!!
EDIT
I asked on the JFlex mailing list on SourceForge, and one of the members replied: it turns out that JFlex 1.4.* can't handle Unicode characters that are not representable in 16 bits. The last Han range I specified above (\u2f800-\u2fa1f) lies beyond 16-bit values, so JFlex gets confused. Removing that range from the regex made it all work nicely.
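My best guess at why the over-long escape produced exactly these symptoms (an assumption on my part, based on the manual describing \u as taking four hex digits): \u2f800-\u2fa1f would be read as the character U+2F80, then the range '0' to U+2FA1, then the letter 'f'. That range covers every ASCII digit and Latin letter, so {han} would also match "SNOOPY". Here is a small Java check of that reading; the class and method names are hypothetical.

```java
class HanRangeCheck {
    // The han macro as intended: three ranges, the last beyond 16 bits.
    static boolean intendedHan(int cp) {
        return (cp >= 0x3400 && cp <= 0x9FFF)
            || (cp >= 0xF900 && cp <= 0xFAFF)
            || (cp >= 0x2F800 && cp <= 0x2FA1F);
    }

    // The han macro under the assumed four-hex-digit reading of the escape:
    // the single character U+2F80, the range '0'..U+2FA1, and the letter 'f'.
    static boolean misreadHan(int cp) {
        return (cp >= 0x3400 && cp <= 0x9FFF)
            || (cp >= 0xF900 && cp <= 0xFAFF)
            || cp == 0x2F80
            || (cp >= '0' && cp <= 0x2FA1)
            || cp == 'f';
    }

    public static void main(String[] args) {
        // Compare both readings for some ASCII characters from brand names.
        for (char c : "SNOOPY 7&".toCharArray()) {
            System.out.println("'" + c + "' intended=" + intendedHan(c)
                + " misread=" + misreadHan(c));
        }
    }
}
```

If that reading is right, it matches all three cases above. A space is still outside the misread class, so "Calvin Klein" remains the longest match for the Latin rule, while "SAINT" directly after "圣威廉" gets absorbed into the Han token. "WILLIAM" matches both rules at the same length, and the Latin rule wins because it is listed first.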
For reference: http://jflex.de/manual.html#SECTION000101000000000000000