
What's the analog of regex for CJK character sets? Are ASCII or Latin letter-like characters qualitatively different from CJK characters?

qazwsx

1 Answer


What's the analog of regex for CJK character sets?

Regex. It has always been capable of working with different character sets, but this becomes much simpler and more reliable with Unicode.

What language/environment are you using? Most modern implementations support Unicode characters, though some lack extended features such as \p{...} for character classes.
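As a rough illustration, here is a minimal Python sketch of matching runs of CJK text with an ordinary regex. Python's built-in `re` module is one of the implementations without `\p{...}`, so the sketch spells out code point ranges by hand. The choice of blocks (the basic CJK Unified Ideographs block, Hiragana, Katakana, Hangul Syllables) and the names `CJK_RUN` and `find_cjk_runs` are assumptions made for this example, and the ranges are deliberately incomplete: extension blocks, compatibility ideographs, and halfwidth forms are left out.

```python
import re

# Explicit code point ranges, because the built-in re module has no \p{...}.
# This block selection is illustrative, not an exhaustive definition of "CJK".
CJK_RUN = re.compile(
    "["
    "\u4e00-\u9fff"   # CJK Unified Ideographs (basic block only)
    "\u3040-\u309f"   # Hiragana
    "\u30a0-\u30ff"   # Katakana
    "\uac00-\ud7a3"   # Hangul Syllables
    "]+"
)

def find_cjk_runs(text):
    """Return every maximal run of characters from the ranges above."""
    return CJK_RUN.findall(text)

print(find_cjk_runs("Tokyo 東京 is とうきょう, Seoul is 서울."))
# ['東京', 'とうきょう', '서울']
```

In an engine that does support Unicode properties, a script property such as `\p{Han}` would usually be preferable to hand-written ranges; in Python, the third-party `regex` module offers this.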

Are ASCII or Latin letter-like characters qualitatively different from CJK characters?

CJK ideographs and syllabaries don't have upper and lower case, so they're members of the 'Letter, Other' (Lo) category rather than 'Letter, Uppercase' (Lu) or 'Letter, Lowercase' (Ll), as most Latin letters are. They also have different line-breaking properties.

bobince
  • What's the definition of the 'Letter, Other', 'Letter, Uppercase', and 'Letter, Lowercase' categories? Are these concepts defined somewhere? – qazwsx Sep 07 '12 at 01:47
  • Yes, they're part of the Unicode Character Database. See [UAX#44](http://unicode.org/reports/tr44/tr44-4.html) – bobince Sep 07 '12 at 16:42