I am trying to build top-frequency word tables for many languages. I read Wikipedia text and isolate the words. To test whether a character is alphanumeric I use u_isalnum from ICU (C++), which takes a 32-bit code point as its parameter. It works correctly for Latin characters (English) and extended Latin (Polish), and I expect it will also work for Greek, Russian, Hebrew, Arabic, etc.
But what about Chinese and Japanese? There I must collect single characters, not runs of characters up to the next space or punctuation mark. How do I detect that a Unicode code point is an ideogram?
My first simple solution: manually check whether the code point falls within the Chinese and Japanese ranges, but there may be more ideogram ranges than I know of.
Saku
Note: all of them are characters; ideogram is just a subcategory. Is Hiragana made of ideograms? (Probably not: the characters originated from ideograms, but now describe syllables.) What about modern Korean? No (it is an artificial script). And some languages are weirder. So why do you need this? Maybe it is better to use "script" as the unit. Do you handle Egyptian scripts like Kanji? (Second recommendation: do not overgeneralize human matters. Handle just the scripts you need, where you have at least a minimal knowledge of the structure and problems.) – Giacomo Catenazzi Jul 25 '23 at 08:05
1 Answer
East Asian characters mostly have the Unicode general category Lo (Letter, other). According to the documentation, that is sufficient for u_isalnum to return true, so it should be perfectly fine to keep using u_isalnum in a first iteration to match strings of words.
To then split those strings into single words, you may need a word list for comparison. Search for "Chinese word segmentation"; I would be surprised if at least part of the problem were not already solved. But beware that it may lead you swiftly into natural language processing territory.

Boldewyn
Segmentation goes way beyond the scope of my program; instead I will check for Chinese/Japanese characters and print a warning. But thanks, segmentation is an interesting problem. – Saku Jul 24 '23 at 14:56