Which Japanese sorting / collation orders are supported by ICU / CLDR / UCA?

Question

The Japanese language, I believe, has more than one sort order equivalent to alphabetical order in English.

I believe there's at least one based on pronunciation (I think the kana have used two orders historically) and one based on radical + stroke count. Chinese also has multiple orders, one of them based on radical/stroke, but due to Unicode Han Unification the same character can have a different stroke count in Chinese and in Japanese.

I believe the standard for sort order in Unicode is the CLDR for the data with the UCA for the algorithm, and that the reference implementation is ICU.

Implementations generally lag behind standards and this information is proving hard to track down to canonical sources.

If I set up a collator with the language specifier ja, which sort order should I expect to be used?

If several are available for Japanese, or are planned to be available at some point, which specifiers should be used for those? For example, the specifier for the traditional alphabetical order of Spanish is `es-u-co-trad`.

The trouble with Kanji is covered [pretty well here](http://www.localizingjapan.com/blog/2011/02/13/sorting-in-japanese-%E2%80%94-an-unsolved-problem/). — Hans Passant, Apr 26 '15 at 07:53
Yes I'm sure there's no perfect solution given the obstacles, but I still want to know how many "good as we can do" solutions are standardized and named and what specific limits each has. — hippietrail, Apr 26 '15 at 07:56

score 3 · Answer 1 · answered Apr 26 '15 at 11:23

The basic Japanese sort order provided by the CLDR (and therefore ICU) is based on the sort order specified in JIS X 4061-1996:

- Kana are sorted by their gojuuon (五十音) order (with Hiragana preceding Katakana).
- Kanji are sorted by their order in JIS X 0208, which is by their "representative reading" (and following all Kana).

A `ja-u-co-unihan` collation is also available, which includes the rules for sorting radicals by their stroke order (followed by the standard rules above). This is only useful if you are actually sorting radicals.
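
As a rough sketch of how the two specifiers might be requested, here is how it could look through `Intl.Collator`, which is usually backed by ICU. This assumes the runtime ships collation data for `ja`; if a requested collation type is not available, the runtime is expected to fall back quietly, so it is worth checking `resolvedOptions()` rather than trusting the tag you passed in. The sample strings are only illustrative.

```typescript
// Default Japanese collation (assumes the runtime has ICU data for "ja").
const standard = new Intl.Collator("ja");
// Same language, but requesting the "unihan" collation type via the "co" key.
const unihan = new Intl.Collator("ja-u-co-unihan");

// What the runtime actually resolved. An unsupported collation type is
// normally reported as "default" instead of raising an error.
const resolvedStandard: string = standard.resolvedOptions().collation;
const resolvedUnihan: string = unihan.resolvedOptions().collation;

// Kana come out in gojuuon order.
const sortedKana: string[] = ["さ", "か", "あ"].slice().sort(standard.compare);

// Kana sort ahead of Kanji.
const kanaBeforeKanji: boolean = standard.compare("あ", "亜") < 0;

console.log(resolvedStandard, resolvedUnihan, sortedKana, kanaBeforeKanji);
```

If `resolvedUnihan` does not come back as `unihan`, the collator is using the standard rules described above.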

If you need more accurate sorting of Kanji (for instance, by the reading of the words they are used in), you will need to perform some kind of morphological analysis with a dictionary to figure out what readings to use, and then apply the Unicode Collation Algorithm to those readings.
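
A minimal sketch of that last step, assuming the readings have already been worked out: the `readings` table below is a hypothetical, hand-written stand-in for the output of a real morphological analyzer, and `sortByReading` is just an illustrative helper name. The idea is simply to compare the kana readings instead of the Kanji themselves.

```typescript
// Hypothetical word -> kana reading table. In practice these readings
// would come from a morphological analyzer plus a dictionary.
const readings: Record<string, string> = {
  "東京": "とうきょう",
  "大阪": "おおさか",
  "京都": "きょうと",
};

const readingCollator = new Intl.Collator("ja");

// Sort words by their reading; words without a known reading are compared
// as-is, and ties on the reading are broken by the written form.
function sortByReading(words: string[]): string[] {
  return words.slice().sort((a, b) => {
    const ra = readings[a] ?? a;
    const rb = readings[b] ?? b;
    return readingCollator.compare(ra, rb) || readingCollator.compare(a, b);
  });
}

const sortedCities = sortByReading(["東京", "大阪", "京都"]);
console.log(sortedCities); // → ["大阪", "京都", "東京"]
```

How good the result is depends entirely on the quality of the readings, since the collator only ever sees kana here.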

一二三

Thanks for this info! I'm providing a list sorting extension for Wiktionary and need to let the Japanese experts there know what the options are and whether their preferred sort order is possible to do automatically. What is specified to happen for CJKV characters not covered by `JIS X 4061-1996`, assuming it does not cover all han characters? – hippietrail Apr 26 '15 at 12:13
All other CJKV characters ("only" 6,355 are specified) fall back to their default (code point) order, following Kana and all sorted Kanji. This is roughly by radical and then number of strokes (but this breaks down when the extension and compatibility blocks are considered). – 一二三 Apr 26 '15 at 12:38
In fact for the Chinese case I was told I was sorting wrongly after implementing the CLDR default sorting via the browser/DOM API. English Wiktionary sorts Chinese by Pinyin alphabetical order. I forget which order the CLDR did by default, probably radical/stroke. I did not find out if I could pass any parameter to get a different Chinese sort order. – hippietrail Sep 09 '16 at 12:43
