11

How do you sort Chinese, Japanese and Korean (CJK) characters in Perl?

As far as I can tell, these languages are usually sorted by stroke count and then by radical. There are also methods that sort by sound, but they seem less common.

I've tried using:

perl -e 'print join(" ", sort qw(工 然 一 人 三 古 二 )), "\n";'
# Prints: 一 三 二 人 古 工 然 which is incorrect

And I've tried using Unicode::Collate from CPAN, but it says:

By default, CJK Unified Ideographs are ordered in Unicode codepoint order...

If I could get a database of stroke counts per character, I could easily sort all of the characters, but no such database seems to ship with Perl, nor is it provided by any module I could find.

If you know how to sort CJK in other languages, it would be helpful to mention it in an answer to this question.

hippietrail
Neil
  • This is a silly question. "How do you sort Chinese words?" or "How do you sort Korean words?" would make sense, but "How do you sort CJK characters?" doesn't make any sense. –  Oct 09 '10 at 16:12
  • It makes perfect sense, because in most charmaps that support multiple Asian languages, Chinese, Japanese, and Korean are lumped together into "CJK". – Andy Mar 21 '14 at 19:38

3 Answers

4

See TR38 (Unicode Standard Annex #38, the Unicode Han Database) for the gory details and corner cases. It is not as easy as you might think, or as this code sample makes it look.

use 5.010;
use utf8;
use Encode;
use Unicode::Unihan;
my $u = Unicode::Unihan->new;

# RSUnicode returns "radical.residual" (e.g. "86.8"); split it into its two parts.
say encode_utf8 sprintf "Character $_ has the radical #%s and %d residual strokes.",
    split /[.]/, $u->RSUnicode($_)
    for qw(工 然 一 人 三 古 二);
__END__
Character 工 has the radical #48 and 0 residual strokes.
Character 然 has the radical #86 and 8 residual strokes.
Character 一 has the radical #1 and 0 residual strokes.
Character 人 has the radical #9 and 0 residual strokes.
Character 三 has the radical #1 and 2 residual strokes.
Character 古 has the radical #30 and 2 residual strokes.
Character 二 has the radical #7 and 0 residual strokes.

See http://en.wikipedia.org/wiki/List_of_Kangxi_radicals for a mapping from radical ordinal number to stroke count.
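To show how the two pieces fit together, here is a minimal sketch (in Ruby, with everything hard-coded) that sorts the sample characters by estimated stroke count and then by radical. It assumes that the total stroke count can be approximated as the radical's own stroke count plus the residual strokes; TR38 describes cases where that does not hold, and Unihan also carries a separate kTotalStrokes field. The names RADICAL_STROKES, RS_UNICODE, stroke_key and sort_by_strokes are made up for this illustration, and the radical table covers only the six radicals in the sample.

```ruby
# Radical number => stroke count of the radical itself, copied by hand from the
# Kangxi radical list for the radicals in the sample only (not a full table).
RADICAL_STROKES = { 1 => 1, 7 => 2, 9 => 2, 30 => 3, 48 => 3, 86 => 4 }.freeze

# RSUnicode values ("radical.residual") as printed by the Perl script above.
RS_UNICODE = {
  '工' => '48.0', '然' => '86.8', '一' => '1.0', '人' => '9.0',
  '三' => '1.2', '古' => '30.2', '二' => '7.0'
}.freeze

# Returns [estimated total strokes, radical number] for one character.
# Assumption: total strokes = strokes of the radical + residual strokes.
def stroke_key(char)
  radical, residual = RS_UNICODE.fetch(char).split('.').map(&:to_i)
  [RADICAL_STROKES.fetch(radical) + residual, radical]
end

# Sorts by estimated stroke count, breaking ties by radical number.
def sort_by_strokes(chars)
  chars.sort_by { |c| stroke_key(c) }
end

puts sort_by_strokes(%w[工 然 一 人 三 古 二]).join(' ')
# prints: 一 二 人 三 工 古 然
```

Note that sorting on the radical number first and the residual strokes second gives 一 三 二 人 古 工 然 for this sample, which is the same as the codepoint order.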

daxim
  • Do you know how to use the Unicode::Collate module? Specifically do you know how to pass a sub{} as the overrideCJK parameter, and have it actually run when Unicode::Collate->sort() is run? I could use Unicode::Unihan to get the stroke count and radical info to actually sort characters, but the overrideCJK function doesn't execute. – Neil Oct 08 '10 at 20:28
  • No, but you can [open a new question](http://stackoverflow.com/questions/ask) for that topic. – daxim Oct 08 '10 at 21:04
  • Considering how silly the question is, an answer as silly as this deserves to be accepted. There is no meaning to the notion of "sorting CJK characters". –  Oct 09 '10 at 16:13
  • The bigger part of the question is about sorting by stroke count, which is easily achieved. Don't make me call you a fool. – daxim Oct 09 '10 at 16:22
  • @daxim: Do you have a specific example of where someone has needed or would ever need to sort Chinese characters without regard to the underlying language? It's a silly question, and a silly answer. –  Oct 10 '10 at 00:23
  • @Kinopiko: I meant "sorting CJK phrases", which you need to do in the same situations as when you sort English phrases, such as in the index of a book, or whenever you want to write a list where people can find things. However, to sort phrases you first need to be able to sort characters. – Neil Oct 11 '10 at 04:39
  • @Neil: If you want to sort Japanese phrases, there is an answer for that. If you want to sort Chinese phrases, that is another question. If you want to sort Korean phrases, that is another question. But there is no such thing as "sorting CJK phrases" - it doesn't mean anything to sort words from three different languages. –  Oct 11 '10 at 06:13
2

A Japanese phone book is sorted phonetically (gojûon collation). However, the order of kanji in character sets is not based on phonetics, whether in Unicode, JIS, S-JIS or EUC; only kana are in phonetic order. This means you cannot collate meaningfully without a phonetic conversion!

For example:

a) kanji:           東京駅
b) kana converted:  とうきょうえき
c) romanisation:    tôkyô eki

With b) or c), you can sort meaningfully; with a) alone, you cannot. Of course, you can still run the plain sort function on a), but the result is not meaningful for Japanese.
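As a rough sketch of what sorting on b) looks like, the Ruby snippet below keeps each written form together with its kana reading and sorts on the reading. The readings have to be supplied by hand or by a dictionary, since they cannot be derived from the kanji alone. The names STATIONS and sort_by_reading and the two extra station names are made up for this illustration. Plain string comparison of hiragana only approximates gojûon order (voiced kana, small kana and long-vowel marks need a real collator), so treat this as a starting point rather than a complete solution.

```ruby
# Each entry pairs a written form with its kana reading.
# The reading is the sort key; the written form is what gets displayed.
STATIONS = [
  { written: '東京駅', reading: 'とうきょうえき' },
  { written: '大阪駅', reading: 'おおさかえき' },
  { written: '京都駅', reading: 'きょうとえき' }
].freeze

# Sorts entries by their kana reading using plain codepoint comparison,
# which is only an approximation of gojûon collation.
def sort_by_reading(entries)
  entries.sort_by { |entry| entry[:reading] }
end

puts sort_by_reading(STATIONS).map { |entry| entry[:written] }.join(' ')
# prints: 大阪駅 京都駅 東京駅
```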

daxim
kmugitani
  • That's answering a sane question, "How do you sort Japanese words?", but it doesn't answer the question which was actually asked, so I can't upvote it. –  Oct 09 '10 at 16:14
  • @Kinopiko: Yes, I have to agree with you. The original question is not a good one. – kmugitani Oct 10 '10 at 07:07
2

Check out my Ruby gem toPinyin, which converts UTF-8 encoded Chinese characters to their pinyin (pronunciation). You can then easily sort on the pinyin.

Install it with gem install toPinyin, then:

require 'toPinyin'

words = "
人
没有
理想
跟
咸鱼
有
什么
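区别
".strip.split("\n")  # strip first, so the leading newline does not yield an empty entry

# Compare words by their concatenated pinyin syllables.
words.sort! { |a, b| a.pinyin.join <=> b.pinyin.join }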

https://github.com/pierrchen/toPinyin

pierrotlefou