"Better" primarily means accuracy, but I am also interested in any other criteria in which other systems excel. I sampled the Perl binding Text::Kakasi for correctness in an admittedly limited fashion and it works just fine for our needs.

use utf8;
use Encode;
use Text::Kakasi;
use Unicode::Collate;

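# Kakasi converter: UTF-8 in and out, kanji (J) to hiragana (H).
my $k = Text::Kakasi->new(qw(-iutf8 -outf8 -JH));
my $c = Unicode::Collate->new;

# Schwartzian transform: pair each line with its kana reading,
# sort by the reading, then recover the original line.
print encode_utf8 $_ for
    map  { $_->[0] }
    sort { $c->cmp($a->[1], $b->[1]) }
    map  { [$_, $k->get($_)] }
    <DATA>;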

__DATA__
アメリカ合衆国
アラブ首長国連邦
ロシア連邦
中国
南アフリカ共和国
日本
北京(ペキン)
大阪
東京
dda
daxim
  • First, Kakasi is a converter that changes kanji into kana or romaji. It has nothing to do with collation. Do you want to find a better converter from kanji to kana? That isn't what you asked. Second, in what order do you want to sort the words? If you output the words sorted by the Unicode values of the kana, you will get an order different from the one found in a Japanese dictionary. –  Oct 10 '10 at 05:19
  • It sure must take some effort to deliberately misunderstand the question topic and totally ignore the sample program! – daxim Oct 10 '10 at 13:44
  • I'm not really good with Perl, but is that for sorting purposes? – Bogdan Maxim Nov 18 '10 at 14:47
  • If it is, then here is a helper: http://stackoverflow.com/questions/3891556/how-do-you-sort-cjk-asian-characters-in-perl-or-with-any-other-programming-lan – Bogdan Maxim Nov 18 '10 at 14:47

3 Answers

The only other (serious) open-source conversion tool I know of is N-gram, not the most explicit name... It has huge dictionaries, and might be better than Kakasi. But I haven't seen any comparisons out there.

EDIT:

I gave some thought to the notion of "betterness" of one library over others in this context. One thing that could be done is to take the dictionaries of N-gram and run them against Kakasi. If Kakasi fails to convert some of N-gram's entries, it could be said that N-gram is better because its lexicon is richer, which would improve the accuracy of the collation.

However, since the corpus of Kanji-based words (which need to be converted into kana to be collated properly) is not finite - family names among others are a big problem, as they can be read almost any way you can imagine - there can't be a solution that provides 100% coverage. But the OP asked for a "better" solution, not a perfect one...
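
To make the coverage comparison above a bit more concrete, here is a minimal sketch. It assumes that a failed conversion leaves kanji in the output, which I have not verified for Kakasi. The `unconverted` helper and the toy converter are hypothetical stand-ins; with a real setup you would pass something like `sub { $k->get($_[0]) }` instead, and load the word list from the other tool's dictionary.

```perl
use strict;
use warnings;
use utf8;

# Return the words whose converted form still contains kanji,
# i.e. the words the converter could not fully turn into kana.
sub unconverted {
    my ($convert, @words) = @_;
    return grep { $convert->($_) =~ /\p{Han}/ } @words;
}

# Hypothetical stand-in for a real converter such as Text::Kakasi:
# it only knows two words and passes everything else through.
my %reading = ('日本' => 'にほん', '東京' => 'とうきょう');
my $toy_convert = sub { $reading{$_[0]} // $_[0] };

my @missed = unconverted($toy_convert, '日本', '東京', '大阪');

binmode STDOUT, ':encoding(UTF-8)';
print "$_\n" for @missed;
```

The fraction of missed entries would give a rough coverage figure, though it says nothing about whether the readings that were produced are the right ones.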

dda

Considering that all Kakasi does is pull kana/romaji for specific Japanese strings from the supplied dictionaries, you can hardly get anything more precise. Precision depends on the quality of the dictionaries used.

Oleg V. Volkov
  • Kakasi bundles a dictionary. You are not answering my question whether there is something better. This answer is not useful. – daxim May 22 '12 at 12:50

I am not sure about the meaning of 'authoritative'.

But I can say that Kakasi is a well-known freeware library and is still not obsolete today.

If you can convert kanji strings to hiragana (or katakana) strings with Kakasi, the resulting sort order should be fine.

http://www.utf8-chartable.de/unicode-utf8-table.pl
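
One caveat, sketched below with core modules only: if the kana readings mix hiragana and katakana, a plain code point sort and a collation-aware sort disagree, because all hiragana code points come before the katakana ones. The sample words are my own illustration, not output from Kakasi.

```perl
use strict;
use warnings;
use utf8;
use Unicode::Collate;

# Kana readings in mixed scripts: one katakana word, three hiragana words.
my @words = ('とうきょう', 'アメリカ', 'にほん', 'おおさか');

# Plain sort compares code points, so every hiragana string
# comes before every katakana string.
my @by_codepoint = sort @words;

# Unicode::Collate treats hiragana and katakana as equal at the
# primary level, so the words interleave by their sound.
my @by_collation = Unicode::Collate->new->sort(@words);

binmode STDOUT, ':encoding(UTF-8)';
print "code point: @by_codepoint\n";
print "collation:  @by_collation\n";
```

With the code point sort アメリカ ends up last, while the collator puts it first. If everything is converted to a single kana script beforehand, the two orders should mostly agree.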

kmugitani
  • I was not asking whether the kakasi library is obsolete, but whether there is something better. – daxim Oct 10 '10 at 13:37