
I am trying to write a program that can transliterate CJK text to Latin script (e.g. Pinyin, Romaji). For example, you give it a Chinese, Japanese or Korean document as input and you get the Latin transliteration as output.

I am new to this field, so please bear with me.

Obviously, I first need to detect the language (Chinese, Japanese or Korean) before going any further. Then, as I understand it so far, I need to divide the text into words, since these languages do not put spaces between words. This is called word segmentation. Finally, once I have the words, I need to transliterate them into Latin.

So here are my questions:

  1. There are a few libraries that do transliteration. Since I'm looking for open source ones in C/C++, I found Adson (Chinese only) and ICU4C. The Adson Git repo I cloned didn't compile, and I was not able to find a simple, straightforward tutorial for ICU4C. Where can I find a tutorial on ICU4C usage? Do you know of any other library that transliterates CJK to Latin? If its accuracy is high (around 90%), I don't mind it not being written in C++.
mrz
  • For Korean it's quite simple since they don't use Chinese characters anymore. I have Python and JavaScript code that you could translate to C(++) if you wish. – dda Dec 05 '12 at 08:23
  • Also, Korean *uses* whitespace; spaces are just as important in Korean as in English. – dda Dec 05 '12 at 08:24
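
To illustrate why Korean is the simpler case, here is a minimal C++ sketch that romanizes precomposed Hangul syllables arithmetically. It relies on the Unicode layout of the Hangul Syllables block (U+AC00 to U+D7A3: 19 initials, 21 vowels, 28 finals) and on per-letter spellings taken from Revised Romanization. The function name `romanizeHangul` and the table names are made up for this example, and the mapping is deliberately simplified: it treats each syllable in isolation and ignores sound changes across syllable boundaries, so it will not match the official romanization for many words.

```cpp
#include <string>

// Letter spellings per jamo, listed in Unicode composition order.
// Simplified assumption: finals use their syllable-final pronunciation.
static const char* const kInitial[19] = {
    "g", "kk", "n", "d", "tt", "r", "m", "b", "pp", "s",
    "ss", "", "j", "jj", "ch", "k", "t", "p", "h"};
static const char* const kVowel[21] = {
    "a", "ae", "ya", "yae", "eo", "e", "yeo", "ye", "o", "wa", "wae",
    "oe", "yo", "u", "wo", "we", "wi", "yu", "eu", "ui", "i"};
static const char* const kFinal[28] = {
    "", "k", "k", "k", "n", "n", "n", "t", "l", "k", "m", "l", "l", "l",
    "p", "l", "m", "p", "p", "t", "t", "ng", "t", "t", "k", "t", "p", "t"};

// Hypothetical helper: romanize Hangul syllables one at a time.
// ASCII passes through unchanged; anything else becomes '?'.
std::string romanizeHangul(const std::u32string& text) {
    std::string out;
    for (char32_t c : text) {
        if (c >= 0xAC00 && c <= 0xD7A3) {
            // index = (initial * 21 + vowel) * 28 + final
            const unsigned index = static_cast<unsigned>(c - 0xAC00);
            out += kInitial[index / 588];         // 588 = 21 * 28
            out += kVowel[(index % 588) / 28];
            out += kFinal[index % 28];
        } else if (c < 0x80) {
            out += static_cast<char>(c);
        } else {
            out += '?';
        }
    }
    return out;
}
```

For example, `romanizeHangul(U"서울")` gives `seoul`, but `romanizeHangul(U"한국어")` gives `hangukeo` where the official spelling is `hangugeo`, because this sketch does not carry a final consonant over to a following vowel.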

1 Answer

ICU: there are examples in the user guide at http://userguide.icu-project.org/transforms/general and ICU 50 now has CJK word segmentation. The uconv sample can run text through the Any-Latin transform with something like uconv -f utf-8 -t utf-8 -x 'Any-Latin'. That transform doesn't take language into account, though.

Steven R. Loomis
  • Thank you, I'll go through the link you posted, but do you have any suggestions on how I can detect which language a text is in? – mrz Nov 20 '12 at 06:08
  • See the closures: you should separate your question into multiple questions. There are no short answers on language detection; you need a corpus and/or specialized code. It is impossible to guess the right answer for a short string like "三", for example. Is it "sān" or "san" or something else? – Steven R. Loomis Nov 20 '12 at 23:44
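
As a rough illustration of what script-based guessing can and cannot do, here is a small C++ sketch that looks only at Unicode block ranges. It is not real language detection (which, as noted above, needs a corpus or specialized code). The names `CjkGuess` and `guessCjkLanguage` are invented for this example, and the rule "Hangul means Korean, kana means Japanese, otherwise Han-only" is an assumption that fails on mixed or very short input.

```cpp
#include <string>

// Hypothetical result type for this sketch (not from any library).
enum class CjkGuess { Unknown, HanOnly, Japanese, Korean };

// Guess the language from which scripts occur in the text.
// HanOnly means only Han ideographs were seen: probably Chinese, but it
// could equally be Japanese or Korean text with no kana or Hangul in it.
CjkGuess guessCjkLanguage(const std::u32string& text) {
    bool hasHan = false, hasKana = false, hasHangul = false;
    for (char32_t c : text) {
        if (c >= 0x3040 && c <= 0x30FF) {
            hasKana = true;    // Hiragana and Katakana blocks
        } else if ((c >= 0xAC00 && c <= 0xD7A3) ||   // Hangul Syllables
                   (c >= 0x1100 && c <= 0x11FF) ||   // Hangul Jamo
                   (c >= 0x3130 && c <= 0x318F)) {   // Compatibility Jamo
            hasHangul = true;
        } else if ((c >= 0x4E00 && c <= 0x9FFF) ||   // CJK Unified Ideographs
                   (c >= 0x3400 && c <= 0x4DBF)) {   // Extension A
            hasHan = true;
        }
    }
    if (hasHangul) return CjkGuess::Korean;
    if (hasKana) return CjkGuess::Japanese;
    if (hasHan) return CjkGuess::HanOnly;
    return CjkGuess::Unknown;
}
```

With this, `guessCjkLanguage(U"三")` returns `CjkGuess::HanOnly`, which is exactly the ambiguous case: the script alone cannot say which reading is intended.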