I am just getting started implementing ICU transforms using ICU4C in a C++ program. I am particularly looking at transliteration to and from Chinese.

According to this document, the package supports both "Han-Latin" and "Latin-Han" conversion. As a student of Chinese, I find this surprising: Latin-Han conversion is very difficult without highly advanced statistical techniques, and harder still when the input has no tone marks. The closest thing I have seen is Google Transliterate, which does a great job of this even without user input, but it is infeasible for the present project. I am skeptical that Latin-Han conversion is even possible without falling back on the characters conventionally used to transcribe foreign names, as in 比尔·莫瑞. This is the approach taken by Google Maps in their international domains, as we can see in this paper (PDF).

Anyhow, I was willing to suspend disbelief, and after consulting documentation and tutorials, I was able to construct two Transliterator objects (to and from) and perform simple transliteration using them.

Han-Latin worked passably (about 80% accuracy on simple data), but Latin-Han did not seem to work at all: it returned the same Latin string that was input. That is consistent with the results I get from the online transform sample, and with what I know about Chinese. I managed to find this table, which I think is what both transforms are built from, as we can see here:

{ "Latin-Han", "file", "t_Hani_Latn", "REVERSE" },
{ "Han-Latin", "file", "t_Hani_Latn", "FORWARD" },

I would presume this meant that given a pinyin string it could potentially work to reproduce the original, but this does not seem to be the case.

I guess my general question is this: is this kind of transform even possible with ICU, or anything besides Google Transliterate? What is the expected output? Relatedly, is there a listing somewhere of the script-pairs that ICU actually supports, if this is not really possible?

Thank you for your time

Makoto
NatHillard

1 Answer

Note that the data comes from the CLDR project, http://cldr.unicode.org. ICU supports many script pairs, and it will attempt to use a pivot script (such as Han to Latin to Russian), which is why you can create transliterators such as "Any-Latin". You might try browsing the ICU and CLDR data sets. The note at the top of the Han-Latin file says that it does not round trip.

Steven R. Loomis
  • Hello, and thank you for the fast (and authoritative!) response. It is good to know the source of the data, and I will be investigating CLDR for more detail. A more general question remains for me, though: can you, or someone else, provide an example that will produce Han text from Latin, or Latin-like, input? I have tried a myriad of combinations in the online demo, but nothing yields Han text. I am aware of the pivoting, but I can find no pivot route that produces Han script, even outside of a round-trip context. – NatHillard Apr 29 '11 at 23:35
  • You're welcome. I think you are right that it is a difficult problem. It is basically the one faced by input methods, which end up presenting different alternatives to users. You might ask around the CLDR users list. – Steven R. Loomis Apr 29 '11 at 23:48
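
To illustrate why input methods end up presenting alternatives, here is a minimal C++ sketch of the ambiguity involved. It does not use ICU or CLDR data. The table is a tiny hand-picked sample, and the names `sampleTable`, `candidatesFor`, and `combinationCount` are hypothetical, chosen only for this illustration. Each toneless pinyin syllable maps to several common characters, so the number of possible Han strings grows multiplicatively with the length of the input.

```cpp
#include <cstddef>
#include <map>
#include <string>
#include <vector>

// Hypothetical, hand-picked sample: a few toneless pinyin syllables and some
// of the common characters each one can stand for. This is NOT ICU/CLDR data,
// and real syllables typically have many more candidates than listed here.
static const std::map<std::string, std::vector<std::string>>& sampleTable() {
    static const std::map<std::string, std::vector<std::string>> table = {
        {"ma",  {"妈", "麻", "马", "骂", "吗"}},
        {"shi", {"是", "十", "时", "事", "石"}},
        {"li",  {"里", "力", "李", "立"}},
    };
    return table;
}

// Returns every sample candidate for one syllable, or an empty list when the
// syllable is not in the sample table.
std::vector<std::string> candidatesFor(const std::string& syllable) {
    const auto& table = sampleTable();
    const auto it = table.find(syllable);
    if (it == table.end()) {
        return {};
    }
    return it->second;
}

// Counts how many distinct Han strings a syllable sequence could map to when
// each syllable is chosen independently (the product of the candidate counts).
// Returns 0 if any syllable has no candidates in the sample table.
std::size_t combinationCount(const std::vector<std::string>& syllables) {
    std::size_t total = 1;
    for (const auto& syllable : syllables) {
        total *= candidatesFor(syllable).size();
    }
    return total;
}
```

With this sample, `candidatesFor("ma")` yields five characters and `combinationCount({"ma", "li"})` yields 20, which is why a one-to-one reverse rule table cannot choose an answer on its own; a real input method ranks such candidates using context or asks the user to pick.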