
I can get ICU to transliterate to Latin using "Any-Latin", but the output still includes characters, e.g. letters with macrons, that are not in the Latin1 code page. I can get it to transliterate to ASCII using "Any-Latin; Latin-ASCII", but then I lose all the accented characters that are valid Latin1 characters. I need something in between that specifically does "Any-ISO_8859_1".

The only way I can see to do it is to build up a set of custom rules, e.g. convert to Latin and then remove macrons and anything else that is not Latin1:

UnicodeString Latin1_Rules(
    "::Any-Latin; "
    "::nfd; ::[\\u0304] remove; ::nfc;"
    // etc...
    );
// Create a custom Transliterator
icu::Transliterator* trans = icu::Transliterator::createFromRules("Latin1",
    Latin1_Rules,
    UTRANS_FORWARD,
    ...

But I'm not sure what else I would need to remove. This solution also seems clumsy and probably slow, and I'm not sure I'd ever be 100% confident that it was right.

I'm not married to ICU if there is a better (simpler or faster) way, but I am stuck with C/C++.

To be clear, this is not the same question as "Is there a way to convert from UTF8 to iso-8859-1?". That question is only about converting between encodings when the content is already known to contain only iso-8859-1 characters. Conversion maps characters one-to-one and fails for any character not supported by the target encoding.

My question is about transliteration. I want e.g. Chinese characters like 牛 to be transliterated to "Niú".

Unripe
  • possible duplicate of [Is there a way to convert from UTF8 to iso-8859-1?](http://stackoverflow.com/questions/11156473/is-there-a-way-to-convert-from-utf8-to-iso-8859-1) – rubenvb Jan 15 '14 at 10:42
  • See specifically, the currently second answer ([this one](http://stackoverflow.com/a/11156490/256138)) – rubenvb Jan 15 '14 at 10:43
  • That question is about conversion, my question is about transliteration. – Unripe Jan 15 '14 at 10:47
  • How do I remove the link to the question about conversion? – Unripe Jan 15 '14 at 12:09
  • Since Chinese characters can't be mapped to the Latin alphabet, what you are trying to do is accurately called *transcription*, not *transliteration*. It should also be obvious that this can't be done purely algorithmically. You'll need a dictionary which translates Chinese characters to phonemes. – nwellnhof Jan 15 '14 at 12:41
  • Thanks @nwellnhof. I'm using ICU, which has the data to translate Chinese characters to phonemes, and it is working nicely. However, whilst the output is Latin, it is not specifically Latin1. E.g. the Chinese character 拉 is transliterated to "Lā", but the macron character ā is not supported by Latin1. – Unripe Jan 15 '14 at 13:09
  • @nwellnhof yes, it's technically transcription. ICU should have named the functionality "transform" (in fact that is the name of the userguide chapter: http://userguide.icu-project.org/transforms ) but the API is named Transliterator. This process can include Han-Latin (using a table, certainly), but it is not included in ICU by default. – Steven R. Loomis Feb 07 '14 at 18:43
  • The best solution I can come up with is to use the following rules: "::Any-Latin; ::[^\\u0000-\\u00FF] Latin-ASCII; ::[\\u02B0-\\u02FF] remove;" This converts everything to Latin, then converts any non-Latin1 characters to ASCII and finally removes the spacing modifier letters which are inexplicably in the output of Any-Latin. – Unripe Mar 18 '14 at 15:15
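
One way to gain some confidence in a rule set like this is to audit what it produces. The following is a minimal sketch, not part of ICU and not from the original post: the helper name `nonLatin1CodePoints` is hypothetical, and it assumes the transliterator output has already been converted to a UTF-8 `std::string`. It decodes the string and reports every code point above U+00FF, which is everything ISO-8859-1 cannot represent. Running it over a large sample of real input should show which characters the rules still let through.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical audit helper: decode UTF-8 and collect every code point
// above U+00FF, i.e. every character that does not fit in ISO-8859-1.
// Malformed or truncated sequences are reported as U+FFFD.
std::vector<char32_t> nonLatin1CodePoints(const std::string& utf8) {
    std::vector<char32_t> offenders;
    std::size_t i = 0;
    while (i < utf8.size()) {
        const unsigned char lead = static_cast<unsigned char>(utf8[i]);
        char32_t cp = 0;
        std::size_t len = 1;
        if (lead < 0x80) {                    // 1-byte sequence (ASCII)
            cp = lead;
        } else if ((lead & 0xE0) == 0xC0) {   // 2-byte sequence
            cp = lead & 0x1F; len = 2;
        } else if ((lead & 0xF0) == 0xE0) {   // 3-byte sequence
            cp = lead & 0x0F; len = 3;
        } else if ((lead & 0xF8) == 0xF0) {   // 4-byte sequence
            cp = lead & 0x07; len = 4;
        } else {                              // stray continuation or invalid lead byte
            cp = 0xFFFD;
        }
        if (i + len > utf8.size()) {          // sequence cut off at end of input
            offenders.push_back(0xFFFD);
            break;
        }
        bool ok = true;
        for (std::size_t k = 1; k < len; ++k) {
            const unsigned char c = static_cast<unsigned char>(utf8[i + k]);
            if ((c & 0xC0) != 0x80) { ok = false; break; }
            cp = (cp << 6) | (c & 0x3F);      // append 6 payload bits
        }
        if (!ok) { cp = 0xFFFD; len = 1; }    // resync one byte after the bad lead
        if (cp > 0xFF) offenders.push_back(cp);
        i += len;
    }
    return offenders;
}
```

For example, the UTF-8 bytes for "Lā" yield a single entry, U+0101, while "Niú" yields an empty list because ú (U+00FA) is inside Latin1. An empty result for every test string is evidence, though not proof, that the rules cover the input seen so far.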
