I can get ICU to transliterate to Latin using "Any-Latin", but the result still includes characters (e.g. letters with macrons) that are not in the Latin1 code page. I can get it to transliterate to ASCII using "Any-Latin; Latin-ASCII", but then I lose all the accented characters that are valid Latin1 characters. I need something in between that specifically does "Any-ISO_8859_1".
The only way I can see to do it is to build up a set of custom rules, e.g. convert to Latin and then remove macrons and anything else that is not in Latin1:
icu::UnicodeString Latin1_Rules(
    "::Any-Latin; "
    "::nfd; ::[\\u0304] remove; ::nfc;"
    // etc...
);
// Create a custom Transliterator
icu::Transliterator* trans = icu::Transliterator::createFromRules("Latin1",
    Latin1_Rules,
    UTRANS_FORWARD,
    ...
But I'm not sure what else I would need to remove. This solution also seems very clumsy and probably very slow, and I'm not sure I'd ever be 100% confident that it is right.
I'm not married to ICU if there is a better (simpler/faster) way. But I am stuck with C/C++.
To be clear, this is not the same question as "Is there a way to convert from UTF8 to iso-8859-1?" That question is just about converting between encodings when the content is already known to contain only iso-8859-1 characters. Such a conversion maps characters one-to-one and fails for any character not supported by the target encoding.
My question is about transliteration. I want e.g. Chinese characters like 牛 to be transliterated to "Niú".
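To illustrate the distinction, here is a minimal sketch of what plain encoding conversion does, written without ICU. The function name utf8_to_latin1 is hypothetical (it is not an ICU or standard library API), and the sketch assumes the input is meant to be UTF-8. It maps each code point up to U+00FF to one Latin1 byte and simply reports failure for anything else, which is exactly the behaviour I do not want:

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper, for illustration only (not part of ICU).
// Converts UTF-8 to ISO-8859-1 one code point at a time.
// Returns false if the input is malformed or contains a code point
// above U+00FF; in that case `out` holds only the part converted so far.
bool utf8_to_latin1(const std::string& in, std::string& out) {
    out.clear();
    for (std::size_t i = 0; i < in.size();) {
        const unsigned char b0 = static_cast<unsigned char>(in[i]);
        if (b0 < 0x80) {
            // One-byte sequence: plain ASCII, identical in Latin1.
            out.push_back(static_cast<char>(b0));
            i += 1;
        } else if ((b0 == 0xC2 || b0 == 0xC3) && i + 1 < in.size()) {
            // Two-byte sequences with lead byte C2 or C3 cover
            // exactly U+0080..U+00FF.
            const unsigned char b1 = static_cast<unsigned char>(in[i + 1]);
            if ((b1 & 0xC0) != 0x80) {
                return false;  // not a valid continuation byte
            }
            out.push_back(static_cast<char>(((b0 & 0x1F) << 6) | (b1 & 0x3F)));
            i += 2;
        } else {
            // Any other lead byte is either malformed or encodes a
            // code point that Latin1 cannot represent.
            return false;
        }
    }
    return true;
}
```

With this sketch, "Niú" converts successfully because ú (U+00FA) exists in Latin1, while 牛 or ū makes the function return false instead of producing a Latin1 approximation. Transliteration is what would fill that gap.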