
I'm looking for a clean way to decompose ligatures (e.g. œ -> oe) in a Unicode string. Is there a way to do this without enumerating all the rules one by one? Something like this rule for removing diacritical marks but for ligatures instead.

Adrian Mole
rch

1 Answer


Reading the ICU documentation on transforms, it seems you need this one:

Latin-ASCII

So I suppose doing something similar to the answer in your linked question should work (code is untested):

#include <memory>
#include <string>

#include <unicode/utypes.h>
#include <unicode/unistr.h>
#include <unicode/translit.h>

std::string asciiifyUTF8(const std::string& str)
{
    // UTF-8 std::string -> UTF-16 UnicodeString
    icu::UnicodeString source = icu::UnicodeString::fromUTF8(icu::StringPiece(str));

    // Create the transliterator; createInstance returns a pointer the caller owns
    UErrorCode status = U_ZERO_ERROR;
    std::unique_ptr<icu::Transliterator> asciiConverter(
        icu::Transliterator::createInstance("Latin-ASCII", UTRANS_FORWARD, status));
    if (U_FAILURE(status) || !asciiConverter)
        return str; // transform unavailable: return the input unchanged

    // Transliterate the UTF-16 UnicodeString in place
    asciiConverter->transliterate(source);

    // UTF-16 UnicodeString -> UTF-8 std::string
    std::string result;
    source.toUTF8String(result);

    return result;
}

Note that this will also strip accents and the like, since the output is limited to ASCII characters. You can filter the input the transform applies to, but the (French) ligature characters don't sit in any dedicated region (and there are really only two of them in French). Browsing through the Unicode characters, there seem to be many more ligatures in various other scripts, so the above suggestion might not be what you are looking for.
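If stripping accents is unacceptable and only a handful of ligatures matter, one fallback is a small explicit table. This is the enumeration the question hoped to avoid, and it is not an ICU feature: `decomposeLatinLigatures` is a hypothetical helper written for illustration, and the four entries are an assumption about which ligatures are relevant. A minimal sketch using only the standard library (C++17):

```cpp
#include <array>
#include <cstddef>
#include <string>
#include <string_view>
#include <utility>

// Hypothetical helper (not part of ICU): replaces a fixed set of Latin
// ligatures in a UTF-8 string and copies every other byte unchanged.
std::string decomposeLatinLigatures(std::string_view input)
{
    // Each pair maps the UTF-8 encoding of a ligature to its ASCII expansion.
    // The choice of entries is an assumption; extend the table as needed.
    static constexpr std::array<std::pair<std::string_view, std::string_view>, 4> table{{
        {"\xC5\x93", "oe"},  // U+0153 LATIN SMALL LIGATURE OE
        {"\xC5\x92", "OE"},  // U+0152 LATIN CAPITAL LIGATURE OE
        {"\xC3\xA6", "ae"},  // U+00E6 LATIN SMALL LETTER AE
        {"\xC3\x86", "AE"},  // U+00C6 LATIN CAPITAL LETTER AE
    }};

    std::string result;
    result.reserve(input.size());

    std::size_t i = 0;
    while (i < input.size()) {
        bool replaced = false;
        for (const auto& [ligature, expansion] : table) {
            // Byte-wise matching is safe in UTF-8: a lead byte can never
            // appear in the middle of another character's encoding.
            if (input.compare(i, ligature.size(), ligature) == 0) {
                result += expansion;
                i += ligature.size();
                replaced = true;
                break;
            }
        }
        if (!replaced)
            result += input[i++];
    }
    return result;
}
```

With this table, `decomposeLatinLigatures("cœur")` yields `"coeur"`, while accented letters such as `é` pass through untouched. It does nothing for ligatures outside the table or in other scripts.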

rubenvb
  • This doesn't exactly answer the question (since this rule does a lot more than decomposing ligatures) but it happens to fit my needs perfectly, thanks a lot! I'm just not sure if I should accept it... – rch Sep 01 '21 at 10:31
  • It will definitely screw with a lot of things besides the ligatures. I can't find any other transformation that does exactly (and only) what you want though, the closest is compatibility normalization, but that also changes a lot of other characters which is equally unintended... – rubenvb Sep 01 '21 at 10:37
  • I appear to be looking for the same thing, but really it should convert all ligatures for all scripts, not just Latin-script languages, so it should handle non-ASCII code points as well. Possibly the `NFKC` and/or `NFKD` [transformations](http://www.unicode.org/reports/tr15/#Compatibility_Composite_Figure) could be what you were looking for, but I still have to test it. – ceztko Mar 22 '23 at 15:02