5

Can somebody please provide some sample code to strip diacritical marks from a UnicodeString using the ICU library in C++? That is, characters having accents, umlauts, etc., should be replaced with their unaccented, unumlauted, etc., equivalents, so every accented é would become a plain ASCII e. E.g.:

UnicodeString strip_diacritics( UnicodeString const &s ) {
    UnicodeString result;
    // ...
    return result;
}

Assume that s has already been normalized. Thanks.

Paul J. Lucas

2 Answers

19

ICU lets you transliterate a string using a specific rule. My rule is `NFD; [:M:] Remove; NFC`, that is: decompose, remove diacritics, recompose. The following code takes a UTF-8 std::string as input and returns another UTF-8 std::string:

#include <memory>
#include <stdexcept>
#include <string>
#include <unicode/utypes.h>
#include <unicode/unistr.h>
#include <unicode/translit.h>

std::string desaxUTF8(const std::string& str) {
    // UTF-8 std::string -> UTF-16 UnicodeString
    icu::UnicodeString source = icu::UnicodeString::fromUTF8(icu::StringPiece(str));

    // Create the transliterator; unique_ptr frees it on every return path
    UErrorCode status = U_ZERO_ERROR;
    std::unique_ptr<icu::Transliterator> accentsConverter(
        icu::Transliterator::createInstance(
            "NFD; [:M:] Remove; NFC", UTRANS_FORWARD, status));
    if (U_FAILURE(status))
        throw std::runtime_error(u_errorName(status));

    // Transliterate the UTF-16 UnicodeString in place
    accentsConverter->transliterate(source);

    // UTF-16 UnicodeString -> UTF-8 std::string
    std::string result;
    source.toUTF8String(result);

    return result;
}
Quentin Pradet
  • Very useful. I would prefer [:Mn:] instead of [:M:], since the latter removes vowel marks in Hindi texts which, I think, are meaningful. – Jyotirmoy Bhattacharya Oct 20 '13 at 16:18
  • @JyotirmoyBhattacharya The distinction Unicode makes is based on layout, not on semantics: this suits your needs for Hindi but is not a good idea overall. (And diacritics provide meaning in many languages.) Thanks for your comment! – Quentin Pradet Oct 21 '13 at 11:12
  • One example where recomposing is necessary is the Hangul Syllables block, `U+AC00 - U+D7AF`. All of them decompose into two or more letters in the Hangul Jamo block, `U+1100 - U+11FF`. For example, `U+AC00` decomposes into `U+1100` and `U+1161`, which are, again, letters (`Lo`) and not marks. – chx Jun 16 '20 at 05:32
  • The one given in https://unicode-org.github.io/icu/userguide/transforms/general/ is `NFD; [:Nonspacing Mark:] Remove; NFC` – Alexey Romanov Oct 21 '20 at 10:50
  • What is `[:M:]` and where did you find it? – arrowd Jun 13 '23 at 09:16
  • It's a Unicode General Category called Mark; see https://en.wikipedia.org/wiki/Unicode_character_property – Quentin Pradet Jul 07 '23 at 06:20
-1

After more searching elsewhere:

UErrorCode status = U_ZERO_ERROR;
UnicodeString result;

// 's16' is the UTF-16 string to have diacritics removed;
// NFKD splits each accented character into a base character plus combining marks
Normalizer::normalize( s16, UNORM_NFKD, 0, result, status );
if ( U_FAILURE( status ) ) {
  // complain
}

// code to convert the normalized UTF-16 'result' to UTF-8 std::string 's8' elided

// keep only the ASCII bytes; this drops the combining marks
// (along with any other non-ASCII character)
string buf8;
buf8.reserve( s8.length() );
for ( string::const_iterator i = s8.begin(); i != s8.end(); ++i ) {
  char const c = *i;
  if ( isascii( c ) )
    buf8.push_back( c );
}
// result is in buf8

This is O(n) in the length of the string.

Paul J. Lucas