Code to strip diacritical marks using ICU

Question

Can somebody please provide some sample code to strip diacritical marks (i.e., replace characters having accents, umlauts, etc., with their unaccented, unumlauted, etc., character equivalents, e.g., every accented é would become a plain ASCII e) from a UnicodeString using the ICU library in C++? E.g.:

UnicodeString strip_diacritics( UnicodeString const &s ) {
    UnicodeString result;
    // ...
    return result;
}

Assume that s has already been normalized. Thanks.

Duplicate of: http://stackoverflow.com/questions/331279/how-to-change-diacritic-characters-to-non-diacritic-ones ? — Ben Burnett, Jun 07 '10 at 18:28
Neither that question nor any given answers use the ICU library. — Paul J. Lucas, Jun 07 '10 at 18:51
So what? The essential step is to decompose the string, then filter out the diacritics. Use the Normalizer2 class. — Hans Passant, Jun 07 '10 at 19:36
And I'm asking for exactly such a code snippet that "uses the Normalizer2 class." — Paul J. Lucas, Jun 07 '10 at 21:43

Quentin Pradet · Answer 1 · 2013-09-23T08:22:25.880

score 19

ICU lets you transliterate a string using a specific rule. My rule is NFD; [:M:] Remove; NFC: decompose, remove diacritics, recompose. The following code takes a UTF-8 std::string as input and returns another UTF-8 std::string:

#include <memory>
#include <string>
#include <unicode/utypes.h>
#include <unicode/unistr.h>
#include <unicode/translit.h>

std::string desaxUTF8(const std::string& str) {
    // UTF-8 std::string -> UTF-16 UnicodeString
    icu::UnicodeString source = icu::UnicodeString::fromUTF8(icu::StringPiece(str));

    // Create the transliterator. createInstance() hands ownership to the
    // caller, so keep it in a unique_ptr to avoid leaking it.
    UErrorCode status = U_ZERO_ERROR;
    std::unique_ptr<icu::Transliterator> accentsConverter(
        icu::Transliterator::createInstance(
            "NFD; [:M:] Remove; NFC", UTRANS_FORWARD, status));
    if (U_FAILURE(status) || !accentsConverter) {
        // One possible policy: return the input unchanged on failure
        return str;
    }

    // Transliterate UTF-16 UnicodeString in place
    accentsConverter->transliterate(source);

    // UTF-16 UnicodeString -> UTF-8 std::string
    std::string result;
    source.toUTF8String(result);

    return result;
}

edited Sep 23 '13 at 08:22

answered Oct 25 '12 at 14:45


Very useful. I would prefer [:Mn:] instead of [:M:] since the latter removes vowel marks in Hindi texts which, I think, are meaningful. – Jyotirmoy Bhattacharya Oct 20 '13 at 16:18
@JyotirmoyBhattacharya The distinction Unicode makes is based on layout, not on semantics: this suits your needs for Hindi but is not a good idea overall. (And diacritics provide meaning in many languages.) Thanks for your comment! – Quentin Pradet Oct 21 '13 at 11:12
One example where recompose is necessary is the Hangul Syllables block, `U+AC00 - U+D7AF`. All of them decompose into two or more letters in the Hangul Jamo block, `U+1100 - U+11FF`. For example `U+AC00` decomposes into `U+1100` and `U+1161` which are, again, letters (`Lo`) and not marks. – chx Jun 16 '20 at 05:32
The one given in https://unicode-org.github.io/icu/userguide/transforms/general/ is `NFD; [:Nonspacing Mark:] Remove; NFC` – Alexey Romanov Oct 21 '20 at 10:50
What is `[:M:]` and where did you find it? – arrowd Jun 13 '23 at 09:16
It's a Unicode General Category called Mark, see https://en.wikipedia.org/wiki/Unicode_character_property – Quentin Pradet Jul 07 '23 at 06:20
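
To make the Hangul comment above concrete, here is a small ICU-free sketch of the arithmetic that canonical decomposition applies to precomposed Hangul syllables. The constants are the ones given in the Unicode Standard; the function name `decompose_hangul` is made up for this illustration and is not an ICU API. It shows why NFD turns a single syllable into two or three jamo letters, none of which match `[:M:]`, so the trailing NFC step is what puts them back together.

```cpp
#include <vector>

// Hypothetical helper (not part of ICU): canonical decomposition of one
// precomposed Hangul syllable into its conjoining jamo, using the constants
// from the Unicode Standard. Any other code point is returned unchanged.
std::vector<char32_t> decompose_hangul(char32_t s) {
    constexpr char32_t SBase = 0xAC00, LBase = 0x1100, VBase = 0x1161, TBase = 0x11A7;
    constexpr char32_t VCount = 21, TCount = 28;
    constexpr char32_t NCount = VCount * TCount;  // 588 syllables per leading consonant
    constexpr char32_t SCount = 19 * NCount;      // 11172 precomposed syllables

    if (s < SBase || s >= SBase + SCount) {
        return {s};  // not a precomposed Hangul syllable
    }

    const char32_t index = s - SBase;
    std::vector<char32_t> jamo;
    jamo.push_back(LBase + index / NCount);             // leading consonant
    jamo.push_back(VBase + (index % NCount) / TCount);  // vowel
    const char32_t t = index % TCount;                  // 0 means no trailing consonant
    if (t != 0) {
        jamo.push_back(TBase + t);
    }
    return jamo;
}
```

For `U+AC00` this yields `U+1100 U+1161`, matching the comment; a syllable with a final consonant such as `U+D55C` yields three jamo.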

score -1 · Accepted Answer · answered Jun 08 '10 at 02:02

After more searching elsewhere:

UErrorCode status = U_ZERO_ERROR;
UnicodeString result;

// 's16' is the UTF-16 string to have diacritics removed;
// the decomposed (NFKD) form is written to 'result'
Normalizer::normalize( s16, UNORM_NFKD, 0, result, status );
if ( U_FAILURE( status ) ) {
  // complain
}

// code to convert the normalized UTF-16 'result' to UTF-8 std::string 's8' elided

// keep only the ASCII bytes; every non-ASCII byte is dropped
string buf8;
buf8.reserve( s8.length() );
for ( string::const_iterator i = s8.begin(); i != s8.end(); ++i ) {
  char const c = *i;
  if ( isascii( static_cast<unsigned char>( c ) ) )
    buf8.push_back( c );
}
// result is in buf8

which is O(n).

You don't want to remove anything non-ASCII, just diacritics. This code only works on a few languages. — Quentin Pradet, Oct 25 '12 at 14:46
