15

The question

How can you change all accented letters to normal letters in C++ (or in C)?

By that, I mean something like eéèêaàäâçc would become eeeeaaaacc.

What I've already tried

I've tried just parsing the string manually and replacing each one of them one by one, but I was thinking there has to be a better/simpler way that I am not aware of (one that would guarantee I do not forget any accented letter).

I am wondering if there is already a map somewhere in the standard library, or if all the accented characters can easily be mapped to the "normal" letter using some mathematical function (e.g. floor(charCode-131/5) + 61).

OneMore

7 Answers

12
char* removeAccented( char* str ) {
    // Lookup table for ISO 8859-1 (Latin-1) codes 192..255, in this order:
    //   "ÀÁÂÃÄÅÆÇÈÉÊËÌÍÎÏÐÑÒÓÔÕÖ×ØÙÚÛÜÝÞßàáâãäåæçèéêëìíîïðñòóôõö÷øùúûüýþÿ"
    static const char* tr =
        "AAAAAAECEEEEIIIIDNOOOOOx0UUUUYPsaaaaaaeceeeeiiiiOnooooo/0uuuuypy";
    char *p = str;
    while ( (*p)!=0 ) {
        unsigned char ch = (*p);
        if ( ch >=192 ) {
            (*p) = tr[ ch-192 ];
        }
        ++p; // http://stackoverflow.com/questions/14094621/
    }
    return str;
}
Adolfo
  • This looks like an elegant solution, but you should check that ch < 256, because on some platforms char can exceed 255 and then the array will overflow for characters such as Žž – Pavel Jiri Strnad Aug 10 '23 at 06:36
8

You should first define what you mean by "accented letters". What has to be done differs greatly depending on whether you have, say, an extended 8-bit ASCII with a national codepage for codes above 128, or a UTF-8 encoded string.

In any case, you should have a look at libicu, which provides what is necessary for good Unicode-based manipulation of accented letters.

But it won't solve all problems for you. For instance, what should you do if you get a Chinese or Russian letter? What should you do if you get the Turkish uppercase I with a dot? Remove the dot on this "I"? Doing so would change the meaning of the text, etc. These kinds of problems are endless with Unicode. Even the conventional sorting order depends on the country...

kriss
  • 1
    Using ICU is the correct answer. It is already installed and available on every modern system, even devices like your iPhone and Android. You only need to link to it to use it. (How to do that depends on the system/OS, of course.) – Dúthomhas Nov 27 '22 at 02:46
7

I know it only in theory. Basically, you perform Unicode normalization, then some decomposition, purge all diacritics, and recompose again.
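
As a minimal sketch of the "purge all diacritics" step only: the function below assumes the input is valid UTF-8 that has already been decomposed (NFD) by a library such as ICU, and it removes only the Combining Diacritical Marks block U+0300..U+036F. The name `strip_combining_marks` is made up for this illustration, and the normalization and recomposition steps are not shown.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: drops code points U+0300..U+036F from a UTF-8 string.
// Assumption: the input is valid UTF-8 and already in decomposed form (NFD).
std::string strip_combining_marks(const std::string& in) {
    std::string out;
    out.reserve(in.size());
    for (std::size_t i = 0; i < in.size(); ) {
        unsigned char b0 = static_cast<unsigned char>(in[i]);
        if (i + 1 < in.size()) {
            unsigned char b1 = static_cast<unsigned char>(in[i + 1]);
            // In UTF-8, U+0300..U+033F is CC 80..CC BF
            // and U+0340..U+036F is CD 80..CD AF.
            bool combining = (b0 == 0xCC && b1 >= 0x80 && b1 <= 0xBF) ||
                             (b0 == 0xCD && b1 >= 0x80 && b1 <= 0xAF);
            if (combining) {
                i += 2;  // skip the two-byte combining mark
                continue;
            }
        }
        out.push_back(in[i]);  // copy every other byte unchanged
        ++i;
    }
    return out;
}
```

With this sketch, a decomposed "e" followed by U+0301 (bytes 65 CC 81) comes out as a plain "e", while a precomposed "é" (bytes C3 A9) is left untouched, which is why the decomposition step has to run first.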

Joker_vD
2

Assuming the values are just chars, I'd create an array with the desired target values and then just replace each character with the corresponding member in the array:

// needs <algorithm>, <numeric> and <string>; s is a std::string
unsigned char replacement[256];
std::iota(replacement, replacement + 256, 0); // start from the identity mapping
// 0xE9 is 'é' in ISO 8859-1; a literal 'é' only works if the source file uses that encoding
replacement[0xE9] = 'e';
// ...
std::transform(s.begin(), s.end(), s.begin(),
               [&](unsigned char c){ return replacement[c]; });

Since the question is also tagged with C: when using C you'd need to write suitable loops to do the same operations, but conceptually it would work the same way. Similarly, if you can't use C++11, you'd just use suitable function objects instead of the lambda functions.

Obviously, the replacement array can be set up just once, and with a smarter approach than what is outlined above. However, the principle should work. If you need to replace Unicode characters, things become a bit more interesting, though: for one, the array would be fairly large, and in addition a character may need multiple words (code units) to be changed.

Dietmar Kühl
  • "Assuming" --- that's one big assumption. – n. m. could be an AI Dec 30 '12 at 21:45
  • 2
    They aren't `char`s, at least if the input is UTF-8, which it should be. – phihag Dec 31 '12 at 01:54
  • @phihag: It seems the characters mentioned are covered by ISO/IEC 8859-1 (ISO Latin1). If this is sufficient why bother with UTF-8? Also, the statement about the formula seems to indicate a somewhat simplistic approach to represent these characters. – Dietmar Kühl Dec 31 '12 at 02:00
  • @DietmarKühl The characters may be *covered* by Latin-1, but are not necessarily *encoded* in Latin-1. – phihag Dec 31 '12 at 02:03
2

Here is what you can do using ISO/IEC 8859-1 (ASCII-based standard character encoding):

  • if code range is from 192 - 197 replace with A
  • if code range is from 224 - 229 replace with a
  • if code range is from 200 - 203 replace with E
  • if code range is from 232 - 235 replace with e
  • if code range is from 204 - 207 replace with I
  • if code range is from 236 - 239 replace with i
  • if code range is from 210 - 214 replace with O
  • if code range is from 242 - 246 replace with o
  • if code range is from 217 - 220 replace with U
  • if code range is from 249 - 252 replace with u

Supposing x is the character code, perform the following for capital letters:

  • y = floor((x - 192) / 6)
  • if y <= 2 then z = ((y + 1) * 4) + 61 else z = (y * 6) + 61

Perform the following for small letters:

  • y = floor((x - 224) / 6)
  • if y <= 2 then z = ((y + 1) * 4) + 93 else z = (y * 6) + 93

The final answer z is the ASCII code of the required letter.
Note that this method works only if you are using ISO/IEC 8859-1.
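
The ranges and the formula above could be combined roughly like this. It is only a sketch: the name `strip_latin1_vowel` is invented for the example, the input is assumed to be a single ISO/IEC 8859-1 byte, and the explicit range check is an addition, since the formula alone would also rewrite codes outside the listed ranges (for example it would turn Ç, code 199, into E).

```cpp
// Hypothetical helper: maps one ISO/IEC 8859-1 byte to its unaccented ASCII
// vowel using the ranges and formula from the answer; other bytes are
// returned unchanged.
unsigned char strip_latin1_vowel(unsigned char x) {
    // True when x lies in the inclusive range [lo, hi].
    auto in = [x](int lo, int hi) { return x >= lo && x <= hi; };

    // The formula is only meaningful inside the listed ranges.
    bool upper = in(192, 197) || in(200, 203) || in(204, 207) ||
                 in(210, 214) || in(217, 220);
    bool lower = in(224, 229) || in(232, 235) || in(236, 239) ||
                 in(242, 246) || in(249, 252);
    if (!upper && !lower) {
        return x;
    }

    int base = upper ? 192 : 224;   // first accented capital / small letter
    int offset = upper ? 61 : 93;   // constant from the formula
    int y = (x - base) / 6;         // integer division acts as floor here
    int z = (y <= 2) ? (y + 1) * 4 + offset : y * 6 + offset;
    return static_cast<unsigned char>(z);
}
```

For instance, é (code 233) gives y = 1 and z = 101, which is 'e'.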

bane
  • This worked for me in Java, I guess it should work in C++. Tell me if it doesn't. – bane Dec 30 '12 at 21:51
  • 2
    It will certainly work with [ASCII](http://en.wikipedia.org/wiki/Ascii) in a trivial way, because there are no characters greater than 127 there. It may work with [something that is not ASCII](http://en.wikipedia.org/wiki/ISO/IEC_8859-1), and it will probably not work with [other](http://en.wikipedia.org/wiki/ISO/IEC_8859-2) [things](http://en.wikipedia.org/wiki/ISO/IEC_8859-3) [that](http://en.wikipedia.org/wiki/ISO/IEC_8859-4) [are](http://en.wikipedia.org/wiki/ISO/IEC_8859-14) [not ASCII](http://en.wikipedia.org/wiki/ISO/IEC_8859-16). – n. m. could be an AI Dec 30 '12 at 22:07
  • 2
    If we are restricted to the 0-255 range I would probably use a lookup table instead of all those `if`s. – Matteo Italia Dec 30 '12 at 23:12
1

I am afraid there is no easy way around this.

In the application I work on, this was solved by using internal codepage tables: each codepage table (1250, 1251, 1252, etc.) contained the actual codepage letter and its non-diacritic equivalent. The tables were auto-generated using C#, which has some classes that make this really easy (with some heuristics, actually); Java also allows implementing it quickly.

This was actually for multibyte data with codepages, but it could be used for Unicode strings (by just searching all tables for the given Unicode letter).

marcinj
0

My use case was needing to do a case-insensitive sort of a long list of strings, where some of the strings might have diacritics. So for instance I wanted "Añasco Municipio" to come right before "Anchorage Municipality", instead of coming right before "Abbeville County" as it was doing with a naive comparison.

My strings are encoded in UTF-8, but there's a chance that they might contain some extended ASCII characters instead of proper UTF-8 Unicode. I could have promoted all strings to UTF-8, and then used a library that could do UTF-8 string comparison, but I wanted to have full control both for speed and for deciding exactly how diacritical characters are mapped to non-diacritical characters. (My choices include things like treating the masculine ordinal indicator as "o", and treating the copyright character as c.)

The "two-byte" codes below are UTF-8 sequences. The "one-byte" codes are extended ASCII.

This is where I got the codes:

http://www.ascii-code.com/

http://www.endmemo.com/unicode/unicodeconverter.php

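#include <string>
using std::string;

// Forward declarations: both helpers are defined after SimplifyStringForSorting.
unsigned char SimplifyDoubleCharForSorting( unsigned char c1, unsigned char c2, bool changeToLowerCase );
unsigned char SimplifySingleCharForSorting( unsigned char c, bool changeToLowerCase );

void SimplifyStringForSorting( string *s, bool changeToLowerCase )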
{
    // C0 C1 C2 C3 C4 C5 E0 E1 E2 E3 E4 E5 AA // one-byte codes for "a"
    // C3 80 C3 81 C3 82 C3 83 C3 84 C3 85 C3 A0 C3 A1 C3 A2 C3 A3 C3 A4 C3 A5 C2 AA // two-byte codes for "a"
    
    // C8 C9 CA CB E8 E9 EA EB // one-byte codes for "e"
    // C3 88 C3 89 C3 8A C3 8B C3 A8 C3 A9 C3 AA C3 AB // two-byte codes for "e"
    
    // CC CD CE CF EC ED EE EF // one-byte codes for "i"
    // C3 8C C3 8D C3 8E C3 8F C3 AC C3 AD C3 AE C3 AF // two-byte codes for "i"
    
    // D2 D3 D4 D5 D6 F2 F3 F4 F5 F6 BA // one-byte codes for "o"
    // C3 92 C3 93 C3 94 C3 95 C3 96 C3 B2 C3 B3 C3 B4 C3 B5 C3 B6 C2 BA // two-byte codes for "o"
    
    // D9 DA DB DC F9 FA FB FC // one-byte codes for "u"
    // C3 99 C3 9A C3 9B C3 9C C3 B9 C3 BA C3 BB C3 BC // two-byte codes for "u"
    
    // A9 C7 E7 // one-byte codes for "c"
    // C2 A9 C3 87 C3 A7 // two-byte codes for "c"
    
    // D1 F1 // one-byte codes for "n"
    // C3 91 C3 B1 // two-byte codes for "n"
    
    // AE // one-byte codes for "r"
    // C2 AE // two-byte codes for "r"
    
    // DF // one-byte codes for "s"
    // C3 9F // two-byte codes for "s"
    
    // 8E 9E // one-byte codes for "z"
    // C5 BD C5 BE // two-byte codes for "z"
    
    // 9F DD FD FF // one-byte codes for "y"
    // C5 B8 C3 9D C3 BD C3 BF // two-byte codes for "y"
    
    int n = s->size();
    int pos = 0;
    for ( int i = 0 ; i < n ; i++, pos++ )
    {
        unsigned char c = (unsigned char)s->at( i );
        if ( c >= 0x80 )
        {
            if ( i < ( n - 1 ) && (unsigned char)s->at( i + 1 ) >= 0x80 )
            {
                unsigned char c2 = SimplifyDoubleCharForSorting( c, (unsigned char)s->at( i + 1 ), changeToLowerCase );
                if ( c2 < 0x80 )
                {
                    s->at( pos ) = c2;
                    i++;
                }
                else
                {
                    // s->at( pos ) = SimplifySingleCharForSorting( c, changeToLowerCase );
                    // if it's a double code we don't recognize, skip both characters;
                    // this does mean that we lose the chance to handle back-to-back extended ascii characters
                    // but we'll assume that is less likely than a unicode "combining character" or other
                    // unrecognized unicode character for data
                    i++;
                    // nothing was written for this pair, so keep the output position
                    // where it is (the loop increment would otherwise leave a stale byte)
                    pos--;
                }
            }
            else
            {
                unsigned char c2 = SimplifySingleCharForSorting( c, changeToLowerCase );
                if ( c2 < 0x80 )
                {
                    s->at( pos ) = c2;
                }
                else
                {
                    // skip unrecognized single-byte codes
                    pos--;
                }
            }
        }
        else
        {
            if ( changeToLowerCase && c >= 'A' && c <= 'Z' )
            {
                s->at( pos ) = c + ( 'a' - 'A' );
            }
            else
            {
                s->at( pos ) = c;
            }
        }
    }
    if ( pos < n )
    {
        s->resize( pos );
    }
}

unsigned char SimplifyDoubleCharForSorting( unsigned char c1, unsigned char c2, bool changeToLowerCase )
{
    // C3 80 C3 81 C3 82 C3 83 C3 84 C3 85 C3 A0 C3 A1 C3 A2 C3 A3 C3 A4 C3 A5 C2 AA // two-byte codes for "a"
    // C3 88 C3 89 C3 8A C3 8B C3 A8 C3 A9 C3 AA C3 AB // two-byte codes for "e"
    // C3 8C C3 8D C3 8E C3 8F C3 AC C3 AD C3 AE C3 AF // two-byte codes for "i"
    // C3 92 C3 93 C3 94 C3 95 C3 96 C3 B2 C3 B3 C3 B4 C3 B5 C3 B6 C2 BA // two-byte codes for "o"
    // C3 99 C3 9A C3 9B C3 9C C3 B9 C3 BA C3 BB C3 BC // two-byte codes for "u"
    // C2 A9 C3 87 C3 A7 // two-byte codes for "c"
    // C3 91 C3 B1 // two-byte codes for "n"
    // C2 AE // two-byte codes for "r"
    // C3 9F // two-byte codes for "s"
    // C5 BD C5 BE // two-byte codes for "z"
    // C5 B8 C3 9D C3 BD C3 BF // two-byte codes for "y"
    
    if ( c1 == 0xC2 )
    {
        if ( c2 == 0xAA ) { return 'a'; }
        if ( c2 == 0xBA ) { return 'o'; }
        if ( c2 == 0xA9 ) { return 'c'; }
        if ( c2 == 0xAE ) { return 'r'; }
    }
    
    if ( c1 == 0xC3 )
    {
        if ( c2 >= 0x80 && c2 <= 0x85 ) { return changeToLowerCase ? 'a' : 'A'; }
        if ( c2 >= 0xA0 && c2 <= 0xA5 ) { return 'a'; }
        if ( c2 >= 0x88 && c2 <= 0x8B ) { return changeToLowerCase ? 'e' : 'E'; }
        if ( c2 >= 0xA8 && c2 <= 0xAB ) { return 'e'; }
        if ( c2 >= 0x8C && c2 <= 0x8F ) { return changeToLowerCase ? 'i' : 'I'; }
        if ( c2 >= 0xAC && c2 <= 0xAF ) { return 'i'; }
        if ( c2 >= 0x92 && c2 <= 0x96 ) { return changeToLowerCase ? 'o' : 'O'; }
        if ( c2 >= 0xB2 && c2 <= 0xB6 ) { return 'o'; }
        if ( c2 >= 0x99 && c2 <= 0x9C ) { return changeToLowerCase ? 'u' : 'U'; }
        if ( c2 >= 0xB9 && c2 <= 0xBC ) { return 'u'; }
        if ( c2 == 0x87 ) { return changeToLowerCase ? 'c' : 'C'; }
        if ( c2 == 0xA7 ) { return 'c'; }
        if ( c2 == 0x91 ) { return changeToLowerCase ? 'n' : 'N'; }
        if ( c2 == 0xB1 ) { return 'n'; }
        if ( c2 == 0x9F ) { return 's'; }
        if ( c2 == 0x9D ) { return changeToLowerCase ? 'y' : 'Y'; }
        if ( c2 == 0xBD || c2 == 0xBF ) { return 'y'; }
    }
    
    if ( c1 == 0xC5 )
    {
        if ( c2 == 0xBD ) { return changeToLowerCase ? 'z' : 'Z'; }
        if ( c2 == 0xBE ) { return 'z'; }
        if ( c2 == 0xB8 ) { return changeToLowerCase ? 'y' : 'Y'; }
    }
    
    return c1;
}

unsigned char SimplifySingleCharForSorting( unsigned char c, bool changeToLowerCase )
{
    // C0 C1 C2 C3 C4 C5 E0 E1 E2 E3 E4 E5 AA // one-byte codes for "a"
    // C8 C9 CA CB E8 E9 EA EB // one-byte codes for "e"
    // CC CD CE CF EC ED EE EF // one-byte codes for "i"
    // D2 D3 D4 D5 D6 F2 F3 F4 F5 F6 BA // one-byte codes for "o"
    // D9 DA DB DC F9 FA FB FC // one-byte codes for "u"
    // A9 C7 E7 // one-byte codes for "c"
    // D1 F1 // one-byte codes for "n"
    // AE // one-byte codes for "r"
    // DF // one-byte codes for "s"
    // 8E 9E // one-byte codes for "z"
    // 9F DD FD FF // one-byte codes for "y"
    
    if ( ( c >= 0xC0 && c <= 0xC5 ) || ( c >= 0xE0 && c <= 0xE5 ) || c == 0xAA )
    {
        return ( ( c >= 0xC0 && c <= 0xC5 ) && !changeToLowerCase ) ? 'A' : 'a';
    }
    
    if ( ( c >= 0xC8 && c <= 0xCB ) || ( c >= 0xE8 && c <= 0xEB ) )
    {
        return ( c > 0xCB || changeToLowerCase ) ? 'e' : 'E';
    }
    
    if ( ( c >= 0xCC && c <= 0xCF ) || ( c >= 0xEC && c <= 0xEF ) )
    {
        return ( c > 0xCF || changeToLowerCase ) ? 'i' : 'I';
    }
    
    if ( ( c >= 0xD2 && c <= 0xD6 ) || ( c >= 0xF2 && c <= 0xF6 ) || c == 0xBA )
    {
        return ( ( c >= 0xD2 && c <= 0xD6 ) && !changeToLowerCase ) ? 'O' : 'o';
    }
    
    if ( ( c >= 0xD9 && c <= 0xDC ) || ( c >= 0xF9 && c <= 0xFC ) )
    {
        return ( c > 0xDC || changeToLowerCase ) ? 'u' : 'U';
    }
    
    if ( c == 0xA9 || c == 0xC7 || c == 0xE7 )
    {
        return ( c == 0xC7 && !changeToLowerCase ) ? 'C' : 'c';
    }
    
    if ( c == 0xD1 || c == 0xF1 )
    {
        return ( c == 0xD1 && !changeToLowerCase ) ? 'N' : 'n';
    }
    
    if ( c == 0xAE )
    {
        return 'r';
    }
    
    if ( c == 0xDF )
    {
        return 's';
    }
    
    if ( c == 0x8E || c == 0x9E )
    {
        return ( c == 0x8E && !changeToLowerCase ) ? 'Z' : 'z';
    }
    
    if ( c == 0x9F || c == 0xDD || c == 0xFD || c == 0xFF )
    {
        return ( ( c == 0x9F || c == 0xDD ) && !changeToLowerCase ) ? 'Y' : 'y';
    }
    
    return c;
}
M Katz
  • That doesn't look like it handles combining characters. – melpomene Jul 02 '16 at 09:31
  • @melpomene: Then again, with UTF8 input, one can immediately remove all combining accents because there are no combining accents that map to an accent-less character in the lower ASCII range. (And thus you could argue that it should remove all Chinese characters, since *none* of these map to lower ASCII.) – Jongware Jul 02 '16 at 10:37
  • melpomene and Rad Lexus: Thanks, yeah, I wasn't thinking about combining characters. I have modified the code to skip unrecognized two-character sequences (see comment in code), which will drop the combining codes, as well as other unrecognized two-byte sequences. I also now ignore unrecognized single-byte sequences, and hopefully this will handle Chinese, three-byte sequences, and so on, in a reasonable way. I realize all of this is strange trying to handle UTF-8 and extended ascii together, but hopefully it works well for English-centric sorting. – M Katz Jul 02 '16 at 18:14
  • By the way, do you have a sense for how common combining characters are out there "in the wild" as compared to precomposed characters? My guess is that combining characters are relatively rare for the "normal" extended-ascii type characters like accents and graves on vowels, etc.? – M Katz Jul 02 '16 at 18:20
  • Nice work; exactly what I was looking for! Question (for your long ago code): after the call to SimplifySingleChar, shouldn't `s->at( pos );` be `s->at( pos ) = c2;`? – mackworth Nov 25 '22 at 20:48
  • @mackworth, yes, sorry about that. I had actually fixed it long ago in my codebase but failed to fix it here. It's fixed now. – M Katz Nov 27 '22 at 02:23