Ignore accented letters while filtering a string in C#

Question

In the following block of code, accented letters are not recognized (I fall into the "else" branch):

           StringBuilder sb = new StringBuilder();
           foreach (char c in str) {
              if ((c >= '0' && c <= '9') || (c >= 'A' && c <= 'Z') || (c >= 'a' && c <= 'z') || c == '.' || c == '_') {
                 sb.Append(c);
              }
              else
              {
                 // if c is accented, I arrive here
              }
           }

What can I do to ignore accents? Thanks for your help.

For example, if a letter is accented I want it to be in my StringBuilder! – FieryA Sep 17 '15 at 14:10
http://stackoverflow.com/questions/359827/ignoring-accented-letters-in-string-comparison – Matteo Umili Sep 17 '15 at 14:12
So if `str` were equal to `ÀÁÂÃÄÅ BBBBBB` you'd still want that entire string, except for the space, to be added to the string builder? – sab669 Sep 17 '15 at 14:13

sstan · Accepted Answer · 2015-09-17T14:24:55.703

8

Consider using char.IsLetterOrDigit(c).

Indicates whether the specified Unicode character is categorized as a letter or a decimal digit.

if (char.IsLetterOrDigit(c) || c == '.' || c == '_') {
    sb.Append(c);
}

The function returns true for any letter, including accented ones.
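
For completeness, here is a minimal self-contained sketch of the loop from the question with this check swapped in. The class and method names (`StringFilter`, `KeepLettersDigitsDotUnderscore`) are made up for illustration. Keep in mind that `char.IsLetterOrDigit` accepts letters and decimal digits from any script, so it is broader than the original ASCII ranges.

```csharp
using System;
using System.Text;

public static class StringFilter
{
    // Hypothetical helper name, not from the question.
    // Keeps letters (accented or not), decimal digits, '.' and '_'.
    public static string KeepLettersDigitsDotUnderscore(string str)
    {
        var sb = new StringBuilder(str.Length);
        foreach (char c in str)
        {
            if (char.IsLetterOrDigit(c) || c == '.' || c == '_')
            {
                sb.Append(c);
            }
            // Everything else (spaces, punctuation, symbols) is dropped.
        }
        return sb.ToString();
    }
}
```

With precomposed input such as `"café au_lait.txt!"`, the space and the `!` are dropped and the `é` is kept. If the input is in decomposed form (base letter followed by a separate combining accent), the combining mark is not a letter, so it is dropped and only the base letter survives.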

edited Sep 17 '15 at 14:24

answered Sep 17 '15 at 14:13

sstan

score 2 · Answer 2 · answered Sep 17 '15 at 14:18

How about just cleaning up the strings by removing accents and diacritics?

// Requires: using System.Linq; using System.Text;
public string RemoveAccentsAndDiacritics(string s)
{
    return string.Concat(
        s.Normalize(NormalizationForm.FormD)
         .Where(c => System.Globalization.CharUnicodeInfo.GetUnicodeCategory(c) !=
                     System.Globalization.UnicodeCategory.NonSpacingMark));
}
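
A rough sketch of how this might be called, wrapped in a hypothetical `AccentStripper` class (my name, not part of the answer) so that it compiles on its own. The intermediate variable is only there to show the two steps: `FormD` splits each accented character into a base letter plus combining marks, and the `Where` then drops the marks.

```csharp
using System;
using System.Globalization;
using System.Linq;
using System.Text;

public static class AccentStripper
{
    public static string RemoveAccentsAndDiacritics(string s)
    {
        // FormD (canonical decomposition) turns 'é' into 'e' + U+0301 (combining acute accent).
        string decomposed = s.Normalize(NormalizationForm.FormD);

        // Drop the combining marks and keep the base characters.
        return string.Concat(
            decomposed.Where(c => CharUnicodeInfo.GetUnicodeCategory(c) !=
                                  UnicodeCategory.NonSpacingMark));
    }
}
```

For example, `AccentStripper.RemoveAccentsAndDiacritics("Aimée")` gives `Aimee`. Letters that have no canonical decomposition, such as `ø` or `ß`, pass through unchanged.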

answered Sep 17 '15 at 14:18

spender

Could you explain what this does? I'm not a *great* .NET developer but I've been using it for a good while. I'd have to go and look up almost everything here; what is `Normalize` or `NormalizationForm` or what exactly is included in `NonSpacingMark`? Compared to sstan's answer this seems remarkably over-engineered and is not very readable. Don't get me wrong, I'm not afraid to go and read documentation on MSDN, but as an answer this leaves a lot of guesswork up to the question asker. This is the kind of answer that has a dev writing comments like `//leave this here it works somehow` – sab669 Sep 17 '15 at 14:23
Sure, but the difference being that `ááá` and `ééé` don't yield the same value here. Unicode normalization forms are described [here](http://unicode.org/faq/normalization.html). Broadly speaking, they're used to compare canonical representations of strings, such that strings like `Aimée` and `Aimee` could be compared meaningfully. FormD helps here by decomposing each accented character into a base character followed by combining marks, which the filter then removes. I don't quite think this covers OP's needs, but you might well choose to do this in combination with @sstan's answer so that less information is lost when you do your string cleaning. – spender Sep 17 '15 at 14:28
@sab669 A Unicode non-spacing mark is one that modifies the current character rather than one that sits in its own space. In "typing" terms, the carriage doesn't advance when a non-spacing character is encountered. Stripping `SpacingCombiningMark` chars might be an idea too... – spender Sep 17 '15 at 14:36
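
A sketch of the combination suggested in the comments above, under the assumption that the goal is to keep the base letter of an accented character rather than drop it. The names `CombinedFilter` and `StripAccentsThenFilter` are hypothetical. Because neither `NonSpacingMark` nor `SpacingCombiningMark` characters count as letters or digits, the `char.IsLetterOrDigit` check removes them without a separate category test once the string is in `FormD`.

```csharp
using System;
using System.Text;

public static class CombinedFilter
{
    // Hypothetical helper: decompose first, then apply the letter/digit filter.
    public static string StripAccentsThenFilter(string str)
    {
        var sb = new StringBuilder(str.Length);

        // After FormD, 'é' is 'e' followed by U+0301. The combining mark is not
        // a letter or digit, so the check below drops it and keeps the base 'e'.
        foreach (char c in str.Normalize(NormalizationForm.FormD))
        {
            if (char.IsLetterOrDigit(c) || c == '.' || c == '_')
            {
                sb.Append(c);
            }
        }
        return sb.ToString();
    }
}
```

With this, `"Aimée Dupont_1.txt"` becomes `AimeeDupont_1.txt`: the accent and the space are removed, and everything else is kept.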
