1

In the following block of code, accented letters are not recognized (I fall into the "else" branch):

           StringBuilder sb = new StringBuilder();
           foreach (char c in str) {
              if ((c >= '0' && c <= '9') || (c >= 'A' && c <= 'Z') || (c >= 'a' && c <= 'z') || c == '.' || c == '_') {
                 sb.Append(c);
              }
              else
              {
                 // if c is accented, execution arrives here
              }
           }

What can I do to ignore accents? Thanks for your help.

FieryA

2 Answers

8

Consider using char.IsLetterOrDigit(c).

Indicates whether the specified Unicode character is categorized as a letter or a decimal digit.

if (char.IsLetterOrDigit(c) || c == '.' || c == '_') {
    sb.Append(c);
}

The function returns true for any letter, including accented ones.
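As a minimal, self-contained sketch of this check, the snippet below wraps the loop in a hypothetical helper named `KeepAllowed` (the name is not from the question) and assumes top-level statements (C# 9 or later). Note that accented letters are kept as they are; they are not converted to their unaccented form.

```csharp
using System;
using System.Text;

// Hypothetical helper: keeps letters (accented or not), digits, '.' and '_'.
static string KeepAllowed(string str)
{
    var sb = new StringBuilder();
    foreach (char c in str)
    {
        if (char.IsLetterOrDigit(c) || c == '.' || c == '_')
        {
            sb.Append(c);
        }
    }
    return sb.ToString();
}

// "\u00E9" is the precomposed 'é'; it is kept, while the '!' is dropped.
Console.WriteLine(KeepAllowed("Aim\u00E9e_2015.txt!"));
```

The input uses the escape `\u00E9` so that the example does not depend on how the source file encodes the accented character.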

sstan
2

How about just cleaning up the strings by removing accents and diacritics?

public string RemoveAccentsAndDiacritics(string s)
{
    return string.Concat(
        s.Normalize(NormalizationForm.FormD)
         .Where(c => System.Globalization.CharUnicodeInfo.GetUnicodeCategory(c) !=
                     System.Globalization.UnicodeCategory.NonSpacingMark));
}
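
To spell out what the method above does, here is a step-by-step sketch of the same idea. The name `StripAccents` and the final recomposition to `FormC` are illustrative additions, not part of the answer, and the snippet assumes top-level statements (C# 9 or later).

```csharp
using System;
using System.Globalization;
using System.Linq;
using System.Text;

// Hypothetical helper mirroring the method above, one step at a time.
static string StripAccents(string s)
{
    // 1. Decompose: "\u00E9" (é) becomes 'e' followed by U+0301 (combining acute accent).
    string decomposed = s.Normalize(NormalizationForm.FormD);

    // 2. Drop the combining marks (category NonSpacingMark), keeping the base letters.
    var kept = decomposed.Where(c =>
        CharUnicodeInfo.GetUnicodeCategory(c) != UnicodeCategory.NonSpacingMark);

    // 3. Recompose whatever is left (an extra step, not in the answer above).
    return string.Concat(kept).Normalize(NormalizationForm.FormC);
}

// Length is 2 after decomposition: 'e' plus the combining accent.
Console.WriteLine("\u00E9".Normalize(NormalizationForm.FormD).Length);
Console.WriteLine(StripAccents("Aim\u00E9e")); // Aimee
```

Normalization on its own removes nothing. It only separates the accent from the base letter so that the filter can drop it.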
spender
  • Could you explain what this does? I'm not a *great* .NET developer but I've been using it for a good while. I'd have to go and look up almost everything here; what is `Normalize` or `NormalizationForm` or what exactly is included in `NonSpacingMark`? Compared to sstan's answer this seems remarkably over-engineered and is not very readable. Don't get me wrong, I'm not afraid to go and read documentation on MSDN, but an answer like this leaves a lot of guesswork up to the question asker. This is the kind of answer that has a dev writing comments like `//leave this here it works somehow` – sab669 Sep 17 '15 at 14:23
  • Sure, but the difference is that `ááá` and `ééé` don't yield the same value here. Unicode normalization forms are described [here](http://unicode.org/faq/normalization.html). Broadly speaking, they're used to compare canonical representations of strings, such that strings like `Aimée` and `Aimee` could be compared meaningfully. FormD helps by decomposing each accented character into a base letter followed by combining marks, which the `Where` filter then strips. I don't quite think this covers the OP's needs, but you might well choose to do this in combination with @sstan's answer so that less information is lost when you do your string cleaning. – spender Sep 17 '15 at 14:28
  • @sab669 A Unicode non-spacing mark is one that modifies the current character rather than one that sits in its own space. In "typing" terms, the carriage doesn't advance when a non-spacing character is encountered. Stripping `SpacingCombiningMark` chars might be an idea too... – spender Sep 17 '15 at 14:36
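
The combination suggested in the comments could look like the sketch below: decompose first, then apply the whitelist from sstan's answer. The helper name `CleanName` is hypothetical, dropping `SpacingCombiningMark` characters as well is a judgment call, and the snippet assumes top-level statements (C# 9 or later).

```csharp
using System;
using System.Globalization;
using System.Text;

// Hypothetical helper: decompose, then keep only letters, digits, '.' and '_'.
static string CleanName(string str)
{
    var sb = new StringBuilder();
    foreach (char c in str.Normalize(NormalizationForm.FormD))
    {
        UnicodeCategory category = CharUnicodeInfo.GetUnicodeCategory(c);

        // Skip combining marks explicitly. char.IsLetterOrDigit would reject
        // them anyway; the check is here only to make the intent visible.
        if (category == UnicodeCategory.NonSpacingMark ||
            category == UnicodeCategory.SpacingCombiningMark)
        {
            continue;
        }

        if (char.IsLetterOrDigit(c) || c == '.' || c == '_')
        {
            sb.Append(c);
        }
    }
    return sb.ToString();
}

// Accents are stripped; the space and apostrophe are dropped by the whitelist.
Console.WriteLine(CleanName("Aim\u00E9e d'\u00E9t\u00E9.txt")); // Aimeedete.txt
```

With this ordering an accented letter survives as its base letter. It is neither discarded, as in the original loop, nor kept with its accent, as with `char.IsLetterOrDigit` alone.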