6

I'm trying to write a string 'clean-up' function that allows only alphanumeric characters, plus a few others, such as the underscore, period and the minus (dash) character.

Currently our function uses straight char iteration of the source string, but I'm trying to convert it to RegEx because from what I've been reading, it is much cleaner and more performant (which seems backwards to me over a straight iteration, but I can't profile it until I get a working RegEx.)

The problem is two-fold for me. One, I know the following regex...

[a-zA-Z0-9]

...matches a range of alphanumeric characters, but how do I also include the underscore, period and the minus character? Do you simply escape them with the '\' character and put them between the brackets with the rest?

Second, for any character that isn't part of the match (i.e. other punctuation like '?') we would like it replaced with an underscore.

My thinking is that instead of matching on a range of desired characters, we match on a single character that's not in the desired range, then replace that. I think the RegEx for that is to include the caret as the first character between the brackets like this...

[^a-zA-Z0-9]

Is that the correct approach?

Mark A. Donohoe

4 Answers

7

Probably the most efficient way to do this is to set up a static Regex that describes the characters that you want to replace.

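using System.Text.RegularExpressions;

public static class StringCleaner
{
    // Matches any single character that is NOT a letter, digit, '.', '_' or '-'.
    public static readonly Regex invalidChars = new Regex(@"[^A-Z0-9._\-]", RegexOptions.Compiled | RegexOptions.IgnoreCase);

    public static string ReplaceInvalidChars(string input)
    {
        return invalidChars.Replace(input, "_");
    }
}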

However, if you don't want the Regex to replace line ends and whitespace (like spaces and tabs) you'll need to use a slightly different expression.

public static Regex invalidChars = new Regex(@"[^A-Z0-9._\-\s]", RegexOptions.Compiled | RegexOptions.IgnoreCase);
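To make the difference between the two expressions concrete, here is a minimal sketch that runs both against the same input. The class and method names (WhitespaceDemo, CleanStrict, CleanKeepingWhitespace) are made up for illustration and are not part of the answer's code.

```csharp
using System.Text.RegularExpressions;

// Hypothetical helper, for illustration only.
public static class WhitespaceDemo
{
    // Replaces everything except letters, digits, '.', '_' and '-'.
    private static readonly Regex Strict =
        new Regex(@"[^A-Z0-9._\-]", RegexOptions.IgnoreCase);

    // Same allowed set, but \s also lets spaces, tabs and line ends through.
    private static readonly Regex KeepWhitespace =
        new Regex(@"[^A-Z0-9._\-\s]", RegexOptions.IgnoreCase);

    public static string CleanStrict(string input)
    {
        return Strict.Replace(input, "_");
    }

    public static string CleanKeepingWhitespace(string input)
    {
        return KeepWhitespace.Replace(input, "_");
    }
}
```

With the input "my file?.txt", CleanStrict should give "my_file_.txt" (the space and the '?' are both replaced), while CleanKeepingWhitespace should give "my file_.txt" (only the '?' is replaced).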

Also, here are the rules for what you must escape to match the literal character:

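Inside a set denoted by square brackets you must escape \ and ] wherever they occur, - unless it is the first or last character of the set, and ^ only if it appears in the first position of the set. Outside of a set you must escape these characters to match the literal character: \.$^|{}[]()*+? (and # as well if you use RegexOptions.IgnorePatternWhitespace). Escaping a character that doesn't strictly need it, such as \- at the end of a set, is harmless.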
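As a rough illustration of those rules, the sketch below contrasts an unescaped dot, a pattern built with Regex.Escape (a real .NET method that backslash-escapes regex metacharacters), and a dot inside a set. The EscapeDemo class and its method names are hypothetical.

```csharp
using System.Text.RegularExpressions;

// Hypothetical helper, for illustration only.
public static class EscapeDemo
{
    // Outside a set, an unescaped '.' matches any character except a newline.
    public static bool BareDotMatches(string s)
    {
        return Regex.IsMatch(s, @"^a.c$");
    }

    // Regex.Escape("a.c") yields a\.c, so only the literal text "a.c" matches.
    public static bool LiteralMatches(string s)
    {
        return Regex.IsMatch(s, "^" + Regex.Escape("a.c") + "$");
    }

    // Inside a set, '.' is already literal and needs no backslash.
    public static bool DotInSetMatches(string s)
    {
        return Regex.IsMatch(s, @"^[.]$");
    }
}
```

Here BareDotMatches("abc") should be true, LiteralMatches("abc") false, and LiteralMatches("a.c") true. If you build patterns from user input, Regex.Escape saves you from remembering the list by hand.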

See the following documentation for more information:

JamieSee
  • Actually, I *do* want it to replace all whitespace and line endings (and line beginnings!). They aren't valid characters so your first one is correct. However, correct me if I'm wrong, but you're starting your literal strings with the '@' character, which to me looks like Objective C, not C#. ...or am I missing something? – Mark A. Donohoe Jul 09 '13 at 19:43
  • I'm missing something! :) I now know starting a string with '@' in C# basically escapes the entire string for you. I like it! You get the accepted answer for your completeness. Thanks! :) – Mark A. Donohoe Jul 09 '13 at 19:48
  • Yes, starting a string with @" in C# makes it literal. Here's the part of the language spec that explains it: http://msdn.microsoft.com/en-us/library/aa691090(v=VS.71).aspx – JamieSee Jul 09 '13 at 21:48
3

If you are trying to remove characters that you don't want, you'd be better served by Regex.Replace:

string cleaned = Regex.Replace(input, "[^a-zA-Z0-9_.]|-", "_");

To include the '-' character you can just use the Regex OR (|) to add that character as an alternative. There is probably a way to include it in the character class, but it's escaping me at the moment.

Edit: You don't actually need to explicitly include the hyphen, because it doesn't match the class anyway. That is, if you want to replace hyphen with underscore, just use [^a-zA-Z0-9_.] as your class... anything that doesn't match that class will get replaced. But the correct way to include a hyphen in a class is to escape it with a backslash (\-), or you can put it at the beginning or end of the class list: [^-a-zA-Z0-9_.].
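A small sketch of how these spellings behave, assuming the goal from the question (keep the hyphen, replace everything else). The HyphenDemo class and its method names are hypothetical. The three class-based patterns should be equivalent, while the alternation pattern from the top of this answer replaces hyphens as well.

```csharp
using System.Text.RegularExpressions;

// Hypothetical helper, for illustration only.
public static class HyphenDemo
{
    // Hyphen escaped with a backslash inside the class: hyphens are kept.
    public static string Escaped(string s)
    {
        return Regex.Replace(s, @"[^a-zA-Z0-9_.\-]", "_");
    }

    // Hyphen placed first (right after the negating ^): hyphens are kept.
    public static string Leading(string s)
    {
        return Regex.Replace(s, "[^-a-zA-Z0-9_.]", "_");
    }

    // Hyphen placed last: hyphens are kept.
    public static string Trailing(string s)
    {
        return Regex.Replace(s, "[^a-zA-Z0-9_.-]", "_");
    }

    // The alternation form: '-' is matched by the second branch,
    // so hyphens are replaced too.
    public static string Alternation(string s)
    {
        return Regex.Replace(s, "[^a-zA-Z0-9_.]|-", "_");
    }
}
```

For the input "a-b c?", the first three should return "a-b_c_" and the alternation form should return "a_b_c_".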

Michael Bray
0

I think it would be perfect to use the Replace method of the string.

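public string StringClean(string source, char replacement, char[] targets)
{
  foreach(char c in targets)
  {
    // Swap every occurrence of this unwanted character for the replacement.
    source = source.Replace(c, replacement);
  }
  return source;
}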

(Not in VS so maybe not perfect code)

Péter
0

If you need to replace all characters that are not in your described pattern with an underscore, do this:

string result = Regex.Replace(YourOriginalString, "[^a-zA-Z0-9_.-]", "_");
terrybozzio