I'm trying to write a string 'clean-up' function that allows only alphanumeric characters, plus a few others, such as the underscore, period and the minus (dash) character.
Currently our function iterates over the source string character by character, but I'm trying to convert it to RegEx because, from what I've been reading, it is much cleaner and more performant. (That seems backwards to me compared with a straight iteration, but I can't profile it until I have a working RegEx.)
The problem is two-fold for me. One, I know the following regex...
[a-zA-Z0-9]
...matches a range of alphanumeric characters, but how do I also include the underscore, period and the minus character? Do you simply escape them with the '\' character and put them between the brackets with the rest?
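To make the first part concrete, here is a minimal sketch in Python (the language is my assumption, since the post doesn't name one, and `ALLOWED` and `is_allowed` are hypothetical names). In most regex flavors the period and underscore are already literal inside brackets, and the minus is literal when it is placed last (or first), so no backslashes should be needed:

```python
import re

# Inside a character class, '.' and '_' are plain literals.
# '-' is literal here because it is the last character in the class;
# anywhere else it would need escaping (or it would define a range).
ALLOWED = re.compile(r"[a-zA-Z0-9_.-]")

def is_allowed(ch: str) -> bool:
    """Return True if ch is exactly one character from the allowed set."""
    return ALLOWED.fullmatch(ch) is not None
```

Escaping them (`[a-zA-Z0-9_\.\-]`) should also work; it is just noisier.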
Second, any character that isn't part of the match (e.g. other punctuation like '?') should be replaced with an underscore.
My thinking is that instead of matching on a range of desired characters, we match on a single character that's not in the desired range, then replace that. I think the RegEx for that is to include the caret as the first character between the brackets like this...
[^a-zA-Z0-9]
Is that the correct approach?
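For illustration, this is roughly what I have in mind, sketched in Python (again an assumed language; `DISALLOWED` and `clean_up` are hypothetical names). Note that the negated class has to list the underscore, period and minus as well, otherwise those would be replaced too:

```python
import re

# '^' as the first character negates the class: this matches any single
# character that is NOT a letter, digit, underscore, period or minus.
DISALLOWED = re.compile(r"[^a-zA-Z0-9_.-]")

def clean_up(text: str) -> str:
    """Replace every disallowed character with an underscore."""
    return DISALLOWED.sub("_", text)

print(clean_up("what? file-name_1.txt"))  # what__file-name_1.txt
```

Each disallowed character becomes its own underscore, so the '?' and the space above produce two underscores in a row. Adding `+` after the closing bracket would collapse a run of them into one, if that is preferred.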