There seems to be a bug in C# regexes. In particular, the regex "[ -_]" matches capital letters. Does anyone know if this is indeed a bug? It certainly seems so to me.

Buggy Code

using System;

public class Program
{
    public static void Main()
    {
        Console.WriteLine(System.Text.RegularExpressions.Regex.Replace("Aa-_", "[ -_]", "x"));
    }
}

Output: xaxx
Expected: Aaxx

Non Buggy Code

using System;

public class Program
{
    public static void Main()
    {
        Console.WriteLine(System.Text.RegularExpressions.Regex.Replace("Aa-_", "[ _-]", "x"));
    }
}

Output = Expected: Aaxx

Notes

I used https://dotnetfiddle.net/ to evaluate my expressions. I got the same results as my local VS.

Colm Bhandal
  • To match a space, `_` or `-`, use `@"[ \-_]"` or `@"[- _]"` or `@"[ _-]"` – Wiktor Stribiżew May 09 '19 at 08:29
  • @WiktorStribiżew thanks I was using your suggestion #3. But why do you need to escape the dash when it precedes the underscore, and not when it precedes the space or occurs as the last char in the regex? Does "-_" mean something special? – Colm Bhandal May 09 '19 at 08:32
  • `-` inside character classes is used to define ranges of chars in the Unicode table. – Wiktor Stribiżew May 09 '19 at 08:33
  • Aaaaaah. I see the explanation from https://stackoverflow.com/users/1968/konrad-rudolph. Thanks @WiktorStribiżew! – Colm Bhandal May 09 '19 at 08:34
  • Yeah, [that one](https://stackoverflow.com/a/4068725/3832970) is quite self-explanatory. – Wiktor Stribiżew May 09 '19 at 08:40
  • Not a bug. `[ -_]` is semantically equivalent to `[\x20-\x5F]`, which matches every character in the ASCII range 0x20 through 0x5F. This includes the space, special characters, brackets, `+`, `-`, digits, comparison operators, capital letters, etc., up to the underscore symbol in the ASCII table. If you want to match only your 3 specific characters, you should use the `[ \-_]` expression instead. Notice the `\` escape symbol – Agnius Vasiliauskas May 09 '19 at 08:47
  • Thanks @AgniusVasiliauskas. I didn't have any coffee this morning and completely forgot the special status of the "-" character in regex. My brain was convinced the space was the problem as moving it seemed to have an effect, but it was in fact the dash in between that was resulting in differing behaviours. Aside: I wonder if it was a clever design decision of the language developers to allow "-" to be used without escaping. Perhaps it would be more intuitive if you always had to escape the -... – Colm Bhandal May 09 '19 at 11:18
  • @ColmBhandal You can't always escape `-`, because it is _a character range operator_ in PCRE. For example, `[A-Z]` matches any char in the A..Z character range. At other times you just want to match the `-` character itself, and only then do you use the escape: `[A\-Z]` matches the chars A, Z and -. The compiler can't guess your intentions: did you want to match a plain "-" character or a character range? Compiling isn't guessing, so you must know the language syntax very well and use it according to the specification. BTW, in many languages the same character/word has several meanings, that's why you need context – Agnius Vasiliauskas May 09 '19 at 11:38
  • @AgniusVasiliauskas I may have been unclear: what I meant to say was, why aren't we always forced to escape the "-" character when using it as just the "-" character. Allowing the unescaped "-" character to sometimes be interpreted as a dash and sometimes as a special operator is confusing! – Colm Bhandal May 09 '19 at 12:42
  • @ColmBhandal _why aren't we always forced to escape the "-" character when using it as just the "-" character_, because operators have a higher precedence than literals when parsing a program's abstract syntax tree. And `-` is an operator when used in the `[x-y]` form. _Allowing the unescaped "-" character to sometimes be interpreted as a dash and sometimes as a special operator is confusing_ Every tool is confusing if you don't know how to use it. Grab a book on compiler construction and read it if you want the deep internals of programming language development – Agnius Vasiliauskas May 09 '19 at 15:51
  • @ColmBhandal ... And sorry for the joke, but a compiler can in principle escape **your whole program** source as one big string literal. Is this what you wanted? Probably not. I may have used the [Ad-absurdum](https://en.wikipedia.org/wiki/Reductio_ad_absurdum) logical fallacy here, but it was just a joke to help you understand the point and how compilers work. – Agnius Vasiliauskas May 09 '19 at 16:09
  • @AgniusVasiliauskas it's an interesting point of view. Only a test over thousands of users would tell what is truly intuitive. – Colm Bhandal May 09 '19 at 16:41
  • @ColmBhandal You are only partly right: 1. A programming language must be **consistent**. Only a language designer knows best what is consistent and what is not, because users don't have compiler domain knowledge 2. The definition of "intuitive" may differ between user groups 3. The definition of "intuitive" can change over time even within the same user group – Agnius Vasiliauskas May 10 '19 at 06:28
  • For example, some users would like to always escape `-` and some would like to never escape the dash. From the programming language designer's point of view it is best to assume that the user is clever and wants to write a complex and feature-rich program. This means that the compiler should try to maximize the number of branches in an abstract syntax tree, which means we need to maximize the number of operators and reserved keywords in the source code. Otherwise there would be no point in programming at all, because programs are rarely simple (unless it's just a tutorial :-D). – Agnius Vasiliauskas May 10 '19 at 06:49
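
The range behaviour described in the comments above can be checked with a minimal sketch. The class name `DashRangeDemo` and the helper `MatchedAscii` are made up for illustration and are not from the original post; the sketch only assumes the standard `System.Text.RegularExpressions.Regex` API. It lists which of the 128 ASCII characters each character class matches:

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

public static class DashRangeDemo
{
    // Hypothetical helper: returns, in code-point order, every ASCII
    // character (0..127) that the given pattern matches on its own.
    public static string MatchedAscii(string pattern)
    {
        var regex = new Regex(pattern);
        return new string(Enumerable.Range(0, 128)
            .Select(i => (char)i)
            .Where(c => regex.IsMatch(c.ToString()))
            .ToArray());
    }

    public static void Main()
    {
        // "[ -_]" is a range from ' ' (0x20) to '_' (0x5F), so it covers
        // 64 characters, capital letters included.
        Console.WriteLine(MatchedAscii("[ -_]").Length);            // 64
        Console.WriteLine(MatchedAscii("[ -_]").Contains('A'));     // True

        // Escaping the dash, or putting it first or last, makes it a literal,
        // so each class matches only ' ', '-' and '_'.
        Console.WriteLine(MatchedAscii(@"[ \-_]") == " -_");        // True
        Console.WriteLine(MatchedAscii("[- _]") == " -_");          // True
        Console.WriteLine(MatchedAscii("[ _-]") == " -_");          // True

        // The original example with the escaped dash gives the expected result.
        Console.WriteLine(Regex.Replace("Aa-_", @"[ \-_]", "x"));   // Aaxx
    }
}
```

Moving the space in the original pattern only appeared to matter because it changed which characters sat on either side of the dash.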
