I'm building a developer tool, and in one input field my users can input regular expressions.

If they enter an expression that tries to match a literal ? character anywhere then they've probably made a mistake, as I know that ? specifically is guaranteed to never appear in the string to match (and if they're trying to spot one, then there's a different action they should take instead). I would like to show a warning in that case.

How can I quickly check from a string containing a regular expression whether it contains a literal ? character? E.g. I want to warn about regular expression strings like hello\?, but not https?.

Detecting \? is probably a good start, but I imagine there are other cases too.

I'm building this in JavaScript. Solutions based on simple string processing are preferable to fully parsing the regular expression, if possible.

Tim Perry
  • You will also need to make sure you are not matching a `?` in a non-capturing group, or a lookahead. You probably want to find an escaped `?`, something like `/\\(?:\\{2})*\?/`. However, when a `?` is used inside a character class, like `[a-z?#$]`, it will become much trickier. – Wiktor Stribiżew Aug 12 '19 at 11:26
  • Don't forget character classes `[?]` – phuzi Aug 12 '19 at 11:28
  • If your regular expression is delimited using quotes you might want to look for a \\? instead, https://stackoverflow.com/questions/889957/escaping-question-mark-in-regex-javascript – aadibajpai Aug 12 '19 at 11:28
  • You want to check for `\?` - escaped question mark but not `\\?` - optional escaped backslash. Also, you want to check for `[?]` - a question mark in a character class. But not `\[a?\]` - square brackets optionally containing `a`. Other than that, there is also the character code for a question mark, I guess. – VLAZ Aug 12 '19 at 11:30
  • Ok, so it sounds like I need to spot escaped literals + any usage inside character classes, for `?` or its character codes (`\u003F` and `\x3F`). Any other cases? Also, since I need to find character classes, how do I reliably spot those? Can I simply check between unescaped `[` & `]` characters? – Tim Perry Aug 12 '19 at 11:38
  • @TimPerry I think that has you covered. The trickiest part is taking into account escaped special cases - `\\?` or `\[?\]`. Everything else should be straight forward. Unfortunately, I can't personally write or test a regex for this at the moment. – VLAZ Aug 12 '19 at 11:41
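
The cases collected in the comments above (an escaped `\?`, the escapes `\x3F` and `\u003F`, and a bare `?` between unescaped `[` and `]`) can be sketched as a single left-to-right scan. This is only a rough sketch: the function name `mentionsLiteralQuestionMark` is made up for illustration, and it deliberately ignores character class ranges that happen to span `?` (such as `[>-@]`) and the `\u{3F}` form.

```javascript
// Hypothetical helper (not from any library): returns true when the regex
// source string appears to match a literal "?" somewhere.
function mentionsLiteralQuestionMark(source) {
  let inClass = false; // true while between an unescaped "[" and "]"
  for (let i = 0; i < source.length; i++) {
    const ch = source[i];
    if (ch === '\\') {
      const rest = source.slice(i + 1);
      // \?  \x3F  \u003F all denote a literal question mark
      if (/^(?:\?|x3[fF]|u003[fF])/.test(rest)) return true;
      i++; // skip the escaped character so "\\?" and "\[a?\]" are not misread
      continue;
    }
    if (inClass) {
      if (ch === ']') inClass = false;
      else if (ch === '?') return true; // "?" has no special meaning in a class
    } else if (ch === '[') {
      inClass = true;
    }
  }
  return false;
}

console.log(mentionsLiteralQuestionMark('hello\\?')); // regex source: hello\?
console.log(mentionsLiteralQuestionMark('https?'));   // regex source: https?
```

Skipping the character after every backslash is what keeps `\\?` (an optional backslash) and `\[a?\]` (literal brackets around an optional `a`) from being flagged.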

1 Answer

Consider using an existing Regular Expression parser which outputs an AST.

For example for JavaScript:
https://www.npmjs.com/package/regjsparser
https://github.com/jviereck/regjsparser

The demo page here allows you to see the generated AST:
http://www.julianviereck.de/regjsparser/

Then you could search the AST for nodes whose "codePoint" is 63 (the code point of "?"):

{
      "type": "value",
      "kind": "identifier",
      "codePoint": 63,
      "range": [
        15,
        17
      ],
      "raw": "\\?"
    }

Also note that "characterClassRange" nodes might include your "?" character within their range. The following example is a range of characters that includes "?" (63): http://www.julianviereck.de/regjsparser/#%2F%5B%5Cu003e-%5Cu0040%5D%2Fiu

You could check the "codePoint" range between min and max for your character.

{
      "type": "characterClassRange",
      "min": {
        "type": "value",
        "kind": "unicodeEscape",
        "codePoint": 62,
        "range": [
          1,
          7
        ],
        "raw": "\\u003e"
      },
      "max": {
        "type": "value",
        "kind": "unicodeEscape",
        "codePoint": 64,
        "range": [
          8,
          14
        ],
        "raw": "\\u0040"
      },
      "range": [
        1,
        14
      ],
      "raw": "\\u003e-\\u0040"
    }
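
As a rough sketch of both checks, the walker below visits every nested object in an AST and looks for the two node shapes shown above. It assumes only the field names visible in those snippets ("type", "codePoint", "min", "max"); the function name `astMentionsCodePoint` is made up for illustration, and the sample node is hand-copied from the output above rather than produced by calling the parser.

```javascript
const QUESTION_MARK = 63; // code point of "?"

// Hypothetical helper: walks any AST shaped like the nodes shown above and
// reports whether a value node or a class range covers the given code point.
function astMentionsCodePoint(node, codePoint = QUESTION_MARK) {
  if (Array.isArray(node)) {
    return node.some((child) => astMentionsCodePoint(child, codePoint));
  }
  if (node === null || typeof node !== 'object') return false;

  // A literal character, however it was written (\?, \x3F, \u003F, ...)
  if (node.type === 'value' && node.codePoint === codePoint) return true;

  // A range such as [\u003e-\u0040] that spans the code point
  if (
    node.type === 'characterClassRange' &&
    node.min.codePoint <= codePoint &&
    codePoint <= node.max.codePoint
  ) {
    return true;
  }

  // Recurse into every property so nested groups, alternatives and
  // character classes are covered without listing each node type.
  return Object.values(node).some((child) => astMentionsCodePoint(child, codePoint));
}

// Sample node copied from the range example above
const rangeNode = {
  type: 'characterClassRange',
  min: { type: 'value', kind: 'unicodeEscape', codePoint: 62, range: [1, 7], raw: '\\u003e' },
  max: { type: 'value', kind: 'unicodeEscape', codePoint: 64, range: [8, 14], raw: '\\u0040' },
  range: [1, 14],
  raw: '\\u003e-\\u0040',
};
console.log(astMentionsCodePoint(rangeNode));
```

In real use the root node would come from the parser, for example `require('regjsparser').parse(source, flags)`, assuming that is how the package exposes its parser. A `?` used as a quantifier, as in https?, is not represented as a value node, so it should not trigger the check.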

Obviously check other test cases for other "types" that might include your character, but generally using an AST to perform these checks will improve how you "catch" them ("Gotta Catch 'Em All").

Also note there is a JS library to generate regular expressions from the AST:
https://www.npmjs.com/package/regjsgen
https://github.com/bnjmnt4n/regjsgen

Dean Taylor