3

Is it possible to find the first character in a string (its index) that causes a mismatch with a regular expression? Can this be done using only regular expression matching operations, or must something more complex be employed?

For instance, in JavaScript, I may have the regular expression /^\d{3}\s\d{2}$/, which matches a string of 3 digits followed by a whitespace character and another 2 digits. I apply this regular expression to the string "123a45". Doing this (e.g., "123a45".match(/^\d{3}\s\d{2}$/)) returns null, since the string does not match. How can I get the first character that causes this mismatch (in this case "a", the character at index 3)?

One use case would be to point users directly to the character that makes their input invalid according to the regular expression used to validate it.

Jindřich Mynarz
  • 4
    It is very likely that the generalized problem is not doable (one example is regex with OR `|`, you don't know which character actually causes the problem since there are 2 possible cases). For some specific problem, it may be possible to some extent. – nhahtdh Jul 21 '12 at 20:02
  • Could you somehow identify which part of a regex with `|` was used when the match failed? – Jindřich Mynarz Jul 21 '12 at 20:17
  • For example: `([A-Z][0-9]{2}|[0-9]{2}[A-Z])` and input `090`, which character will you report? The first `0` or the last `0`? And the match fails because it fails all the cases specified by the `OR`, and the cause can be varied, as pointed out. – nhahtdh Jul 21 '12 at 20:20
  • Yes, the simple identification of the first character causing the mismatch doesn't work in that ambiguous case. – Jindřich Mynarz Jul 21 '12 at 20:58

2 Answers

3

You would need to break the regex pattern down into all the patterns that match a partial prefix of it, and list those patterns as alternatives ordered from the longest match to the shortest (or empty) one. Once you have a match, the length of the (partial) match gives the position of the character that causes the mismatch. The one-character substring at that position is the character behind the mismatch, if there is one. If there is no mismatch, the result is an empty (sub)string.

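var s = "123a45";
// Alternatives run from the full pattern down to its shortest prefix; \d{0,3} can match nothing, so match() never returns null.
var matched = s.match(/^(\d{3}\s\d{2}|\d{3}\s\d|\d{3}\s|\d{0,3})/)[1];
alert(s.charAt(matched.length)); // the character right after the longest matching prefix, or "" if there is none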

http://jsfiddle.net/ETWWS/
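For patterns built only from fixed-length, single-character pieces, the list of partial patterns could be generated instead of written by hand. The following is a minimal sketch, not part of the original snippet: `buildPrefixRegex` and `firstMismatchIndex` are hypothetical names, and the pattern is assumed to be given as an array of regex source strings that each match exactly one character.

```javascript
// Hypothetical helper: builds ^(p1p2...pn|p1p2...p(n-1)|...|p1|) from the parts.
// Assumes every entry in `parts` matches exactly one character.
function buildPrefixRegex(parts) {
  var alternatives = [];
  // Longest prefix first, down to the empty prefix, so the regex always matches.
  for (var n = parts.length; n >= 0; n--) {
    alternatives.push(parts.slice(0, n).join(""));
  }
  return new RegExp("^(" + alternatives.join("|") + ")");
}

// Returns -1 when the whole input matches, otherwise the index of the first
// offending position (equal to input.length when the input ends too early).
function firstMismatchIndex(parts, input) {
  var prefixLength = buildPrefixRegex(parts).exec(input)[1].length;
  if (prefixLength === parts.length && prefixLength === input.length) {
    return -1;
  }
  return prefixLength;
}

// /^\d{3}\s\d{2}$/ unrolled into one piece per character.
var parts = ["\\d", "\\d", "\\d", "\\s", "\\d", "\\d"];
var index = firstMismatchIndex(parts, "123a45"); // 3, the position of "a"
```

This only works under the stated assumption. Quantifiers such as `{m,n}`, `+`, `*`, or `?` make the pieces variable-length, and the sketch does not handle them.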

Ωmega
  • 1
    Care to add a short explanation? – Octavian Helm Jul 21 '12 at 19:58
  • Do you think this is a generic solution that may be applied to all kinds of regexes? Do you think you can generate the "expanded" regex containing all the partially matching regexes automatically? – Jindřich Mynarz Jul 21 '12 at 20:13
  • 1
    @jindrichm - Every fixed-length, non-alternative pattern can be broken down. So the following examples cannot: (1) `\d{2,4}:\d{1,2}`, (2) `(\w{3}\d{2}|\d{3}\w{2})`, etc. – Ωmega Jul 21 '12 at 20:16
  • Yes, I see that most of the regexes can't be treated this way. I think it's probably better to do what @nhahtdh proposed - write a custom parser. – Jindřich Mynarz Jul 21 '12 at 20:19
  • 1
    @jindrichm - Yes, a parser is the safer solution. I posted my regex solution because you asked for regex, so... But I have to correct myself: an alternative pattern can be broken down, so the issue is variable-length (sub-)patterns. Using `{m,n}`, `+`, `*`, or `?` is what you need to avoid if you want to go with regex. So, for example, the phone number regex `\d{3}[-.]\d{3}[-.]\d{4}` is okay, even though the separator can be either a dash or a period/dot. Well, you have to decide... Good luck! (Lots of luck and all the best, Jindřich!) – Ωmega Jul 21 '12 at 20:35
2

To explain in detail why the input is invalid, it is better to write a small parser and have it provide the feedback. A parser can point the user to the character that is causing the problem and give a more helpful, targeted error message.

In the parser, you may use regexes to assert certain properties of the string and generate targeted error messages. For example, if the input must contain 6 characters, of which the first 3 are digits and the last 3 are letters, then you can write a regex that asserts the length of the input and report an error to the user when it fails.
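A small parser for the 6-character example might look like the sketch below. It is only an illustration under assumptions: `validateCode` and the shape of its result object are hypothetical, "letters" is taken to mean ASCII letters, and the length is checked with a plain comparison rather than a regex.

```javascript
// Hypothetical validator: 3 digits followed by 3 ASCII letters, 6 characters in total.
function validateCode(input) {
  // Walk the input and stop at the first character that breaks its rule.
  for (var i = 0; i < input.length && i < 6; i++) {
    var wantsDigit = i < 3;
    var rule = wantsDigit ? /[0-9]/ : /[A-Za-z]/;
    if (!rule.test(input.charAt(i))) {
      return {
        valid: false,
        index: i,
        message: "Expected a " + (wantsDigit ? "digit" : "letter") + " at position " + (i + 1)
      };
    }
  }
  // Every inspected character was fine, so only the length can still be wrong.
  if (input.length !== 6) {
    return {
      valid: false,
      index: Math.min(input.length, 6),
      message: "Expected exactly 6 characters, got " + input.length
    };
  }
  return { valid: true, index: -1, message: "" };
}

var result = validateCode("12a456");
// result.index is 2 and result.message names the offending position
```

Returning both an index and a message lets the caller highlight the offending character and show a targeted hint at the same time.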

Either that, or just use the regex you have been using and provide a generic error message (with helpful instructions on how to enter the data correctly). A normal user should be able to enter the data correctly in at most 2-3 tries. Beyond that, it may be a malicious user, the requested data may not be applicable to all users, or your instructions are lacking.

nhahtdh