1

I have this regex:

var reg = /^(\w+ ?)+&(\w+ ?)+$/;

and this string:

var string = "number number number number number number number&";

When I execute:

reg.test(string);

this causes an unresponsive script message. I think this is due to catastrophic backtracking (http://www.regular-expressions.info/catastrophic.html) but I can't find another way to write my regex. It simply must match two groups of any number of lowercase words separated by a space character, separated by a '&' character.

Some examples:

reg.test("string&number") //returns true
reg.test("string number&number") //returns true
reg.test("string number&string date") //returns true
reg.test("&string") //returns false
reg.test("string number&") //returns false
Casey Rule
  • 2,085
  • 2
  • 18
  • 28
  • 1
    Make the space mandatory? – Bergi Nov 26 '14 at 19:31
  • 1
    Yes, that's sort of on the right track. The space being optional is what is causing the ambiguity in the regular expression. The solution is to break the first subexpression into two, one of which requires the space. – Casey Rule Nov 26 '14 at 19:35

1 Answers1

3

If you structure it like this, it removes the catastrophic backtracking:

var reg = /^(\w+)( \w+)*&(\w+)( \w+)*$/;

The key is changing (\w+ ?)+ to (\w+)( \w+)*, so that there is only one way for the string to be parsed for that regular expression. To see the problem in your original expression, let's take a look at the string "number number" and the expression /(\w+ ?)+/. Your intention is to have the string broken into "number " and "number" by this regular expression, however, /(\w+ ?)/ can also match each letter individually, which makes your expression ambiguous. On the other hand, /(\w+)( \w+)*/ can only match "number number" in one way, as a space is required at the beginning of each iteration of the second subpattern.

Test it out here:

function testRegExp() {
  var rStr = '';
  try {
    var reg = new RegExp($('#r').val());
    rStr += reg.test( $('#t').val() );
  } catch( e ) {
     rStr = 'regex error';
  }
  $('#result').val( rStr );
}


$('input').on( 'input', testRegExp );

testRegExp();
label {
  width: 80px;
  display: inline-block;
}
input#t, input#r {
  width: 50%;
}
<script src="https://ajax.googleapis.com/ajax/libs/jquery/2.1.1/jquery.min.js"></script>
<p>
<label>String: </label>
  <input id="t" type="text" value="number number number number number number number&" /></p>
<p><label>RegExp: </label>
  <input id="r" type="text" value="^(\w+)( \w+)*&(\w+)( \w+)*$" /></p>
<p>
  <label>String: </label>
  <input id="result" type="text" disabled="disabled" />
</p>

Edit: Alan Moore and nhahtdh pointed out in the comments that this expression can actually be slightly simplified like this:

var reg = /^(\w+[ &])+(\w+( |$))+$/
Casey Rule
  • 2,085
  • 2
  • 18
  • 28
  • Nice one! Note that you can get rid of the optional space in the last group, too: `(\w+(?: |$))+$` – Alan Moore Nov 26 '14 at 19:35
  • What's the purpose of `(?: |$)` rather than `( |$)`? Also, do you really need the `$` at the end in this case? – Casey Rule Nov 26 '14 at 19:52
  • `( |&)` can be changed to `[ &]`. Character class typically performs between than alternation in absence of optimization. And you missed `$` at the very end (which is present in Alan Moore's comment). The `$` is important at the end, since nothing says that the repetition would terminate with matching `\w+$`. – nhahtdh Nov 27 '14 at 02:40
  • Replacing ( |&) with [ &] is a good move. Thanks for the suggestion. I'll fix that in my answer – Casey Rule Nov 27 '14 at 04:20
  • `(?:..)` is a non-capturing group. You should always use those in preference to the capturing variety unless you really want to capture something. As for the final anchor (`$`), yes, you still need it; the one I added doesn't anchor the whole match. Without it, you could match a string like `number number&number number !@#$%^`. It won't *consume* the junk characters, but their presence should cause the match to fail, and it doesn't. – Alan Moore Nov 27 '14 at 04:43
  • Do non-capturing groups perform better, or are we just assuming that OP just wants to capture the words and not the spaces? – Casey Rule Nov 27 '14 at 05:18