I have a problem with the following regular expression:

var s = "http://www.google.com/dir/file\r\nhello"
var re = new RegExp("http://([^/]+).*/([^/\r\n]+)$");
var arr = re.exec(s);
alert(arr[2]);

Above, I expect arr[2] (i.e. capture group 2) to be "file": the greedy .* consumes the rest of the first line, backtracks until the / in the pattern can match, the group then matches the last 4 characters of the first line, and $ anchors at the end of that line.

In fact, arr is null, which means that the pattern did not match at all.

I can alter this slightly so it does precisely what I intend:

var s = "http://www.google.com/dir/file\r\nhello"
var re = new RegExp("http://([^/]+).*/([^/\r\n]+)[\r\n]*");
var arr = re.exec(s);
alert(arr[2]); // "file", as expected

My question is not so much HOW to grab "file" from the end of the first line in s. Instead, I'm trying to understand WHY the first regexp fails and the second succeeds. Why does $ not match against the \r\n line break in example 1? Isn't that the sole purpose of its existence? Is there something else I'm missing?

Also, consider the same first regular expression as used in sed (with extended regular expression mode enabled with -r):

$ echo -e "http://www.google.com/dir/file\r\nhello" |sed -r  -e 's#http://([^/]+).*/([^/\r\n]+)$#\2.OUTSIDE.OF.CAPTURE.GROUP#'
<<OUTPUT>>
file.OUTSIDE.OF.CAPTURE.GROUP
hello

Here, capture group 2 captures "file" and nothing else. "hello" appears in the output, but is not inside the capture group, as shown by the position of the string ".OUTSIDE.OF.CAPTURE.GROUP" in the output. So the regular expression works according to my understanding in sed, but not with the built-in JavaScript regexp engine.

If I replace \r\n in the input string with just \n, the behavior is identical for all three above examples, so that should not be relevant as far as I can tell.

  • you forget to escape the `/` see it here: https://regex101.com/r/cV1nJ0/1 – Jorge Campos Oct 05 '15 at 22:49
  • Jorge: I'm afraid that's not it. As you can see in your link, that captures "file\r\nhello" for the second capture group, while I'm trying to capture just "file". / shouldn't be considered a delimiter when used in RegExp("...") as far as I can tell, nor in the sed script, where # is the delimiter. Thanks anyway though. – jrsanderson Oct 05 '15 at 22:56

1 Answer

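You need to enable multiline mode (the "m" flag). Without it, $ in a JavaScript regular expression matches only at the very end of the input string. With it, $ also matches at the end of each line, immediately before a line terminator such as \r or \n: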

var re = new RegExp("http://([^/]+).*/([^/\r\n]+)$", "m");
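To make the difference concrete, here is a small sketch that runs the question's pattern against the question's input with and without the flag. The variable names `pattern`, `withoutM` and `withM` are only illustrative.

```javascript
// Same input as in the question: two lines separated by CR LF.
const s = "http://www.google.com/dir/file\r\nhello";
const pattern = "http://([^/]+).*/([^/\r\n]+)$";

// Without flags, $ can only match at the very end of the whole string,
// and neither .* nor [^/\r\n]+ can cross the line break to get there.
const withoutM = new RegExp(pattern).exec(s);

// With "m", $ also matches just before a line terminator.
const withM = new RegExp(pattern, "m").exec(s);

console.log(withoutM);  // null
console.log(withM[1]);  // "www.google.com"
console.log(withM[2]);  // "file"
```

The second example in the question works for the same reason: it drops $ entirely, so nothing forces the match to reach the end of the input.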

http://javascript.info/tutorial/ahchors-and-multiline-mode
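As for sed: it applies the script to one input line at a time, so there $ always sits at the end of the current line and no multiline flag is needed. A rough JavaScript imitation of that, as a sketch only: it uses the "\n"-only variant of the input mentioned in the question, and the split/map/join pipeline is a hypothetical stand-in for sed's line reading, not how sed is implemented.

```javascript
// The question's input, using the "\n"-only variant the asker also tried.
const input = "http://www.google.com/dir/file\nhello";

// The original pattern, deliberately WITHOUT the "m" flag.
const re = new RegExp("http://([^/]+).*/([^/\r\n]+)$");

// Hypothetical stand-in for sed's line-by-line processing:
// each line is matched on its own, so $ means "end of this line".
const output = input
  .split("\n")
  .map((line) => line.replace(re, "$2.OUTSIDE.OF.CAPTURE.GROUP"))
  .join("\n");

console.log(output);
// file.OUTSIDE.OF.CAPTURE.GROUP
// hello
```

Because re.exec(s) in the question sees both lines as a single string, it needs the "m" flag to get the same per-line meaning of $.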


Maksym Kozlenko