
I just did the fun regex crosswords at http://regexcrossword.com/ and realized I don't understand what quantifying a group means, e.g. (.)+ or (.)*

I tried it at http://ole.michelsen.dk/tools/regex.html, which offers both the JavaScript and the PHP regex engines:

The string to match against is "Trololo!" (without quotation marks). Where switching on "Global match" changed the result, I list that result as a primed variant (JS'). It changed nothing in PHP mode.

JS,  (.)+ => 0: Trololo! 1: ! 
JS', (.)+ => 0: Trololo! 
PHP, (.)+ => 0: Trololo! 0: ! 
JS,  (.)* => 0: Trololo! 1: ! 
JS', (.)* => 0: Trololo! 
PHP, (.)* => 0: Trololo! 1: 0: ! 1: 
JS,  (.){5} => 0: Trolo 1: o 
JS', (.){5} => 0: Trolo 
PHP, (.){5} => 0: Trolo 0: o 
JS,  (.){4} => 0: Trol 1: l 
JS', (.){4} => 0: Trol 1: olo! 
PHP, (.){4} => 0: Trol 1: olo! 0: l 1: ! 

Is there a normative answer as to what the semantics of this are?

ekad
Falko

1 Answer


The outputs aren't labelled correctly, that's all.

First of all, what should happen? If you repeat a group, each new instance overwrites the last capture. If the group isn't used at all it will return an empty string or something like undefined in JS (it depends on the flavor). There is a good article over on regular-expressions.info on the matter.
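
A minimal sketch of the overwrite behaviour in JavaScript. The string and the first pattern come from the question; the second pattern, with an optional group that never matches, is my own illustration:

```javascript
// A quantified group keeps only the capture from its last iteration.
const repeated = "Trololo!".match(/(.)+/);
console.log(repeated[0]); // "Trololo!" (the whole match)
console.log(repeated[1]); // "!" (the last character the group matched)

// A group that never takes part in the match is undefined in JavaScript.
const unused = "Trololo!".match(/T(x)?/);
console.log(unused[0]); // "T"
console.log(unused[1]); // undefined
```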

Now how do we get to your results? Let's start with JavaScript.

All the examples labelled JS (the non-global ones) fit the above description. They match the desired number of characters in 0 and capture the last character in 1. So we can ignore these.

What's with the global ones? Here the output was interpreted incorrectly. When you use the global flag with the String.match() function, you don't get an array of all captures any more - but only an array of all matches (group 0 for each match). Hence, in the case of +, * and {5} where there is only one match, you only get that one result. With {4} there is enough room for two matches in the target string, so the resulting array contains two elements. To get all captures with the global flag, you'd need to write a loop and use RegExp.exec() instead (which gives you one match at a time, but all its captures).
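
As a rough sketch of that difference, using the {4} case from the question (the variable names are just for illustration):

```javascript
const text = "Trololo!";

// With the g flag, String.prototype.match returns only the full matches.
const fullMatches = text.match(/(.){4}/g);
console.log(fullMatches); // [ 'Trol', 'olo!' ]

// RegExp.prototype.exec returns one match per call, including its captures.
const re = /(.){4}/g;
const results = [];
let m;
while ((m = re.exec(text)) !== null) {
  results.push({ match: m[0], capture: m[1] });
}
console.log(results);
// [ { match: 'Trol', capture: 'l' }, { match: 'olo!', capture: '!' } ]
```

The loop is safe here because {4} can never match an empty string. A pattern that allows zero-width matches needs extra care with lastIndex.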

And what's with PHP? It seems that it's using preg_match_all, which is global anyway, which is why using g had no effect. The + gives the result you'd expect again. So does {5}.

What's with the other two? Here, the output has been labelled the wrong way round. By default, preg_match_all returns a two-dimensional array, where the first index corresponds to the group and the second one corresponds to the match. Your output treats it the other way round. Hence, when there are multiple matches, the first pair labelled 0 and 1 holds the full text of the two matches found. The second pair labelled 0 and 1 holds what the group captured in each of those two matches.
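
To make the two layouts concrete, here is a sketch that only emulates the array shapes in JavaScript. It does not call PHP, and String.prototype.matchAll is a standard method I'm using for brevity, not something the tool is known to use:

```javascript
// Match-major layout: byMatch[matchIndex][groupIndex]
const byMatch = [..."Trololo!".matchAll(/(.){4}/g)].map((m) => [m[0], m[1]]);
console.log(byMatch); // [ [ 'Trol', 'l' ], [ 'olo!', '!' ] ]

// Group-major layout: byGroup[groupIndex][matchIndex]
// This is the shape described above for preg_match_all's default ordering.
const byGroup = byMatch[0].map((_, g) => byMatch.map((row) => row[g]));
console.log(byGroup); // [ [ 'Trol', 'olo!' ], [ 'l', '!' ] ]
```

Reading byGroup as if it were byMatch produces exactly the confusing "0: Trol 1: olo! 0: l 1: !" listing from the question.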

So for *, you first get the full string as a match, and the last character as the capture (the two things labelled 0), which is correct. And then, since * allows zero-width matches, you get another (empty) match at the end of the string, along with an empty capture. I'm not sure why the corresponding JS' example does not contain an additional empty string, though, because String.match would do the same thing.
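
For what it's worth, a quick sketch of the zero-width case in plain JavaScript (run outside the tool from the question, so it says nothing about why the tool's output differs):

```javascript
// Global match with *: the full string, then an empty match at the end.
const starMatches = "Trololo!".match(/(.)*/g);
console.log(starMatches); // [ 'Trololo!', '' ]

// The same with exec, to see the position and capture of each match.
const star = /(.)*/g;
const found = [];
let hit;
while ((hit = star.exec("Trololo!")) !== null) {
  found.push([hit.index, hit[0], hit[1]]);
  // A zero-width match does not advance lastIndex, so step past it
  // manually to avoid looping forever.
  if (hit[0] === "") star.lastIndex++;
}
console.log(found); // [ [ 0, 'Trololo!', '!' ], [ 8, '', undefined ] ]
```

Note that the capture of the empty match is undefined in JavaScript, where PHP reports an empty string.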

And for {4}, you just get two matches (Trol and olo!) as in the JavaScript case with the captures l and !, respectively, which is again perfectly fine.

Martin Ender
  • > First of all, what should happen? Exactly, that's the question :-) I actually expected that matching (.)* on abcd would give 0. abcd 1. a 2. b 3. c 4. d, that is, I'd get more groups, since the quantifier seems to quantify the group. But thinking about it longer, that doesn't seem usable at all, as the group references are fixed. – Falko Jul 30 '13 at 07:09
  • @Falko, the only regex flavor which does that is .NET. There you get a Group object for each group which contains a collection of Captures. – Martin Ender Jul 30 '13 at 07:37
  • Ah, cool. But as mentioned, I cannot imagine how this can be helpful in practice. – Falko Jul 30 '13 at 14:00
  • @Falko say you have something like `...(13|52|78|33)...` and want to match all numbers, but there could be an arbitrary number of them. The alternative is two-step matching, where you first match the sequence and then split it or something. But in .NET, you can capture all those numbers right away. .NET actually goes further: those captures are saved on stacks, where elements can be popped again during matching, which allows for things like counting in regex: [see balancing groups](stackoverflow.com/questions/17003799/what-are-regular-expression-balancing-groups) – Martin Ender Jul 31 '13 at 12:45
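
The two-step alternative mentioned in the last comment might look roughly like this in JavaScript. The input string and both patterns are hypothetical, made up only to illustrate the idea:

```javascript
// Hypothetical input, not from the thread.
const input = "values=13,52,78,33;";

// One step: the repeated group only keeps the last number.
const oneStep = input.match(/values=(?:(\d+),?)+;/);
console.log(oneStep[1]); // "33"

// Two steps: capture the whole sequence first, then split it.
const seq = input.match(/values=((?:\d+,?)+);/);
const numbers = seq[1].split(",");
console.log(numbers); // [ '13', '52', '78', '33' ]
```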