* quantifier in Perl 6

Question

This seems to be something very basic that I don't understand here.

Why doesn't "babc" match / a * / ?

> "abc" ~~ / a /
｢a｣
> "abc" ~~ / a * /
｢a｣
> "babc" ~~ / a * /
｢｣                    # WHY?
> "babc" ~~ / a + /
｢a｣

Did you mean `/a*/` and it just came out as `/ a * /` because I'm pretty sure the spaces in a regex matter so the `*` is on the ` ` not the `a`.... — Sled, Dec 07 '18 at 21:39
@Artb No they don't matter, and `/a*/` gives just the same result. — Eugene Barsky, Dec 07 '18 at 21:44

Pavlo Bashynskyi · Answer 1 · 2018-12-07T23:05:06.810


Because the * quantifier makes the preceding atom match zero or more times.

｢｣ is the first match of / a * / in any string that does not begin with a. For example:

say "xabc" ~~ / a * . /; # OUTPUT: ｢x｣

It behaves the same as:

say "xabc" ~~ / (a+)? . /;

If you make the pattern more precise, you will get a different result:

say "xabc" ~~ / x a * /; # OUTPUT: ｢xa｣
say "xabc" ~~ / a * b /; # OUTPUT: ｢ab｣

edited Dec 07 '18 at 23:05

answered Dec 07 '18 at 22:56

Pavlo Bashynskyi



That makes sense -- but then why does `"abc" ~~ / a * /` not give the same result? – Keith Thompson Dec 08 '18 at 00:39

@KeithThompson An `a*` pattern (not `a*?`) will match one or more consecutive `a`s -- as many as it can. Give it *zero* `a`s to match and it'll *still* match as many as it can (zero). Either way, it matches and the regex is done unless there's more of the pattern after the `a*`. If there's more pattern then it tries that too. If *that* matches then the regex is done. If not, the regex backtracks and tries again, matching one less `a` to see if that works out. If that fails, it backtracks and matches one less `a`, etc. – raiph Dec 08 '18 at 01:01
@KeithThompson Similarly, an `a*?` pattern will frugally match one or more consecutive `a`s -- as **few** as it can. It's even happier to match *zero* `a`s than the greedy `a*` without a `?` on the end. But if there's more to the regex pattern, then it continues. If the next bit fails then the engine backtracks and tries matching one **more** `a` rather than one less, and then tries the remainder of the pattern. If it still fails, then it backtracks again etc. (This backtracking behavior only applies if it's a `regex` pattern, not a `token` or `rule`. A `/.../` literal is a `regex`.) – raiph Dec 08 '18 at 01:11
@raiph Surely you mean zero or more consecutive `a`s when using `a*` – one or more is `a+`. A frugal match such as `a*?` is guaranteed to provide a zero length match regardless of the input. `say "cccc" ~~ /a*?/ # output: ｢｣` – donaldh Dec 10 '18 at 12:23
@donaldh You're right about `a*` and `a+` of course. A silly mistake on my part. – raiph Dec 10 '18 at 15:12
@donaldh There must be some misunderstanding about `a*?`. Perhaps I didn't make myself clear about the distinction between a subpattern and an overall regex or that I meant to speak generically about `*` and `*?` without regard for the atom they quantify. Note that your example is the same whether one uses `a*` or `a*?`. Here's a simple counterexample to one literal reading of your comment: `say "1aa2" ~~ /1 a*? 2/ # output: ｢1aa2｣`. But again, that's the same for `a*` and `a*?`. Next, consider `say "1aaabab" ~~ /1 [a|b]* <(b.*)>/ # ｢b｣` vs `say "1aaabab" ~~ /1 [a|b]*? <(b.*)>/ # ｢bab｣`. – raiph Dec 10 '18 at 15:41

score 7 · Accepted Answer · answered Dec 08 '18 at 07:52


The answers here are correct; I'll just try to present them in a more coherent form:

Matching always starts from the left

The regex engine always starts at the left end of the string and prefers left-most matches over longer matches.
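
A minimal sketch of the left-most preference. The input string `xaayaaa` and the variable name `$m` are made up for illustration and do not come from the thread. The longer run of `a`s further right is never considered, because a match is already available further left:

```raku
# Hypothetical input: two runs of 'a', the second one longer than the first.
my $m = 'xaayaaa' ~~ / a+ /;

say $m;        # ｢aa｣
say $m.from;   # 1
say $m.to;     # 3
```

`.from` and `.to` report where the match starts and ends, which makes it easy to see which run was picked.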

`*` matches empty strings

The regex a* can match the strings '', 'a', 'aa', etc. It always prefers the longest match it can find at the current position, but if it can't find a match longer than the empty string, it just matches the empty string.
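
As a small illustration (the inputs and variable names here are assumptions chosen for the example, not taken from the question), the following sketch contrasts `a*` with `a+` on an empty string and shows the preference for the longest run at a given position:

```raku
# Hypothetical inputs: an empty string and a string starting with three 'a's.
my $star-on-empty = ''     ~~ / a* /;
my $plus-on-empty = ''     ~~ / a+ /;
my $greedy        = 'aaab' ~~ / a* /;

say so $star-on-empty;   # True:  zero 'a's is a valid match for a*
say so $plus-on-empty;   # False: a+ needs at least one 'a'
say ~$greedy;            # aaa
```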

Putting it together

In 'abc' ~~ /a*/, the regex engine starts at position 0, the a* matches as many a's as it can, and thus matches the first character.

In 'babc' ~~ /a*/, the regex engine starts at position 0, and the a* can match only zero characters. It does so successfully. Since the overall match succeeds, there is no reason to try again at position 1.
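
The two cases can be checked by looking at where each match starts and ends. This is only a sketch; the variable names are made up for the example, and the `a+` case is the one from the question:

```raku
# 'abc': position 0 holds an 'a', so a* consumes one character.
my $abc = 'abc' ~~ / a* /;
say ($abc.from, $abc.to);     # (0 1)

# 'babc': position 0 holds a 'b', so a* matches zero characters there and is done.
my $babc = 'babc' ~~ / a* /;
say ($babc.from, $babc.to);   # (0 0)
say so $babc;                 # True

# Requiring at least one 'a' makes position 0 fail, so the engine moves on to position 1.
my $plus = 'babc' ~~ / a+ /;
say ($plus.from, $plus.to);   # (1 2)
```

So the empty result in the question is a successful match of length zero at position 0, not a failure.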

moritz


Thanks! The only question is why it's made different from standard bash `grep` (where `echo babc | grep 'a*'` will match)? – Eugene Barsky Dec 08 '18 at 08:02
The regex does match, it just matches the empty string. If you do a `say so 'babc' ~~ /a*/`, it says True. – moritz Dec 08 '18 at 08:37
I meant that traditional grep will match a non-empty string (`a` in this case). So what's the idea behind making it different in Perl 6? – Eugene Barsky Dec 08 '18 at 09:58

@EugeneBarsky Presumably the regex engine underlying `grep` is matching empty strings too but `grep` chooses not to display them. Additionally, `grep` is presumably defaulting to multiple matching. So `echo babaac | grep 'a*'`, which displays something like "b**a**b**aa**c", corresponds to `say "babaac" ~~ m:g/ a * /`, which displays `(｢｣｢a｣｢｣｢aa｣｢｣｢｣)`. – raiph Dec 08 '18 at 12:06
`echo bbbb | grep 'a*'` also matches so grep's behaviour is the same. Grep is showing the lines that match, not the matches themselves. – donaldh Dec 08 '18 at 18:41
@donaldh It normally shows the matches, either with color or using `-o`. – Eugene Barsky Dec 08 '18 at 23:15
@EugeneBarsky – hmm. When I try grep with -o it shows the zero length match. Running `echo babc | grep -o 'a*'` prints a blank line, i.e. the line matches and -o provides the matching part of the line which is zero length. If you try it with --color=always it will print `babc` with nothing highlighted. Try it with `ababc` and it prints `ababc` with the first `a` highlighted. There's no difference in behaviour between grep and Perl 6 for `a*`. – donaldh Dec 10 '18 at 15:25
@donaldh What's your system? for `echo babc | grep -o 'a*'` I get `a` in output. – Eugene Barsky Dec 10 '18 at 16:19
@EugeneBarsky macOS running grep (BSD grep) 2.5.1-FreeBSD. Interestingly, I have a Linux box with grep (GNU grep) 2.5.1 which exhibits the same behaviour. But a newer Linux box with grep (GNU grep) 2.20 gives the behaviour you describe. The man page on that machine says `Print only the matched (non-empty) parts of a matching line`. – donaldh Dec 10 '18 at 17:17
@EugeneBarsky on that machine, `echo bbbb | grep 'a*'` prints `bbbb`. It turns out that since 2005 gnu grep forces non-empty matches when using -o or --color, so its behaviour changes if you ask to see the matches – http://git.savannah.gnu.org/cgit/grep.git/commit/?id=268c542052401bc3f3b9d2e006a254da5e3bac28 – donaldh Dec 10 '18 at 17:31
@donaldh I've just tested both my Mac (gnu grep 2.21) and Linux (gnu grep 3.1), and they both give `a`. So that seems to be really some ancient behaviour. :) producing `bbbb` seems logical to me as well. – Eugene Barsky Dec 10 '18 at 18:23
GNU grep uses a regex engine called DFA (Deterministic Finite Automaton), which is different from the engine used by Perl (and I guess, Perl6): NFA (Nondeterministic Finite Automaton). The difference is in how the engine and the text work: the NFA engine is regex-driven, so the regex is matched against the text (and /a*/ matches against the beginning of the string). The DFA engine is text-driven, so the text is examined one character at a time and matched against the regex. – Fernando Santagata Dec 12 '18 at 16:06
