5

This seems to be something very basic that I don't understand here.

Why doesn't "babc" match / a * / ?

> "abc" ~~ / a /
「a」
> "abc" ~~ / a * /
「a」
> "babc" ~~ / a * /
「」                    # WHY?
> "babc" ~~ / a + /
「a」
raiph
Eugene Barsky
  • Did you mean `/a*/` and it just came out as `/ a * /` because I'm pretty sure the spaces in a regex matter so the `*` is on the ` ` not the `a`.... – Sled Dec 07 '18 at 21:39
  • @Artb No they don't matter, and `/a*/` gives just the same result. – Eugene Barsky Dec 07 '18 at 21:44

2 Answers

8

Because the * quantifier makes the preceding atom match zero or more times.

「」 is the first match of / a * / in any string that does not start with a. For example:

say "xabc" ~~ / a * . /; # OUTPUT: 「x」

This is the same as:

say "xabc" ~~ / (a+)? . /;

If you make the pattern more precise, you get a different result:

say "xabc" ~~ / x a * /; # OUTPUT: 「xa」
say "xabc" ~~ / a * b /; # OUTPUT: 「ab」
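
For readers who want to check these results outside Raku, here is a rough equivalent using Python's `re` module. It is only a sketch: it assumes Python's `re.search` behaves like the Raku smartmatch for these simple patterns (leftmost match, greedy `*`), and the helper name `first_match` is made up for illustration.

```python
import re


def first_match(pattern, text):
    """Return the text of the leftmost match, or None if nothing matches."""
    m = re.search(pattern, text)
    return m.group(0) if m else None


# Rough Python spellings of the Raku patterns above. Python does not
# ignore whitespace in a pattern by default, so the spaces are left out.
print(repr(first_match(r"a*", "babc")))      # ''
print(repr(first_match(r"a*.", "xabc")))     # 'x'
print(repr(first_match(r"(a+)?.", "xabc")))  # 'x'
print(repr(first_match(r"xa*", "xabc")))     # 'xa'
print(repr(first_match(r"a*b", "xabc")))     # 'ab'
```

In every case the match starts at the earliest position where the whole pattern can succeed; adding a required character before or after `a*` is what moves that position.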
  • That makes sense -- but then why does `"abc" ~~ / a * /` not give the same result? – Keith Thompson Dec 08 '18 at 00:39
  • 1
    @KeithThompson An `a*` pattern (not `a*?`) will match one or more consecutive `a`s -- as many as it can. Give it *zero* `a`s to match and it'll *still* match as many as it can (zero). Either way, it matches and the regex is done unless there's more of the pattern after the `a*`. If there's more pattern then it tries that too. If *that* matches then the regex is done. If not, the regex backtracks and tries again, matching one less `a` to see if that works out. If that fails, it backtracks and matches one less `a`, etc. – raiph Dec 08 '18 at 01:01
  • @KeithThompson Similarly, an `a*?` pattern will frugally match one or more consecutive `a`s -- as **few** as it can. It's even happier to match *zero* `a`s than the greedy `a*` without a `?` on the end. But if there's more to the regex pattern, then it continues. If the next bit fails then the engine backtracks and tries matching one **more** `a` rather than one less, and then tries the remainder of the pattern. If it still fails, then it backtracks again etc. (This backtracking behavior only applies if it's a `regex` pattern, not a `token` or `rule`. A `/.../` literal is a `regex`.) – raiph Dec 08 '18 at 01:11
  • @raiph Surely you mean zero or more consecutive `a`s when using `a*` – one or more is `a+`. A frugal match such as `a*?` is guaranteed to provide a zero length match regardless of the input. `say "cccc" ~~ /a*?/ # output: 「」` – donaldh Dec 10 '18 at 12:23
  • @donaldh You're right about `a*` and `a+` of course. A silly mistake on my part. – raiph Dec 10 '18 at 15:12
  • @donaldh There must be some misunderstanding about `a*?`. Perhaps I didn't make myself clear about the distinction between a subpattern and an overall regex or that I meant to speak generically about `*` and `*?` without regard for the atom they quantify. Note that your example is the same whether one uses `a*` or `a*?`. Here's a simple counterexample to one literal reading of your comment: `say "1aa2" ~~ /1 a*? 2/ # output: 「1aa2」`. But again, that's the same for `a*` and `a*?`. Next, consider `say "1aaabab" ~~ /1 [a|b]* <(b.*)>/ # 「b」` vs `say "1aaabab" ~~ /1 [a|b]*? <(b.*)>/ # 「bab」`. – raiph Dec 10 '18 at 15:41
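
The greedy versus frugal examples from these comments can be approximated in Python's `re` module. This is a hedged sketch, not a statement about the Raku engine: Python has no `<( ... )>` markers, so an ordinary capture group stands in for them, and `[a|b]` is written as the character class `[ab]`.

```python
import re

# Greedy [ab]* grabs as much as it can, then backtracks just far enough
# for the trailing b to match.
greedy = re.search(r"1[ab]*(b.*)", "1aaabab")

# Frugal [ab]*? grabs as little as it can, then extends one character at
# a time until the rest of the pattern matches.
frugal = re.search(r"1[ab]*?(b.*)", "1aaabab")

print(greedy.group(1))  # b
print(frugal.group(1))  # bab

# With nothing after it, the frugal form settles for the empty string.
print(repr(re.search(r"a*?", "cccc").group(0)))  # ''

# Pinned between 1 and 2, both forms are forced to the same overall match.
print(re.search(r"1a*?2", "1aa2").group(0))  # 1aa2
print(re.search(r"1a*2", "1aa2").group(0))   # 1aa2
```

The quantifier's preference only shows up when the rest of the pattern leaves it a choice, which is why the first pair differs and the last pair does not.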
7

The answers here are correct; I'll just try to present them in a more coherent form:

Matching always starts from the left

The regex engine always starts at the left of the string, and prefers left-most matches over longer matches.

* matches empty strings

The regex a* can match the strings '', 'a', 'aa', etc. It always prefers the longest match available at the current position, but if it can't find a match longer than the empty string there, it just matches the empty string.

Putting it together

In 'abc' ~~ /a*/, the regex engine starts at position 0, the a* matches as many a's as it can, and thus matches the first character.

In 'babc' ~~ /a*/, the regex engine starts at position 0, and the a* can match only zero characters. It does so successfully. Since the overall match succeeds, there is no reason to try again at position 1.
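
The two rules can be sketched as a small hand-written search loop. This is an illustration in Python, not how any real engine is implemented; the function name `leftmost_run_of_a` and its `min_count` parameter are hypothetical, with `min_count=0` standing in for `a*` and `min_count=1` for `a+`.

```python
def leftmost_run_of_a(text, min_count):
    """Emulate searching text for a run of 'a' of at least min_count.

    Start positions are tried from left to right. At each one, as many
    'a' characters as possible are consumed (greedy). The first start
    position whose run is long enough wins, even if a longer run exists
    further to the right.
    """
    for start in range(len(text) + 1):
        end = start
        while end < len(text) and text[end] == "a":
            end += 1
        if end - start >= min_count:
            return start, end  # (from, to) of the match
    return None  # no start position worked


print(leftmost_run_of_a("abc", 0))   # (0, 1)  like /a*/ on 'abc'
print(leftmost_run_of_a("babc", 0))  # (0, 0)  like /a*/ on 'babc'
print(leftmost_run_of_a("babc", 1))  # (1, 2)  like /a+/ on 'babc'
```

With `min_count=0` the check at position 0 can never fail, so the loop never gets as far as the `a` at position 1.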

moritz
  • Thanks! The only question is why it's made different from standard bash `grep` (where `echo babc | grep 'a*'` will match)? – Eugene Barsky Dec 08 '18 at 08:02
  • The regex does match, it just matches the empty string. If you do a `say so 'babc' ~~ /a*/`, it says True. – moritz Dec 08 '18 at 08:37
  • I meant that traditional grep will match a non-empty string (`a` in this case). So what's the idea behind making it different in Perl 6? – Eugene Barsky Dec 08 '18 at 09:58
  • @EugeneBarsky Presumably the regex engine underlying `grep` is matching empty strings too but `grep` chooses not to display them. Additionally, `grep` is presumably defaulting to multiple matching. So `echo babaac | grep 'a*'`, which displays something like "b**a**b**aa**c", corresponds to `say "babaac" ~~ m:g/ a * /`, which displays `(「」 「a」 「」 「aa」 「」 「」)`. – raiph Dec 08 '18 at 12:06
  • `echo bbbb | grep 'a*'` also matches so grep's behaviour is the same. Grep is showing the lines that match, not the matches themselves. – donaldh Dec 08 '18 at 18:41
  • @donaldh It normally shows the matches, either with color or using `-o`. – Eugene Barsky Dec 08 '18 at 23:15
  • @EugeneBarsky – hmm. When I try grep with -o it shows the zero length match. Running `echo babc | grep -o 'a*'` prints a blank line, i.e. the line matches and -o provides the matching part of the line which is zero length. If you try it with --color=always it will print `babc` with nothing highlighted. Try it with `ababc` and it prints `ababc` with the first `a` highlighted. There's no difference in behaviour between grep and Perl 6 for `a*`. – donaldh Dec 10 '18 at 15:25
  • @donaldh What's your system? for `echo babc | grep -o 'a*'` I get `a` in output. – Eugene Barsky Dec 10 '18 at 16:19
  • @EugeneBarsky macOS running grep (BSD grep) 2.5.1-FreeBSD. Interestingly, I have a Linux box with grep (GNU grep) 2.5.1 which exhibits the same behaviour. But a newer Linux box with grep (GNU grep) 2.20 gives the behaviour you describe. The man page on that machine says `Print only the matched (non-empty) parts of a matching line`. – donaldh Dec 10 '18 at 17:17
  • @EugeneBarsky on that machine, `echo bbbb | grep 'a*'` prints `bbbb`. It turns out that since 2005 GNU grep forces non-empty matches when using -o or --color, so its behaviour changes if you ask to see the matches – http://git.savannah.gnu.org/cgit/grep.git/commit/?id=268c542052401bc3f3b9d2e006a254da5e3bac28 – donaldh Dec 10 '18 at 17:31
  • @donaldh I've just tested both my Mac (gnu grep 2.21) and Linux (gnu grep 3.1), and they both give `a`. So that seems to be really some ancient behaviour. :) producing `bbbb` seems logical to me as well. – Eugene Barsky Dec 10 '18 at 18:23
  • GNU grep uses a regex engine called a DFA (Deterministic Finite Automaton), which is different from the engine used by Perl (and, I guess, Perl 6): an NFA (Nondeterministic Finite Automaton). The difference is in how the engine works through the text: the NFA engine is regex-driven, so the regex is matched against the text (and /a*/ matches against the beginning of the string). The DFA engine is text-driven, so the text is examined one character at a time and matched against the regex. – Fernando Santagata Dec 12 '18 at 16:06
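
The difference discussed in these comments between a single match, all matches, and displaying only the non-empty ones can be reproduced with Python's `re` module. This is a sketch under two assumptions: that Python 3.7 or later is used (the handling of empty matches in `re.findall` changed in that release), and that filtering out empty strings is a fair stand-in for what newer GNU grep does with `-o`.

```python
import re

text = "babaac"

# A single search stops at the first success: the empty match at 0.
first = re.search(r"a*", text)
print(first.span())  # (0, 0)

# Collecting every match also finds the runs of a further right,
# interleaved with empty matches.
all_matches = re.findall(r"a*", text)
print(all_matches)  # ['', 'a', '', 'aa', '', ''] on Python 3.7+

# Dropping the empty matches leaves only the visible pieces.
non_empty = [m for m in all_matches if m]
print(non_empty)  # ['a', 'aa']
```

The list from `re.findall` has the same shape as the `m:g/ a * /` result quoted above, and the filtered list corresponds to the highlighted parts of the grep output.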