3

I am rather new to regex and am stuck on the following, where I try to use preg_match_all to count the number of occurrences of hello after world.

If I use "world".+(hello), it matches all the way to the last hello; "world".*?(hello) stops at the first hello. Both give a count of one.

blah blah blah
hello
blah blah blah
class="world" 
blah blah blah
hello 
blah blah
hello
blah blah blah
hello
blah blah blah

I am expecting 3 as the count because the hello before world should not be counted.
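
For reference, a minimal reproduction of the two attempts. This is only a sketch: it assumes the s modifier is set so that . can cross line breaks (the modifiers actually used are not shown above), and the variable names are illustrative.

```php
<?php
// Sample input from the question.
$text = "blah blah blah\nhello\nblah blah blah\nclass=\"world\" \nblah blah blah\nhello \nblah blah\nhello\nblah blah blah\nhello\nblah blah blah";

// Greedy: a single match running from "world" to the LAST hello.
$greedy = preg_match_all('/"world".+(hello)/s', $text);

// Lazy: a single match running from "world" to the FIRST hello.
// No second match is possible because "world" occurs only once.
$lazy = preg_match_all('/"world".*?(hello)/s', $text);

echo $greedy, ' ', $lazy, "\n"; // 1 1
```

In both cases each match has to start at "world", which is why the count can never exceed the number of times "world" appears.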

mickmackusa
limestreetlab
  • Wiktor demonstrated the required pattern over 5 years ago @ [Regex to match specific words after one word](https://stackoverflow.com/q/35792048/2943403). – mickmackusa Nov 04 '21 at 11:47

4 Answers

2

Another option with simple regexes:

if(preg_match('/"world".*/s', $str, $out)) {
  echo preg_match_all('/\bhello\b/', $out[0]);
}

See demo at tio.run
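
The same two steps can be wrapped in a small function, which also makes the behavior explicit when "world" is missing. This is a sketch only: the function name countHelloAfterWorld and the second test string are made up for illustration.

```php
<?php
// Hypothetical helper wrapping the two-step approach shown above.
function countHelloAfterWorld(string $str): int
{
    // Step 1: grab everything from the first "world" (with quotes) to the end.
    if (!preg_match('/"world".*/s', $str, $out)) {
        return 0; // no trigger found, so nothing is counted
    }
    // Step 2: count whole-word hello inside the captured tail.
    return (int) preg_match_all('/\bhello\b/', $out[0]);
}

$text = "blah blah blah\nhello\nblah blah blah\nclass=\"world\" \nblah blah blah\nhello \nblah blah\nhello\nblah blah blah\nhello\nblah blah blah";

echo countHelloAfterWorld($text), "\n";                 // 3
echo countHelloAfterWorld("hello\nblah\nhello"), "\n";  // 0
```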

bobble bubble
  • 1
    Isn't this doing the same thing as that suggested by @Tim Biegeleisen ? grabbing whatever after *world* and then counting *hello* whole words? – limestreetlab Nov 04 '21 at 10:20
  • 1
    @deanstreet This one 1.) matches `"world"` and anything after it, and 2.) if there is a match, counts all `hello`s in that output. To be counted, `"world"` must exist in the string and the `hello`s must occur after it. – bobble bubble Nov 04 '21 at 10:23
  • As a newbie, I find this divide-and-conquer approach definitely easier to digest than the one-line regex suggested above, though that one-liner works great. – limestreetlab Nov 04 '21 at 10:27
1

You can use a single preg_match_all call here:

$text = "blah blah blah\nhello\nblah blah blah\nclass=\"world\" \nblah blah blah\nhello \nblah blah\nhello\nblah blah blah\nhello\nblah blah blah";
echo preg_match_all('~(?:\G(?!^)|\bworld\b).*?\K\bhello\b~s', $text);

See the regex demo and the PHP demo. Details:

  • (?:\G(?!^)|\bworld\b) - end of the previous match (\G(?!^) does this check: \G matches either start of the string or end of the previous match position, so we need to exclude the start of string position, and this is done with the (?!^) negative lookahead) or a whole word world
  • .*? - any zero or more chars, as few as possible
  • \K - discards all text matched so far
  • \bhello\b - a whole word hello.

NOTE: If you do not need word boundary check, you may remove \b from the pattern.
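
To see what the (?!^) lookahead contributes, the pattern can be run with and without it on the sample text. A minimal sketch; the variable names are illustrative:

```php
<?php
$text = "blah blah blah\nhello\nblah blah blah\nclass=\"world\" \nblah blah blah\nhello \nblah blah\nhello\nblah blah blah\nhello\nblah blah blah";

// With (?!^): \G is rejected at the start of the string,
// so the first match has to begin at the word world.
$withGuard = preg_match_all('~(?:\G(?!^)|\bworld\b).*?\K\bhello\b~s', $text);

// Without (?!^): \G also succeeds at offset 0,
// so the hello that precedes world is counted too.
$withoutGuard = preg_match_all('~(?:\G|\bworld\b).*?\K\bhello\b~s', $text);

echo $withGuard, ' ', $withoutGuard, "\n"; // 3 4
```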

If hello and world are user-defined patterns, you must preg_quote them in the pattern:

$start = "world";
$find = "hello";
$text = "blah blah blah\nhello\nblah blah blah\nclass=\"world\" \nblah blah blah\nhello \nblah blah\nhello\nblah blah blah\nhello\nblah blah blah";
echo preg_match_all('~(?:\G(?!^)|' . preg_quote($start, '~') . '\b).*?\K' . preg_quote($find, '~') . '~s', $text);
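
A short illustration of why the quoting matters, using made-up inputs in which the search term contains a regex metacharacter (the dot):

```php
<?php
// Hypothetical inputs chosen for illustration only.
$start = "world";
$find  = "a.b";
$text  = "world a.b axb a.b";

// Quoted: the dot is escaped, so only the literal a.b is counted.
$quoted = preg_match_all('~(?:\G(?!^)|' . preg_quote($start, '~') . '\b).*?\K' . preg_quote($find, '~') . '~s', $text);

// Unquoted: the dot matches any character, so axb is counted as well.
$unquoted = preg_match_all('~(?:\G(?!^)|' . $start . '\b).*?\K' . $find . '~s', $text);

echo $quoted, ' ', $unquoted, "\n"; // 2 3
```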
Wiktor Stribiżew
  • it works as intended, but just personally I have difficulty understanding it. what does (?!^) do ? I see excluding it will include matches before *world*. – limestreetlab Nov 04 '21 at 10:08
  • 1
    @deanstreet I added this explanation to the answer. Also, see [Continuing at The End of The Previous Match](https://www.regular-expressions.info/continue.html). Also, if you want to learn more about `\G` see [my YT video about `\G` use cases](https://www.youtube.com/watch?v=dsGUbvW5hsE&list=PL0l350Bvl3lI_KqlEAErGKBETpPtSsS9r&index=8). – Wiktor Stribiżew Nov 04 '21 at 10:10
  • 1
    I don't think you need `\K` here ;) – bobble bubble Nov 04 '21 at 10:15
  • @bobblebubble That is of least importance and not necessary, but later it might turn out to be a lifesaver :) – Wiktor Stribiżew Nov 04 '21 at 10:16
  • 1
    Something looks familiar: https://stackoverflow.com/a/35792544/2943403 – mickmackusa Nov 04 '21 at 11:32
  • @mickmackusa Right, that is about replacing words after another word. There is also another problem related to the use of Unicode strings with `preg*` functions. – Wiktor Stribiżew Nov 04 '21 at 11:40
1

Another way: force the pattern to fail, without retrying, if world doesn't exist in the string:

~(?:\A(*COMMIT).*?world)?.*?hello~s

demo

The non-capturing group is optional but greedy. Consequently, it is tried each time the pattern is attempted.
It begins with the \A anchor, which matches only at the start of the string, so this is the only position where the group can succeed. At any other position \A fails and, since the group is optional, the rest of its subpattern is skipped and the search continues with .*?hello.
Immediately after \A comes the backtracking control verb (*COMMIT): if anything fails after it and the engine backtracks to it, the pattern is not retried at all (end of the story).

In other words, if this group fails at the start of the string, the search is aborted once and for all.

Advantage: it needs fewer steps than a \G based pattern.
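
A quick check of both cases, as a sketch. The second subject string is a made-up input without world, added only to show the abort:

```php
<?php
$pattern = '~(?:\A(*COMMIT).*?world)?.*?hello~s';

// Sample text from the question (world is present).
$withWorld = "blah blah blah\nhello\nblah blah blah\nclass=\"world\" \nblah blah blah\nhello \nblah blah\nhello\nblah blah blah\nhello\nblah blah blah";

// Hypothetical text in which world never occurs.
$withoutWorld = "hello\nblah blah\nhello";

// First match runs from the start of the string, through world, to the first hello after it;
// the following matches each end at the next hello.
$found = preg_match_all($pattern, $withWorld);

// The group fails after (*COMMIT) at offset 0, so no other start position is tried.
$aborted = preg_match_all($pattern, $withoutWorld);

echo $found, ' ', $aborted, "\n"; // 3 0
```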


To be more efficient, a \G based pattern can also be written this way (using an optional group instead of an alternation):

~(?:\A.*?world)?(?!\A).*?hello~sA

Here the A modifier takes the role of the \G anchor: it is exactly the same as starting each branch of the pattern (only one here) with the \G anchor.
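
The equivalence can be checked by running the A-modifier pattern next to a variant with an explicit leading \G. A minimal sketch; the explicit-\G variant and the string without world are written here for illustration:

```php
<?php
$withA = '~(?:\A.*?world)?(?!\A).*?hello~sA';   // pattern from above
$withG = '~\G(?:\A.*?world)?(?!\A).*?hello~s';  // same pattern, \G written out

$text = "blah blah blah\nhello\nblah blah blah\nclass=\"world\" \nblah blah blah\nhello \nblah blah\nhello\nblah blah blah\nhello\nblah blah blah";
$noWorld = "hello\nblah blah\nhello"; // hypothetical input without world

// Each attempt must start where the previous match ended (or at offset 0).
$countA = preg_match_all($withA, $text);
$countG = preg_match_all($withG, $text);

// Without world, the group is skipped and (?!\A) rejects offset 0, so nothing matches.
$noneA = preg_match_all($withA, $noWorld);
$noneG = preg_match_all($withG, $noWorld);

echo "$countA $countG $noneA $noneG\n"; // 3 3 0 0
```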

Casimir et Hippolyte
0

One approach might be to first strip off the leading portion of the string up to, and including, the first occurrence of world. Then call preg_match_all as you already are doing and get the count of occurrences of hello.

$input = "blah blah blah
hello
blah blah blah
class=\"world\" 
blah blah blah
hello 
blah blah
hello
blah blah blah
hello
blah blah blah";

// The s modifier lets . cross line breaks; without it nothing is stripped.
$input = preg_replace("/^.*?\bworld/s", "", $input);
preg_match_all("/\bhello\b/", $input, $matches);
echo sizeof($matches[0]);  // 3
Tim Biegeleisen
  • 2
    This answer is flawed because it will fail if the trigger word (`world`) does not exist in the string. The output is still `4` even if `world` is missing. Proof: https://3v4l.org/58vXC – mickmackusa Nov 04 '21 at 20:03