How does preg_match_all() process strings?

Question

I'm still learning a lot about PHP and string alteration is something that is of interest to me. I've used preg_match before for things like validating an email address or just searching for inquiries.

I just came from this post What's wrong in my regular expression? and was curious as to why the preg_match_all function produces 2 strings, 1 w/ some of the characters stripped and then the other w/ the desired output.

From what I understand, the function goes over the string character by character, using the RegEx to evaluate what to do with it. Could this RegEx have been structured in such a way as to bypass the first array entry and just produce the desired result?

So you don't have to go to the other thread, here is the code:

$str = 'text^name1^Jony~text^secondname1^Smith~text^email1^example-
        free@wpdevelop.com~';

preg_match_all('/\^([^^]*?)\~/', $str, $newStr);

for($i=0;$i<count($newStr[0]);$i++)
{
    echo $newStr[0][$i].'<br>';
}

echo '<br><br><br>';

for($i=0;$i<count($newStr[1]);$i++)
{
    echo $newStr[1][$i].'<br>';
}

This will output

^Jony~
^Smith~
^example-free@wpdevelop.com~

Jony
Smith
example-free@wpdevelop.com

I'm curious whether the 2 array entries are due to the original syntax of the string or are the normal processing response of the function. Sorry if this shouldn't be here, but I'm really curious as to how this works.

thanks, Brodie

The output will always contain the entire match and an entry for each capture group in your expression. — Felix Kling, Oct 19 '11 at 21:39
Not an answer, but interesting to note here is the `PREG_SET_ORDER` flag, which will return a simpler result list. And while you cannot remove the `[0]` array entry for the complete match, you can strip its content using `\K` in the regex. — mario, Oct 19 '11 at 21:46
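A minimal sketch of the `\K` idea from the comment above, assuming the email in the question was meant to be one unbroken string. `\K` resets the start of the reported match, and a lookahead keeps the trailing `~` out of it, so `[0]` ends up holding only the wanted text:

```php
<?php
$str = 'text^name1^Jony~text^secondname1^Smith~text^email1^example-free@wpdevelop.com~';

// \^       match the leading caret
// \K       drop everything matched so far from the reported match
// [^^]*?   as few non-caret characters as possible
// (?=~)    require a following ~ without consuming it
preg_match_all('/\^\K[^^]*?(?=~)/', $str, $m);

// There are no capture groups, so $m has a single entry: $m[0].
print_r($m[0]);
```

The `[0]` entry is still there; it just no longer contains the delimiters.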

score 2 · Accepted Answer · answered Oct 19 '11 at 21:39

It's standard behavior for preg_match and preg_match_all: the first entry in the "matched values" array is the FULL text that was caught by the regex pattern (for preg_match_all, an array holding every full match). The subsequent entries are the 'capture groups', whose existence depends on the placement/position of () pairs in the regex pattern.

In your regex's case, /\^([^^]*?)\~/, the full matching string would be

^   Jony    ~
|     |     |
^  ([^^]*?) ~   -> $newStr[0][0] = ^Jony~
                -> $newStr[1][0] = Jony (due to the `()` capture group).
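To make the indexing concrete, here is a small sketch comparing the two functions on a shortened version of the question's string (the variable names `$one` and `$all` are just for illustration). preg_match fills a flat array for the first match only, while preg_match_all adds one more level holding every match:

```php
<?php
$str = 'text^name1^Jony~text^secondname1^Smith~';
$pattern = '/\^([^^]*?)~/';

// preg_match: stops after the first match.
// $one[0] is the full match, $one[1] is capture group 1.
preg_match($pattern, $str, $one);

// preg_match_all: $all[0] lists every full match,
// $all[1] lists every capture-group-1 value.
preg_match_all($pattern, $str, $all);

echo $one[0], ' / ', $one[1], "\n";                 // first match only
echo implode(', ', $all[0]), ' / ', implode(', ', $all[1]), "\n";
```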

Marc B


Ah, I understand: the first thing it does is find the text starting with ^ and ending with ~, and then the second expression in () takes everything after the ^ minus the ~. I guess curiosity gets the best of me: if [^^]*? tells it to grab the text after '^', why then does it not grab the '~'? – Brodie Oct 19 '11 at 21:53
The regex as a whole does match the `~`, but the `~` isn't inside your capture group, so it only shows up in the `[0]` section. You can think of the entire regex pattern as a capture group itself, whose virtual capture is stored in `[0]`; any captures you explicitly create with `()` then go into [1], [2], etc. – Marc B Oct 19 '11 at 21:56
`[^^]*?` translates to "any number of characters (`*`, '0 or more') that are NOT a ^ (`[^^]`), in a non-greedy fashion (`?`)", i.e. as few as possible while still letting the rest of the pattern match. – Marc B Oct 19 '11 at 21:58
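The effect of the non-greedy `?` only shows when more than one `~` appears before the next `^`. The input below is a made-up string chosen to expose the difference, not data from the question:

```php
<?php
// Hypothetical input: two ~ characters follow the first ^.
$str = 'x^one~two~y^three~';

// Lazy: the group stops at the first ~ it can reach.
preg_match_all('/\^([^^]*?)~/', $str, $lazy);

// Greedy: the group runs as far as it can, then backtracks
// to the last ~ before the next ^.
preg_match_all('/\^([^^]*)~/', $str, $greedy);

print_r($lazy[1]);
print_r($greedy[1]);
```

On the question's own string both variants give the same captures, because there is never a second `~` between one `^` and the next.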

score 2 · Answer 2 · answered Oct 19 '11 at 21:50

Could this RegEx have been structured in such a way as to bypass the first array entry and just produce the desired result?

Absolutely. Use lookaround assertions, which check for the ^ and ~ without making them part of the match. This regex:

preg_match_all('/(?<=\^)[^^]*?(?=~)/', $str, $newStr);

Results in:

Array
(
    [0] => Array
        (
            [0] => Jony
            [1] => Smith
            [2] => example-free@wpdevelop.com
        )

)
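If only the flat list is wanted, the lookaround pattern can be wrapped in a small function that returns `[0]` directly. This is a sketch; `between_caret_and_tilde` is a hypothetical helper name, not something from the thread or from PHP itself:

```php
<?php
// Hypothetical helper: returns every run of text that sits
// directly after a ^ and directly before a ~.
function between_caret_and_tilde(string $subject): array
{
    // (?<=\^) lookbehind: position must be preceded by ^
    // (?=~)   lookahead:  position must be followed by ~
    preg_match_all('/(?<=\^)[^^]*?(?=~)/', $subject, $m);

    // No capture groups, so the full matches in [0] are the result.
    return $m[0];
}

$values = between_caret_and_tilde('text^name1^Jony~text^secondname1^Smith~');
print_r($values);
```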

score 1 · Answer 3 · answered Oct 19 '11 at 21:39

As the manual states, this is the expected result (for the default PREG_PATTERN_ORDER flag). The first entry of $newStr contains all full pattern matches, the next entry contains all matches for the first parenthesized subpattern, and so on.
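For comparison, a short sketch of the same call with the default ordering and with PREG_SET_ORDER, using a shortened version of the question's string. The variable names `$byPattern` and `$bySet` are illustrative only:

```php
<?php
$str = 'text^name1^Jony~text^secondname1^Smith~';
$pattern = '/\^([^^]*?)~/';

// Default (PREG_PATTERN_ORDER): grouped by slot.
// $byPattern[0] = all full matches, $byPattern[1] = all group-1 captures.
preg_match_all($pattern, $str, $byPattern);

// PREG_SET_ORDER: grouped by match.
// $bySet[0] = [full match, group 1] for the first match, and so on.
preg_match_all($pattern, $str, $bySet, PREG_SET_ORDER);

foreach ($bySet as $match) {
    echo $match[1], "\n"; // group 1 of each match
}
```

Which ordering is more convenient depends on whether the code loops over matches or over one particular group.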

mAu


score 1 · Answer 4 · answered Oct 19 '11 at 21:39

The first array in the result of preg_match_all contains the strings that match the whole pattern you passed to the preg_match_all() function, in your case /\^([^^]*?)\~/. Subsequent arrays in the result contain the matches for each pair of parentheses in your pattern. Maybe it is easier to understand with an example:

$string = 'abcdefg';
preg_match_all('/ab(cd)e(fg)/', $string, $matches);

The $matches array will be

array(3) {
  [0]=>
  array(1) {
    [0]=>
    string(7) "abcdefg"
  }
  [1]=>
  array(1) {
    [0]=>
    string(2) "cd"
  }
  [2]=>
  array(1) {
    [0]=>
    string(2) "fg"
  }
}

The first array will contain the match of the entire pattern, in this case 'abcdefg'. The second array will contain the match for the first set of parentheses, in this case 'cd'. The third array will contain the match for the second set of parentheses, in this case 'fg'.

score 0 · Answer 5 · answered Dec 28 '12 at 12:05

Whenever you have trouble picturing what preg_match_all does, you can use an evaluator such as the preg_match_all tester at regextester.net.

It shows you the result in real time, and you can configure things like the result order, meta instructions, offset capturing and more.
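Offset capturing can also be tried directly in PHP with the PREG_OFFSET_CAPTURE flag. A minimal sketch on a shortened version of the question's string (array destructuring in `foreach` assumes PHP 7.1 or later):

```php
<?php
$str = 'text^name1^Jony~text^secondname1^Smith~';

// With PREG_OFFSET_CAPTURE every entry becomes [text, byte offset].
preg_match_all('/\^([^^]*?)~/', $str, $m, PREG_OFFSET_CAPTURE);

foreach ($m[1] as [$text, $offset]) {
    echo "$text starts at byte $offset\n";
}
```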

score 0 · Answer 6 · answered Oct 19 '11 at 21:38

[0] contains the entire match, while [1] contains only a portion (the part you want to extract). You can do var_dump($newStr) to see the array structure, and you'll figure it out.

$str = 'text^name1^Jony~text^secondname1^Smith~text^email1^example-
        free@wpdevelop.com~';

preg_match_all('/\^([^^]*?)\~/', $str, $newStr);

$newStr = $newStr[1];
foreach($newStr as $key => $value)
{
    echo $value."\n"; 
}

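This will result in the following (a weird result, since I haven't modified the expression; the line break and indentation inside the string literal are part of the string, so they show up in the match):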

Jony
Smith
example-
        free@wpdevelop.com
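The split email comes from the string literal, not from the regex. A short sketch, assuming the address was meant to be one unbroken string (`$wrapped` and `$joined` are illustrative names):

```php
<?php
// A quoted literal that spans two source lines keeps the newline
// and the indentation as part of the string.
$wrapped = 'example-
        free@wpdevelop.com';

// Concatenation lets the source wrap without changing the value.
$joined = 'example-' . 'free@wpdevelop.com';

var_dump($wrapped === $joined);

// Stripping all whitespace from the wrapped version recovers the address.
var_dump(preg_replace('/\s+/', '', $wrapped) === $joined);
```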
