
Can the preg_match() function include groups it did not find in the matches array?

Here is the pattern I'm using:

/^([0-9]+)(\.[0-9]+)?\s?([^iIbB])?([iI])?([bB])?$/

What I'm trying to do is parse a human-readable size into bytes. This pattern fits my requirement, but only if I can retrieve the matches in absolute group order.

The pattern can produce up to 5 match groups, which would result in a matches array with indices 0-5. However, if the string does not match all groups, the matches array may have, for example, group 5 at index 3.

What I'd like is for the final group in that pattern (5) to always be at the same index in the matches array. Because several groups are optional, it's very important that when reading the matches array we know which group in the expression was matched.

Example situation: the regex tester at regexr.com shows all 5 groups, including those not matched, always in the correct order. If you enable the "global" and "multi-line" flags and use the following text, you can hover over the blue matches for a good visual.

500.2 KiB
256M
700 Mb
1.2GiB

You'll notice that not every group is matched each time, but the group indexes are always in the correct order.


Edit: Yes, I already tried this in PHP, with the following:

$matches    = [];
$matchesC   = 0;
$matchesN   = 6;
if (!preg_match("/^([0-9]+)(\.[0-9]+)?\s?([^iIbB])?([iI])?([bB])?$/", $size, $matches) || ($matchesC = count($matches)) < $matchesN) {
    print_r($matches);
    throw new \Exception(sprintf("Could not parse size string. (%d/%d)", $matchesC, $matchesN));
}

When $size is "256M", print_r($matches) prints:

Array
(
    [0] => 256M
    [1] => 256
    [2] => 
    [3] => M
)

Groups 4 and 5 are missing.

Adambean
  • Did you test that in PHP? See https://ideone.com/NSm7Iy, all the "empty" groups are there. – Wiktor Stribiżew May 11 '17 at 10:26
  • Yes. A print_r() of the matches array does not include non-matched groups, causing the index of matched groups to skew. – Adambean May 11 '17 at 10:59
  • Yeah, it does not show the last items, [but they are there](https://ideone.com/3ondVH). All you need is to check if a group is empty or not with `empty($m[n])`. Or is it a must that `print_r` should *print* the empty group values? – Wiktor Stribiżew May 11 '17 at 11:06
  • I guess that's just a bit of extra work to iterate to the expected array size and do an `array_key_exists()` to fill in empty values. Kinda expected `preg_match()` to do that out of the box. – Adambean May 11 '17 at 11:11
  • There is one fun fact: the non-participating groups are just not initialized with an empty string value in PHP, so groups 4 and 5 are *null*, and you seem to be right: `preg_match` is to blame. – Wiktor Stribiżew May 11 '17 at 11:23

1 Answer


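In PHP, non-participating groups are not initialized with an empty string, so groups 4 and 5 are null for the string '256M'. It seems that preg_match discards those uninitialized values from the end of the array. That is why group 2, which is also unmatched but is followed by the matched group 3, still appears as an empty string, while groups 4 and 5 are missing altogether.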
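
To make the behavior concrete, here is a minimal sketch that runs the question's original pattern against '256M' and then pads the result to a fixed size. The use of `array_pad()` and the target size of 6 (whole match plus 5 groups) are choices made for this illustration, not something the question prescribes.

```php
<?php
// The question's original pattern, with optional capturing groups.
$pattern = '/^([0-9]+)(\.[0-9]+)?\s?([^iIbB])?([iI])?([bB])?$/';

preg_match($pattern, '256M', $m);

// Groups 4 and 5 did not participate and sit at the end, so they are omitted.
// Group 2 did not participate either, but a matched group follows it,
// so it is kept as an empty string.
var_dump(count($m));

// Pad to the full size: index 0 (whole match) plus 5 groups = 6 entries.
$padded = array_pad($m, 6, '');
var_dump(count($padded));
```

With the padded array, index 5 always refers to group 5, whether or not it matched. On PHP 7.4 and later, the `PREG_UNMATCHED_AS_NULL` flag is documented to keep trailing unmatched groups as null, which should also give a fixed-size array.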

In your case, you can make the capturing groups themselves non-optional and make the patterns inside them optional instead:

$arr = array('500.2 KiB', '256M', '700 Mb', '1.2GiB');
foreach ($arr as $s) {
    if (preg_match('~^([0-9]+)(\.[0-9]+)?\s?([^ib]?)(i?)(b?)$~i', $s, $m)) {
        print_r($m); // groups 3-5 always participate, so $m has six entries
    }
}

Output:

Array
(
    [0] => 500.2 KiB
    [1] => 500
    [2] => .2
    [3] => K
    [4] => i
    [5] => B
)
Array
(
    [0] => 256M
    [1] => 256
    [2] => 
    [3] => M
    [4] => 
    [5] => 
)
Array
(
    [0] => 700 Mb
    [1] => 700
    [2] => 
    [3] => M
    [4] => 
    [5] => b
)
Array
(
    [0] => 1.2GiB
    [1] => 1
    [2] => .2
    [3] => G
    [4] => i
    [5] => B
)

See the PHP demo.
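
If the match has to stay case-sensitive, the same idea should work without the `i` flag by keeping the explicit character classes. The following is only a sketch: `splitSize()` and the key names it returns are hypothetical, introduced here for illustration.

```php
<?php
// Case-sensitive variant: the ? quantifiers sit inside groups 3-5,
// so those groups always participate, possibly with an empty match.
$pattern = '/^([0-9]+)(\.[0-9]+)?\s?([^iIbB]?)([iI]?)([bB]?)$/';

// Hypothetical helper: returns the five groups by name, or null on no match.
function splitSize(string $size, string $pattern): ?array
{
    if (!preg_match($pattern, $size, $m)) {
        return null;
    }
    // $m always has six entries here, so fixed positions are safe.
    [, $int, $frac, $prefix, $binary, $unit] = $m;
    return compact('int', 'frac', 'prefix', 'binary', 'unit');
}

$parts = splitSize('700 Mb', $pattern);
print_r($parts);
```

Because group 5 is always present, `$parts['unit']` can be compared against 'b' or 'B' directly, with no `isset()` check.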

Wiktor Stribiżew
  • I was sure I had already checked with a `var_dump()` for keys with a null value, oh well. Making the groups non-optional worked, but not by moving the `?`'s into the group. Instead I used this pattern: `/^([0-9]+)\.?([0-9]*)\s?([^iIbB]*)([iI]*)([bB]*)$/` -- Side note: I need this case-sensitive, because "m" and "M" have different meanings in IEC power factors. – Adambean May 11 '17 at 13:57
  • Yeah, you made the patterns optional, just as I said. The `*` is also a quantifier that allows matching 0 occurrences of an atom. Note my regex **is** case insensitive - see **`~i`**. Also, I believe you could use `(?:\.?([0-9]+))?` instead of `\.?([0-9]*)`. – Wiktor Stribiżew May 11 '17 at 13:59
  • Quite odd that the `?` operator inside each group didn't work. (I expected it to.) – Adambean May 11 '17 at 14:00
  • It does, see my demo. – Wiktor Stribiżew May 11 '17 at 14:01