
I have a couple of "shortcode" blocks in a text, which I want to replace with some HTML entities on the fly using preg_replace_callback.

The syntax of a shortcode is simple:

[block:type-of-the-block attribute-name1:value attribute-name2:value ...]

Attributes with values may be provided in any order. Sample regex pattern I use to find these shortcode blocks:

/\[
    (?:block:(?<block>piechart))
    (?:
        (?:\s+value:(?<value>[0-9]+)) |
        (?:\s+stroke:(?<stroke>[0-9]+)) |
        (?:\s+angle:(?<angle>[0-9]+)) |
        (?:\s+colorset:(?<colorset>reds|yellows|blues))
    )*
\]/xumi

Now, here comes the funny thing: PHP matches non-existent named groups. For a string like this:

[block:piechart colorset:reds value:20]

...the resulting $matches array is (note the empty strings in "stroke" and "angle"):

array(11) {
  [0]=>
  string(39) "[block:piechart colorset:reds value:20]"
  ["block"]=>
  string(8) "piechart"
  [1]=>
  string(8) "piechart"
  ["value"]=>
  string(2) "20"
  [2]=>
  string(2) "20"
  ["stroke"]=>
  string(0) ""
  [3]=>
  string(0) ""
  ["angle"]=>
  string(0) ""
  [4]=>
  string(0) ""
  ["colorset"]=>
  string(4) "reds"
  [5]=>
  string(4) "reds"
}

Here's the code for testing (you can execute it online here as well: https://onlinephp.io/c/2429a):

$pattern = "
/\[
    (?:block:(?<block>piechart))
    (?:
        (?:\s+value:(?<value>[0-9]+)) |
        (?:\s+stroke:(?<stroke>[0-9]+)) |
        (?:\s+angle:(?<angle>[0-9]+)) |
        (?:\s+colorset:(?<colorset>reds|yellows|blues))
    )*
\]/xumi";
$subject = "here is a block to be replaced [block:piechart value:25   angle:720]  [block] and another one [block:piechart colorset:reds value:20]";
preg_replace_callback($pattern, 'callbackFunction', $subject);

function callbackFunction($matches)
{
    var_dump($matches);

    // process matched values, return some replacement...
    $replacement = "...";

    return $replacement;
}

Is it normal that PHP creates empty entries in the $matches array for named groups that did not take part in the match, instead of leaving them out? What am I doing wrong? How can I prevent PHP from creating these false entries, which simply shouldn't be there?

Any help or explanation would be deeply appreciated! Thanks!

misioptysio
  • And sorry for the long post, I tried to be as detailed as possible. – misioptysio Jul 05 '22 at 23:27
  • I'd use [a very different approach (**demo**)](https://3v4l.org/rfMYe), doing more in the callback than on the regex side and using a pretty [simple pattern](https://regex101.com/r/rzEKO4/1). It's just an idea; I won't post an answer, as I'm about to turn off the computer. – bobble bubble Jul 06 '22 at 00:39

2 Answers


This behaviour is as expected, although not well documented. In the manual under "Subpatterns":

When the whole pattern matches, that portion of the subject string that matched the subpattern is passed back to the caller

and:

Consider the following regex matched against the string Sunday:

(?:(Sat)ur|(Sun))day

Here Sun is stored in backreference 2, while backreference 1 is empty

and also in the documentation of the PREG_UNMATCHED_AS_NULL flag (new as of version 7.2.0). From the manual:

If this flag is passed, unmatched subpatterns are reported as null; otherwise they are reported as an empty string.
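The difference is easy to see with the manual's own Sunday example. The following is a minimal sketch using preg_match, which accepts the flag from PHP 7.2.0 onwards; the variable names are illustrative only.

```php
<?php
$pattern = '/(?:(Sat)ur|(Sun))day/';

// Default behaviour: the unmatched group 1 is reported as an empty string.
preg_match($pattern, 'Sunday', $default);

// With the flag: the unmatched group 1 is reported as null instead.
preg_match($pattern, 'Sunday', $asNull, PREG_UNMATCHED_AS_NULL);

var_dump($default[1]); // string(0) ""
var_dump($asNull[1]);  // NULL
var_dump($asNull[2]);  // string(3) "Sun"
```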

This flag gives you a way to work around the behaviour (note that the flags parameter of preg_replace_callback was only added in PHP 7.4.0):

preg_replace_callback($pattern, 'callbackFunction', $subject, -1, $count, PREG_UNMATCHED_AS_NULL);

If you take this approach then in your callback you could filter the $matches array using array_filter to remove the NULL values.

$matches = array_filter($matches, function ($v) { return !is_null($v); });
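The explicit null check matters because array_filter without a callback drops every value that is loosely equal to false, which would also discard a legitimate attribute value of "0". A small sketch with a hand-written array (the keys mirror the question's group names; the data is made up for illustration):

```php
<?php
// Hand-written stand-in for a $matches array; not real preg output.
$matches = [
    'value'    => '0',   // a legitimate attribute value
    'stroke'   => null,  // unmatched group under PREG_UNMATCHED_AS_NULL
    'angle'    => '',    // empty string
    'colorset' => 'reds',
];

// No callback: removes '0', null and '' alike.
$loose = array_filter($matches);

// Explicit null check: removes only the null entry.
$strict = array_filter($matches, function ($v) { return !is_null($v); });

var_dump(array_keys($loose));  // only "colorset" survives
var_dump(array_keys($strict)); // "value", "angle" and "colorset" survive
```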

Demo on 3v4l.org
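As an alternative to filtering, the callback can read each named group with the null coalescing operator and fall back to a default when the group is null. This is only a sketch: the condensed pattern, the default values and the <pie> markup are assumptions for illustration, and the flags argument requires PHP 7.4 or later.

```php
<?php
// Condensed form of the question's pattern (single \s+ before the alternation).
$pattern = '/\[block:(?<block>piechart)'
    . '(?:\s+(?:value:(?<value>[0-9]+)|stroke:(?<stroke>[0-9]+)'
    . '|angle:(?<angle>[0-9]+)|colorset:(?<colorset>reds|yellows|blues)))*\]/i';

// Hypothetical defaults, used whenever an attribute is absent.
$defaults = ['value' => '1', 'stroke' => '1', 'angle' => '0', 'colorset' => 'reds'];

$render = function (array $m) use ($defaults) {
    $attrs = [];
    foreach ($defaults as $name => $default) {
        // Unmatched groups are null (or missing), so ?? picks the default;
        // a matched "0" is kept because ?? only tests for null.
        $attrs[$name] = $m[$name] ?? $default;
    }
    return sprintf(
        '<pie v="%s" s="%s" a="%s" c="%s">',
        $attrs['value'], $attrs['stroke'], $attrs['angle'], $attrs['colorset']
    );
};

$subject = 'chart: [block:piechart colorset:blues value:0]';
$html = preg_replace_callback($pattern, $render, $subject, -1, $count, PREG_UNMATCHED_AS_NULL);

echo $html, PHP_EOL;
```

This avoids the extra pass over $matches, at the cost of listing every attribute name in the defaults array.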

Nick
  • Good find! I've worked with PHP & regex for more than 20 years and have never seen this flag, nor would I have had a use case for it. – bobble bubble Jul 06 '22 at 00:51
  • @bobblebubble it's new as of 7.2 so if (like me) you don't read the manual very often you wouldn't have seen it. – Nick Jul 06 '22 at 00:53
  • Thanks, @Nick. Believe it or not, I was visiting the manual at this "Sunday" example yesterday, but ignored it as I classified my problem as "named groups" ;). Yet, the solution with PREG_UNMATCHED_AS_NULL flag and filtering out the empty (null) matches is neat and solves my problem brilliantly. Thanks once more! :) – misioptysio Jul 06 '22 at 09:38
  • ...just a tiny comment at the end: array_filter(...) removes values that eval to false, so "stroke:0" or "value:0" will disappear as well, not only NULLS. – misioptysio Jul 06 '22 at 10:39
  • @misioptysio FYI: `array_filter` without a callback filters out [`empty()`](https://www.php.net/manual/en/function.empty.php) values. Even without `PREG_UNMATCHED_AS_NULL` it would filter out the zero-length strings (same result), but also substrings such as `0`. I think this flag only makes sense if you explicitly check for `null`. – bobble bubble Jul 06 '22 at 10:39
  • @misioptysio `$matches = array_filter($matches, function($v){ return !is_null($v); });` can be an idea to distinguish. – bobble bubble Jul 06 '22 at 10:50
  • @misioptysio apologies for that, I wasn't thinking carefully enough. I've made the edit as proposed by bobblebubble to the answer and demo (with a `0` test value in the new demo to show it working) – Nick Jul 06 '22 at 13:09
  • @bobblebubble you'd think it was late where I was... apparently I needed more coffee. Thanks for the suggestion, I've edited appropriately – Nick Jul 06 '22 at 13:10
  • I say, just build the callback to respect nulls instead of doing the extra loop to filter. I guess we need to see the nittygritty of the processing to give clear advice. The `m` modifier is useless here. Why repeat `\s+` in each branch? – mickmackusa Jul 06 '22 at 19:07
  • @bobblebubble I wasn't precise enough when I wrote my comment, forgive me. I noticed that all empty values had been removed by array_filter(...) lacking a callback function, so I rushed to share the news, but didn't provide a solution with a function checking for NULL values, which was similar to yours. Anyway, the code, the flag, the filtering: they work like a charm right now! :D Thank you so much for your patience and dedication, both Nick and bobblebubble! – misioptysio Jul 06 '22 at 19:16
  • @misioptysio no worries, I found it so funny that we had the same find almost simultaneously :) As mentioned, I'd prefer a completely different approach myself, which I tried to illustrate in my comment on your question: a simple pattern, and use the callback for almost everything else. I just see too many potential issues in repeating plus validating lots of optional groups. Even if I really love regex. – bobble bubble Jul 06 '22 at 19:31
  • @mickmackusa 1. The `m` modifier is useless in this case, that's true, yet I use this piece of code with different patterns, so I left it... um, for compatibility reasons ;) 2. The `\s+`: I parse differently formatted shortcodes, depending on who's entering them. Some people use one space to separate the attributes, others use enter, a couple of spaces etc. The pattern here is that every attr is preceded by at least one whitespace, which separates it from the previous attr. It is straightforward, but may not be the most efficient. I will gladly see a more elegant solution, though. – misioptysio Jul 06 '22 at 20:06
  • @mis I have had to do this a number of times in my previous dev role -- parsing and replacing square braced placeholders with variable attributes. I think you've made a rod for your back by allowing unordered attributes. This will cost readability, brevity, performance (which is important on large texts), and maintainability in your processing code. If it's not too late in your roll out, demand that all attributes after the tagname must be ordered by attribute (alphabetically). Then the regex can simply use successive optional checks. PHP functions, too, have strictly ordered params. – mickmackusa Jul 07 '22 at 02:02

You may not be in favor of a refactor, but that is what I recommend. Ideally, you could dedicate a fully-fledged class, but as a simple demonstration I'll show a couple rudimentary functions.

The goal is not script speed or brevity, but putting maintainability and your development team first.

By establishing a foundational way to identify, parse, and route [block] placeholders, you remove the requirement for future developers to possess a deep understanding of regex. Instead, "block" attributes can be added, altered, or removed with maximum ease.

My buildPiechart() function should not be taken literally. It is a hastily written script which suggests leveraging validation and sanitization of user-supplied data before dynamically building a return string.

Code: (Demo)

function renderBlock(array $m) {
    $callable = "build$m[1]";
    return function_exists($callable)
        ? $callable($m[2] ?? '')
        : $m[0];
}

function buildPiechart(string $payloadString) {
    $values = [
        'angle' => 0,
        'colorset' => 'red',
        'stroke' => 1,
        'value' => 1
    ];
    $rules = [
        'angle' => '/\d+/',
        'colorset' => '/reds|yellows|blues/i',
        'stroke' => '/\d+/',
        'value' => '/\d+/',
    ];
    $attributes = preg_split(
        '/\h+/u',
        $payloadString,
        0,
        PREG_SPLIT_NO_EMPTY
    );
    foreach ($attributes as $pair) {
        // array_pad guards against a pair that has no ':' separator
        [$key, $value] = array_pad(explode(':', $pair, 2), 2, '');
        if (
            key_exists($key, $values)
            && preg_match($rules[$key] ?? '/.*/u', $value, $m)
        ) {
            $values[$key] = $m[0];
        }
    }
    return sprintf(
        '<pie a="%s" c="%s" s="%s" v="%s">',
        ...array_values($values)
    );
}

$text = 'here is a block to be replaced [block:piechart value:25   angle:0]  [block] and [block:notcoded attr:val] another one [block:piechart colorset:reds value:20]';

echo preg_replace_callback(
         '/\[block:([a-z]+)\h*([^\]\r\n]*)\]/u',
         'renderBlock',
         $text
     );

Output:

here is a block to be replaced <pie a="0" c="red" s="1" v="25">  [block] and [block:notcoded attr:val] another one <pie a="0" c="reds" s="1" v="20">

It has been my professional experience that when clients find out that you can provide dynamic placeholder substitutions -- it's like getting the first tattoo -- they are almost certain to want more. The next feature request might be to extend a placeholder to accept more attributes or to support a whole new placeholder. This foundation will save you a lot of time and heartache because the functionality is already abstracted into simpler parts.
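To illustrate how a whole new placeholder could be supported, here is a rough sketch that reuses the renderBlock() router unchanged and adds one new builder. The badge block, its label attribute and the generated markup are all hypothetical; they only show that no change to the routing pattern is needed.

```php
<?php
// Router copied from the code above: dispatches to build<Type>() when it exists.
function renderBlock(array $m) {
    $callable = "build$m[1]";
    return function_exists($callable)
        ? $callable($m[2] ?? '')
        : $m[0];
}

// Hypothetical new block type: [block:badge label:some-text]
function buildBadge(string $payloadString) {
    $label = 'new'; // assumed default when no valid label is supplied
    if (preg_match('/(?:^|\h)label:([a-z0-9-]+)/i', $payloadString, $m)) {
        $label = $m[1];
    }
    return sprintf('<span class="badge">%s</span>', htmlspecialchars($label));
}

$text = 'intro [block:badge label:beta] and [block:unknown x:1]';

$out = preg_replace_callback(
    '/\[block:([a-z]+)\h*([^\]\r\n]*)\]/u',
    'renderBlock',
    $text
);

echo $out, PHP_EOL;
// intro <span class="badge">beta</span> and [block:unknown x:1]
```

Block types without a matching builder, such as [block:unknown ...] here, are left in the text untouched.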

mickmackusa