PHP PREG_REPLACE Returning wrong result depending on order checked

Question

I have stumbled upon a really odd bug with PHP's preg_replace function and some regex patterns. What I'm trying to do is replace custom tags delimited by brackets and convert them to HTML. The regex has to account for custom "fill" tags that will stay with the outputted HTML so that it can be replaced on-the-fly when the page loads (replacing with a site-name for instance).

Each regex pattern will work by itself, but for some reason, some of them will exit the function early if preceded by one of the other patterns is checked first. When I stumbled upon this, I used preg_match and a foreach loop to check the patterns before moving on and would return the result if found - so hypothetically it would seem fresh to each pattern.

This didn't work either.

Check Code:

function replaceLTags($originalString){
    $patterns = array(
                '#^\[l\]([^\s]+)\[/l\]$#i' => '<a href="$1">$1</a>',
                '#^\[l=([^\s]+)]([^\[]+)\[/l\]$#i'=> '<a href="$1">$2</a>',
                '#^\[l=([^\s]+) title=([^\[]+)]([^\[]+)\[/l\]$#i' => '<a href="$1" title="$2">$3</a>',
                '#^\[l=([^\s]+) rel=([^\[]+)]([^\[]+)\[/l\]$#i' => '<a href="$1" rel="$2">$3</a>',
                '#^\[l=([^\s]+) onClick=([^\[]+)]([^\[]+)\[/l\]$#i' => '<a href="$1" onClick="$2">$3</a>',
                '#^\[l=([^\s]+) style=([^\[]+)]([^\[]+)\[/l\]$#i' => '<a href="$1" style="$2">$3</a>',
                '#^\[l=([^\s]+) onClick=([^\[]+) style=([^\[]+)]([^\[]+)\[/l\]$#i' => '<a href="$1" onClick="$2" style="$3">$4</a>',
                '#^\[l=([^\s]+) class=([^\[]+) style=([^\[]+)]([^\[]+)\[/l\]$#i' => '<a href="$1" class="$2" style="$3">$4</a>',
                '#^\[l=([^\s]+) class=([^\[]+) rel=([^\[]+)] target=([^\[]+)]([^\[]+)\[/l\]$#i' => '<a href="$1" class="$2" rel="$3" target="$4">$5</a>'
            );

    foreach ($patterns as $pattern => $replace){
        if (preg_match($pattern, $originalString)){
            return preg_replace($pattern, $replace, $originalString);
        }
    }
}

$string = '[l=[site_url]/site-category/ class=hello rel=nofollow target=_blank]Hello there[/l]';

echo $alteredString = $format->replaceLTags($string);

The above "String" would come out as:

<a href="[site_url">/site-category/ class=hello rel=nofollow target=_blank]Hello there</a>

When it should come out as:

<a href="[site_url]/site-category/" class="hello" rel="nofollow" target="_blank">Hello there</a>

But if moved that pattern further up in the list to be checked sooner, it'd format correctly.

I'm stumped, because it seems like the string is being overwritten somehow every time it's checked even though that makes no sense.

Do you really want to have `return` here? `return preg_replace($pattern, $replace, $originalString);` Should it be `$originalString = preg_replace($pattern, $replace, $originalString);` so the loop can continue processing? — drew010, Sep 22 '11 at 21:14
I thought about that, Drew. But the foreach loop should continue through as long as a match isn't found. You wouldn't want the loop to continue if it's found a match, it'd just be wasting CPU resources. That's why I tell it to return once it's found one. I'd like to get it to pulling out the attribute name and value so I can just have one regex pattern that'd be able to backtrace through all attributes in a tag (no matter the type: img, a, p, div, etc.), but so far I haven't been able to get it working successfully. I spent way too much time on it last week. — rexibit, Sep 22 '11 at 21:25

Alan Moore · Accepted Answer · 2011-09-23T14:10:02.807

2

Seems to me you're doing a lot more work than you need to. Instead of using a separate regex/replacement for each possible list of attributes, why not use preg_replace_callback to process the attributes in a separate step? For example:

function replaceLTags($originalString){
  return preg_replace_callback('#\[l=((?>[^\s\[\]]+|\[site_url\])+)([^\]]*)\](.*?)\[/l\]#',
                               replaceWithinTags, $originalString);
}

function replaceWithinTags($groups){
  return '<a href="' . $groups[1] . '"' . 
         preg_replace('#(\s+\w+)=(\S+)#', '$1="$2"', $groups[2]) .
         '>' . $groups[3] . '</a>';
}

See a complete demo here (updated; see comments).

Here's an updated version of the code based on new information that was provided in the comments:

function replaceLTags($originalString){
  return preg_replace_callback('#\[l=((?>[^\s\[\]]+|\[\w+\])+)([^\]]*)\](.*?)\[/l\]#',
                               replaceWithinTags, $originalString);
}

function replaceWithinTags($groups){
  return '<a href="' . $groups[1] . '"' . 
         preg_replace(
             '#(\s+[^\s=]+)\s*=\s*([^\s=]+(?>\s+[^\s=]+)*(?!\s*=))#',
             '$1="$2"', $groups[2]) .
         '>' . $groups[3] . '</a>';
}

demo

In the first regex I changed [site_url] to \[\w+\] so it can match any custom fill tag.

Here's a breakdown of the second regex:

(\s+[^\s=]+)   # the attribute name and its leading whitespace
\s*=\s*
(
  [^\s=]+   # the first word of the attribute value
  (?>\s+[^\s=]+)*  # the second and subsequent words, if any
  (?!\s*=)  # prevents the group above from consuming tag names
)

The trickiest part is matching multi-word attribute values. (?>\s+[^\s=]+)* will always consume the next tag name if there is one, but the lookahead forces it to backtrack. Normally it would only back off one character at a time, but the atomic group effectively forces it to backtrack by whole words or not at all.

edited Sep 23 '11 at 14:10

answered Sep 22 '11 at 22:19

Alan Moore

73,866
12
100
156

That's awesome Alan! It's missing two things though: " . '>' " after the second item in the groups array so the closing > isn't left out of the first link tag, and the ability to be able to take multiple words for the value of the attribute other than a URL. That was what was holding me up last week when I was trying to make a universal function like yours (yours came out a lot closer). I'd +1 you if I had the points. – rexibit Sep 22 '11 at 22:35
I assumed attribute values couldn't contain whitespace because there were no quotes around any of them in your original string. Are you saying some of them could be quoted? If so, could some be quoted with double-quotes and others with single-quotes, as in HTML? Can an attribute value contain whitespace *without* being quoted, as in `class=hello world`? I fixed the missing angle-bracket error, but I'll hold off on the other issue pending feedback. – Alan Moore Sep 23 '11 at 04:46
In the square bracket tags, attribute values aren't quoted because they are passed before htmlentities is ran, so any quotes would be removed for a special character. – rexibit Sep 23 '11 at 11:55
To your other question about the custom fill tags, only [site_url] would be showing at the beginning of the pattern. The other tags would be [site_name], [category], [manufacturer], [collection], [style], and [color], but those would most likely not be appearing except on product pages and generally only inside div or p tags. I don't think they'd be found inside a title attribute's value because the fill tags are mainly to help madlib content and in order to use for a link the person would need to know a URL to go with it, and then they could just type the style or color out. – rexibit Sep 23 '11 at 11:55
Oh, I modified your code a little bit to be able to handle [l] tags in any case, and to be able to return links if there're no attributes in them. Here's the revised version http://codepad.org/iT7unQ1P . The last two example strings have their attribute values cut off still, though, since there are multiple words. – rexibit Sep 23 '11 at 13:27
That works great! I added the little bits from my previous comment to deal with [l] tags with no attributes and it's working lovely. Thanks for explaining the regex, I'm a noob at backtracing and having expressions check before and after a match. – rexibit Sep 23 '11 at 14:37

score 1 · Answer 2 · answered Sep 22 '11 at 21:20

You messed up the regular expressions. If you print the string on each iteration as:

foreach ($patterns as $pattern => $replace){
    echo "String: $originalString\n";
    if (preg_match($pattern, $originalString)){
        return preg_replace($pattern, $replace, $originalString);
    }
}

you will see that the string is not modified. From my run, I noticed that the second regular expression matches. I placed a third param to the preg_match call and printed the matches. Here is what I got:

Array (
    [0] => [l=[site_url]/site-category/ class=hello rel=nofollow target=_blank]Hello there[/l]
    [1] => [site_url
    [2] => /site-category/ class=hello rel=nofollow target=_blank]Hello there )

Thanks for pointing that out! I see what you're saying. I forgot to escape the trailing closing ] on the first tag for each of the patterns. I'm working on how to get the second pattern to close on a [. Hopefully that'll solve it. — rexibit, Sep 22 '11 at 21:47

score 1 · Answer 3 · edited May 23 '17 at 11:55

The cause of your immediate problem at hand is twofold:

First, there is a typo in the applicable regex (the last one in the array). It has an extraneous literal right square bracket before the: " target=". In other words, this:

'#^\[l=([^\s]+) class=([^\[]+) rel=([^\[]+)] target=([^\[]+)]([^\[]+)\[/l\]$#i'

Should read:

'#^\[l=([^\s]+) class=([^\[]+) rel=([^\[]+) target=([^\[]+)]([^\[]+)\[/l\]$#i'

Second, there are two regexes in the array which both match the same string, and unfortunately the more specific of the two (the regex above which is the one we want), comes second. The other more general regex that matches is the second one in the array:

'#^\[l=([^\s]+)]([^\[]+)\[/l\]$#i'

Placing the more general regex last and removing the extraneous square bracket solves the problem. Here is your original code fixed with the above two changes applied:

function replaceLTags($originalString){
    $patterns = array(
                '#^\[l\]([^\s]+)\[/l\]$#i' => '<a href="$1">$1</a>',
                '#^\[l=([^\s]+) title=([^\[]+)]([^\[]+)\[/l\]$#i' => '<a href="$1" title="$2">$3</a>',
                '#^\[l=([^\s]+) rel=([^\[]+)]([^\[]+)\[/l\]$#i' => '<a href="$1" rel="$2">$3</a>',
                '#^\[l=([^\s]+) onClick=([^\[]+)]([^\[]+)\[/l\]$#i' => '<a href="$1" onClick="$2">$3</a>',
                '#^\[l=([^\s]+) style=([^\[]+)]([^\[]+)\[/l\]$#i' => '<a href="$1" style="$2">$3</a>',
                '#^\[l=([^\s]+) onClick=([^\[]+) style=([^\[]+)]([^\[]+)\[/l\]$#i' => '<a href="$1" onClick="$2" style="$3">$4</a>',
                '#^\[l=([^\s]+) class=([^\[]+) style=([^\[]+)]([^\[]+)\[/l\]$#i' => '<a href="$1" class="$2" style="$3">$4</a>',
                '#^\[l=([^\s]+) class=([^\[]+) rel=([^\[]+) target=([^\[]+)]([^\[]+)\[/l\]$#i' => '<a href="$1" class="$2" rel="$3" target="$4">$5</a>',
                '#^\[l=([^\s]+)]([^\[]+)\[/l\]$#i'=> '<a href="$1">$2</a>'
            );

    foreach ($patterns as $pattern => $replace){
        if (preg_match($pattern, $originalString)){
            return preg_replace($pattern, $replace, $originalString);
        }
    }
}

$string = '[l=[site_url]/site-category/ class=hello rel=nofollow target=_blank]Hello there[/l]';

echo $alteredString = $format->replaceLTags($string);

Note that this only fixes the immediate specific error described in your question and does not address some more fundamental problems with what you are attempting to accomplish. I have presented a somewhat better solution as an answer to your follow-up question: How do I make this REGEX ignore = in a tag's attribute?.

But as others have mentioned, mixing two different markup languages together and processing with regex is asking for trouble.

score 0 · Answer 4 · answered Sep 22 '11 at 22:12

Here is some general purpose code you can use to have less expressions, you could always remove any tags that are not allowed from the final string.

<?php

function replaceLTags($originalString) {
    if (preg_match('#^\[l\]([^\s]+)\[/l\]$#i', $originalString)) {
        // match a link with no description or tags
        return preg_replace('#^\[l\]([^\s]+)\[/l\]$#i', '<a href="$1">$1</a>', $originalString);
    } else if (preg_match('#^\[l=([^\s]+)\s*([^\]]*)\](.*?)\[/l\]#i', $originalString, $matches)) {
        // match a link with title and/or tags
        $attribs = $matches[2];
        $attrStr = '';
        if (preg_match_all('#([^=]+)=([^\s\]]+)#i', $attribs, $attribMatches) > 0) {
            $attrStr = ' ';
            for ($i = 0; $i < sizeof($attribMatches[0]); ++$i) {
                $attrStr .= $attribMatches[1][$i] . '="' . $attribMatches[2][$i] . '" ';
            }
            $attrStr = rtrim($attrStr);
        }

        return '<a href="' . $matches[1] . '"' . $attrStr . '>' . $matches[3] . '</a>';
    } else {
        return $originalString;
    }
}

$strings = array(
    '[l]http://www.stackoverflow.com[/l]',
    '[l=[site_url]/site-category/ class=hello rel=nofollow target=_blank]Hello there[/l]',
    '[l=[site_url]/page.php?q=123]Link[/l]',
    '[l=http://www.stackoverflow.com/careers/ target=_blank class=default]Stack overflow[/l]'
);

foreach($strings as $string) {
    $altered = replaceLTags($string);
    echo "{$altered}<br />\n";
}

Really nice Drew. I'd use that but I need to be able to have multiple words in an attribute's value (like the title, style, or javascript which have spaces). Other than that, really cool. I'd +1 you if I had the points. — rexibit, Sep 22 '11 at 22:38
I see, without having quotes around the values, it does make it a bit harder to use something general like that. You could match on something like `(class|rel|title|target|...)=(...)` but writing the second part of that expression becomes difficult. — drew010, Sep 22 '11 at 22:45

PHP PREG_REPLACE Returning wrong result depending on order checked

4 Answers4

Linked