How do I make this REGEX ignore = in a tag's attribute?

Question

Alan Moore was very helpful in solving my earlier problem, but I didn't realise until just now that the REGEX he wrote for pulling out all of a tag's attributes will break prematurely if there's an equal sign in a URL. I've spent a good while on this, trying different modifications with lookaheads and behinds, to no avail.

I need this regex to break on: space + word + = , but it's breaking even if there's no space, only a letter and an =.

This is mainly an issue when I'm formatting a tag that has an onclick event with JavaScript, such as opening a window or calling a script (form action).

Regex:

#(\s+[^\s=]+)\s*=\s*([^\s=]+(?>\s+[^\s=]+)*(?!\s*=))#i

Code to check on:

 onClick=window.open('http%3A%2F%2Fwww.stackoverflow.com%2Ffakeindex.php%3Fsomevariable%3Dsomevalue','popup','scrollbars=yes,resizable=yes,width=716,height=540,left=0,top=0,ScreenX=0,ScreenY=0'); class=someclass

What it does:

The above breaks on the letter just before the =. In this case, where the URL is encoded, it breaks on the "s" in "scrollbars=yes".
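To make the failure concrete, here is a minimal sketch that runs the regex against a shortened version of the sample (the shortened URL 'page.php' is only an assumption for brevity). The value group stops one letter short of the first = it meets:

```php
<?php
// The regex from above, unchanged.
$re = '#(\s+[^\s=]+)\s*=\s*([^\s=]+(?>\s+[^\s=]+)*(?!\s*=))#i';

// Shortened stand-in for the onClick sample (hypothetical URL).
$input = " onClick=window.open('page.php','popup','scrollbars=yes,width=716'); class=someclass";

preg_match($re, $input, $m);

echo "name:  [{$m[1]}]\n";
echo "value: [{$m[2]}]\n"; // ends at "scrollbar", just before the "s="
```

The negative lookahead rejects "scrollbars" because it is followed by =, so the engine backs off one character and accepts "scrollbar" instead.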

I can encode the URL to get around the =, but the rest of the JavaScript is still a problem, since it contains variables and values that require the =. If the REGEX allowed = and only broke on spaces followed by a word followed by =, like it should be doing, then the JavaScript would work in the custom tags, since the entire onClick string would be captured as the value.

Please read http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags. — bmargulies, Sep 25 '11 at 21:28
I would give up on regex for parsing HTML - it will only lead to headaches. Take [bmargulies](http://stackoverflow.com/users/131433/bmargulies)'s advice and read the link. — Bohemian, Sep 25 '11 at 21:38
I have read that, bmargulies, a few times over the years actually. I prefer to use XPath and parsing libraries for this, but what I'm trying to do can't be done with said libraries (none that I know of). I was actually planning on releasing the PHP code that this finishes with on GitHub when my project for a client is done, since the tag class is pretty generic when you strip away their custom requirements. — rexibit, Sep 26 '11 at 00:46

score 3 · Answer 1 · edited May 23 '17 at 11:55

Disclaimer:

As others have already stated/emphasized, using regex with HTML is fraught with potential gotchas. Doing so with a mix of two intermingled markup languages, like you have here, is even more perilous. There are lots of ways for this solution (and any like it) to fail.

That said...

Answering this question requires an understanding of your preceding question (PHP PREG_REPLACE Returning wrong result depending on order checked). Note that I added an answer to that question as well with a solution consisting of minimal change to the original code. What follows is another answer with a somewhat improved solution. (Both of these answers fix both specific problems.)

Some random comments on your original code:

The expression: [^\s]+ can be shortened to: \S+
With the foreach statement, the order of processing is not guaranteed. (And the order is important here - although this is probably not an issue since the array is declared all at once so should have the correct order.)
You are using ([^\[]+) to capture the attribute value. I think you meant to use ([^\]]+) (but even that is not the best expression).
Using ([^\[]+) (or ([^\]]+)) to capture the attribute value does not allow for square brackets to appear within the value.
The regexes are not written in free spacing mode and contain no comments.
Having unquoted attribute values with multiple words introduces quite a bit of potential ambiguity. What if you wanted to have a title attribute like this: title="CSS class is specified: class=myclass"? You should really be delimiting these attribute values.

A (somewhat) better solution:

Assumptions:

All Ltags will be well formed.
Ltags are never nested.
Ltag attributes are separated by a "SPACE+WORD+=" sequence.
Other [specialtags] may appear anywhere inside an Ltag except within the "SPACE+WORD+=" attribute separator sequences.
No Ltag attribute value ever contains a "SPACE+WORD+=" sequence. This includes multi-word titles and JavaScript snippets inside an onClick.

I assume you know precisely what will be occurring within the Ltag attributes and that they will conform to the above requirements.

Here is a somewhat improved version of replaceLTags(), which uses a callback function to parse and wrap each attribute value with double quotes. The complex regexes are fully commented.

// Convert all Ltags to HTML links.
function replaceLTags($str){
    // Case 1: No URL specified in Ltag open tag: "[l]URL[/l]"
    $re1 = '%\[l\](.*?)\[/l\]%i';
    $str = preg_replace($re1, '<a href="$1">$1</a>', $str);

    // Case 2: URL specified in Ltag open tag: "[l=URL attr=val]linktext[/l]"
    $re2 = '%
        # Match special Ltag construct: [l=url att=value]linktext[/l]
        \[l=                 # Literal start-of-open-Ltag sequence.
        (\S+)                # $1: link URL.
        (                    # $2: Any/all optional attributes.
          [^[\]]*            # {normal*} = Zero or more non-[]
          (?:                # "Unroll-the-loop" (See: MRE3)
            \[[^[\]]*\]      # {special} = matching [square brackets]
            [^[\]]*          # More {normal*} = Zero or more non-[]
          )*                 # End {(special normal*)*} construct.
        )                    # End $2: Optional attributes.
        \]                   # Literal end-of-open-Ltag sequence.
        (.*?)                # $3: Ltag link text contents.
        \[/l\]               # Literal close-Ltag sequence.
        %six';
    return preg_replace_callback($re2, '_replaceLTags_cb', $str);
}
// Callback function wraps values in quotes and converts to HTML.
function _replaceLTags_cb($matches) {
    // Wrap each attribute value in double quotes.
    $matches[2] = preg_replace('/
        # Match one Ltag attribute name=value pair.
        (\s+\w+=)        # $1: Space, attrib name, equals sign.
        (                # $2: Attribute value.
          (?:            # One or more non-start-of-next-attrib
            (?!\s+\w+=)  # If this char is not start of next attrib,
            .            # then match next char of attribute value.
          )+             # Step through value one char at a time.
        )                # End $2: Attribute value.
        /sx', '$1"$2"', $matches[2]);
    // Put humpty back together again.
    return '<a href="'. $matches[1] .'"'.
        $matches[2] .'>'. $matches[3] .'</a>';
}

The main function regex, $re2, matches an Ltag element, but does not attempt to parse individual open tag attributes - it globs (and captures into group $2) all the attributes into one substring. This substring containing all the attributes is then parsed by the regex in the callback function, which uses the desired "SPACE+WORD+=" expression as a separator between name=value pairs.
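As a quick check of the callback's attribute regex on its own, here is a minimal sketch that applies it to an attribute string modelled on the question's sample (the shortened URL 'page.php' is an assumption for brevity). The pattern is the same one used in _replaceLTags_cb(), just written on one line:

```php
<?php
// Attribute substring as it would arrive in $matches[2] (hypothetical, shortened URL).
$attrs = " onClick=window.open('page.php','popup','scrollbars=yes,width=716'); class=someclass";

// Same pattern as in the callback: name, then every char up to the next "SPACE+WORD+=".
$quoted = preg_replace('/(\s+\w+=)((?:(?!\s+\w+=).)+)/s', '$1"$2"', $attrs);

echo $quoted, "\n";
// Prints:  onClick="window.open('page.php','popup','scrollbars=yes,width=716');" class="someclass"
```

The = signs inside the JavaScript are left alone because none of them is preceded by whitespace plus a word, which is the only thing the lookahead treats as the start of a new attribute.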

Note that this function can be passed a string containing multiple Ltags and all will be processed in one go. It will also correctly handle IPv6 literal URL addresses such as: http://[::1:2:3:4:5:6:7] (which contain square brackets).

If you insist on going down this road, I would recommend using a delimiter for the attribute values. I know you said that you can't use the double quote for some reason, but you could use a special character such as '\1' (ASCII 001), then replace that with double quotes in the callback function. This would dramatically cut down on the list of possible ways for this to fail.
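A minimal sketch of that delimiter idea, assuming the tag author wraps every value in "\x01" (the variable names and the &quot; escaping step are my own illustration, not part of the code above):

```php
<?php
$d = "\x01"; // ASCII 001 used as the value delimiter (assumption)

// Attribute string with delimited values; the title contains "SPACE+WORD+=".
$attrs = " title={$d}CSS class is specified: class=myclass{$d} class={$d}myclass{$d}";

// Escape any literal double quotes first, then turn each delimiter into a quote.
$html = str_replace($d, '"', str_replace('"', '&quot;', $attrs));

echo $html, "\n";
// Prints:  title="CSS class is specified: class=myclass" class="myclass"
```

With explicit delimiters there is no need to guess where a value ends, so a value containing "class=myclass" no longer gets split.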

I do wish S.O. would *reliably* inform me when an answer is posted while I'm composing. Fascinating, though, how our solutions ended up so similar despite the very different choices we made along the way. Unrolled loop vs. brute-force alternation with possessive quantifiers/atomic groups, splitting on spaces between name=value pairs vs. actively matching them,... This is why I love this stuff! — Alan Moore, Sep 28 '11 at 01:55
@Alan Moore - I was thinking the very same thing when I came back and saw your answer. A love of solving complex regex problems such as this one is a rather rare attribute I think. It sure was a pleasant surprise when I discovered this place (SO) and found that I am not alone. Cheers! — ridgerunner, Sep 28 '11 at 03:37

score 0 · Answer 2 · answered Sep 27 '11 at 23:25

If you can guarantee that the "space + word + =" pattern will never occur inside an attribute value, you could split the string on this regex:

\s+(?=\w+=)
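For illustration, a minimal sketch of that split on a made-up attribute string (the sample values are assumptions, not taken from a real tag):

```php
<?php
// Hypothetical attribute string: JavaScript with embedded "=" plus a multi-word title.
$attrs = "onClick=window.open('u','popup','scrollbars=yes,width=716'); class=someclass title=My link";

// Split only on whitespace that is followed by WORD and "=".
$pairs = preg_split('~\s+(?=\w+=)~', $attrs);

print_r($pairs);
// [0] => onClick=window.open('u','popup','scrollbars=yes,width=716');
// [1] => class=someclass
// [2] => title=My link
```

The space inside "My link" is not a split point because "link" is not followed by =, and the = signs inside the JavaScript are not preceded by whitespace.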

That actually simplifies the problem quite a bit. The code below assumes the URL (which may contain custom [fill] tags) ends at the first whitespace (if present) or at the closing bracket of the [l] tag. Everything after the first whitespace is assumed to be a series of whitespace-separated name=value pairs, where the name always matches ^\w+$ and the value never contains a match for \s+\w+=. Values may also contain [fill] tags.

function replaceLTags($originalString)
{
  return preg_replace_callback(
    '#\[l=((?>[^\s\[\]]++|\[\w+\])+)(?:\s+((?>[^\[\]]++|\[\w+\])+))?\](.*?)\[/l\]#',
    'replaceWithinTags', $originalString);
}

function replaceWithinTags($groups)
{
  $result = "<a href=\"$groups[1]\"";
  $attrs = preg_split('~\s+(?=\w+=)~', $groups[2]);
  foreach ($attrs as $a)
  {
    $result .= preg_replace('#\s*(\w+)=(.*)#', ' $1="$2"', $a);
  }
  $result .= ">$groups[3]</a>";
  return $result;
}


I'm also assuming there are no double-quotes in the attribute values. If there are, the replacement will still work but the resulting HTML will be invalid. If you can't guarantee the absence of double-quotes, you may have to URL-encode them or something before doing these replacements.
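One possible way to handle that case is sketched below. Note that it swaps in HTML-escaping via htmlspecialchars() rather than URL-encoding, and quoteAttribute() is a hypothetical helper, not part of the code above:

```php
<?php
// Hypothetical helper: quote one name=value pair, HTML-escaping the value
// so an embedded double quote cannot terminate the attribute early.
function quoteAttribute(string $pair): string
{
    return preg_replace_callback(
        '#^\s*(\w+)=(.*)$#s',
        function (array $m): string {
            // ENT_COMPAT escapes & < > and double quotes, but not single quotes.
            return ' ' . $m[1] . '="' . htmlspecialchars($m[2], ENT_COMPAT, 'UTF-8') . '"';
        },
        $pair
    );
}

$out = quoteAttribute('title=He said "hi"');
echo $out, "\n";
// Prints:  title="He said &quot;hi&quot;"
```

Something like this could stand in for the preg_replace() call inside the foreach loop, at the cost of also escaping &, < and > in the values.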
