1

I have 2 regex below which works in php 5.6 but will not work in php 7.x if the character + white space > 1035 ish. Ex: [spoiler]1035 count of char + ws[/spoiler] . I used site, http://regexr.com/, to test the regex with http://pastebin.com/5bWzhNvy and it seems to work fine.

From PHP 7 regex not working the same way as in 5, i can see that issue could be with regex timing out? But here's the code unless I am mistaken.

// [spoiler]Text[/spoiler]
preg_match_all('/\[spoiler\]\s*((\s|.)+?)\s*\[\/spoiler\]/', $s, $matches, PREG_SET_ORDER);
foreach ($matches as $match) {
    $id = substr(md5($match[1]), 0, 10) . mt_rand(0, 10000);
    $s  = str_replace($match[0],
                      '<a href="javascript:klappe_function(\'' . $id . '\')" style="color: #aaa" title="Open Spoiler"><img src="images/plus.png" id="pic' . $id . '" class="spoiler" alt="spoiler" /></a><div id="k' . $id . '" style="display: none;">' . $match[1] . '</div><br />',
                      $s);
}

// [spoiler=Heading]Text[/spoiler]
preg_match_all('/\[spoiler=(.+?)\]\s*((\s|.)+?)\s*\[\/spoiler\]/', $s, $matches, PREG_SET_ORDER);
foreach ($matches as $match) {
    $id = substr(md5($match[2]), 0, 10). mt_rand(0, 10000);
    $s  = str_replace($match[0],
                      '<a href="javascript:klappe_function(\'' . $id . '\')" style="color: #aaa" title="Open ' . myhtmlentities($match[1], ENT_QUOTES)  . '"><img src="images/plus.png" id="pic' . $id . '" class="spoiler" alt="spoiler" /> <b>' . $match[1] . '</b></a><div id="k' . $id . '" style="display: none;">' . $match[2] . '</div><br />',
                      $s);
}
Community
  • 1
  • 1
cudencuden
  • 13
  • 5
  • 1
    It is not surprising, the regex is a mess. If you replace `(\s|.)+?` (a "regex performance killer") with `.+?` and use `/s` modifier, it will already be much more efficient, but still not that efficient. – Wiktor Stribiżew Feb 20 '17 at 20:48
  • 1
    The regex searches for any text inside spoiler tags while stripping surrounding whitespace. You might be better suited simply capturing all the text inside the spoiler tags and then whitespace stripping it as a seperate step – Will Barnwell Feb 20 '17 at 20:48
  • `.` matches `\s` already, your regex is probably running slow and timing out because its doing an enormous amount of backtracking. – Will Barnwell Feb 20 '17 at 20:50
  • 2
    @WillBarnwell: not exactly `.` matches all except the newline character. – Casimir et Hippolyte Feb 20 '17 at 20:53
  • 2
    Instead of using `preg_match_all`, a `foreach` loop, and `str_replace`, you should use `preg_replace_callback`. – Casimir et Hippolyte Feb 20 '17 at 20:58
  • correction: `.` and `\s` have serious overlap – Will Barnwell Feb 20 '17 at 21:00

1 Answers1

-1

As the comments say, the regex is very inefficient, and is probably the cause - you can do all this with one regex:

\[spoiler(:?\=([^\]]+))?\]((:?[^\[]+|\[(?!=\/spoiler\]))+)\[\/spoiler\]

  • \[spoiler matches [spoiler
  • (:? starts a non capturing group (wont be in $matches)
    • \= matches =
    • ( start first capturing group
    • [^\]]+ matches anything other than a ] char one or more times.
    • ) close first capturing group
  • )? closes group and makes it optional
  • ( start second matching group
    • (:? starts a non capturing group (wont be in $matches)
      • [^\[]+ matches anything other than a [ char one or more times
      • | or
      • \[(?!=\/spoiler\] a single [ char as long as it isn't followed by /spoiler]
    • )+ repeat group one or more times
  • ) close second matching group
  • \[\/spoiler\] matches [/spoiler]

This takes care of both scenarios (just check if $match[1] is empty to tell which) , but it doesn't try ans strip whitespace, you can easily do this in php after using trim($match[1],"\r\n\t ")

Two key things here:

Matching the "header" text

[^\]]+ matches anything other than a [ char one or more times.

This is much faster than looking for any char - which would have caused 'backtracking' see: http://www.regular-expressions.info/catastrophic.html

matching the spoiler contents

  • (:? starts a non capturing group (wont be in $matches)
    • [^\[]+ matches anything other than a [ char one or more times
    • | or
    • \[(?!=\/spoiler\] a single [ char as long as it isn't followed by /spoiler]
  • )+ repeat group one or more times

This is allowing for [ chars inside your spoiler text, but again without causing backtracking. It does this by looking for either a series of non [ chars or a single [ char as long as it isn't immediately followed by /spoiler] - this is a 'negative lookahead' see: http://www.regular-expressions.info/lookaround.html

All that being said, regex is probably not the best tool for the job if you want to do it efficiently, a tokenizer or perhaps preg_split would probably be better.

Theo
  • 1,608
  • 1
  • 9
  • 16
  • do you need the `+` in `[^\[]+`? it seems redundant to the `+` on the outside of that non-capturing group – Will Barnwell Feb 20 '17 at 21:26
  • The first instance is to capture your "header" text, this is better than looking for any character, as this was causing the regex engine to do much more work. The second instance is so that you can still have `[` chars inside your spoilers, so it looks for either anything other than a `[` char , or specifically a `[` character that isn't followed by the rest of the closing tag text. – Theo Feb 20 '17 at 21:35