
Consider this string (notice the horizontal scroll - the string is long):

$content = 'Xxxxxx xx xxxx xxxxxx/xxxx xxxxxxx xx xxxxx xx xxx   XXXXXXX/XXXXX XXXX   XXXXXXX XXXX   XXXXXX                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                     
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          XXXXX XXXXXX   XXXXXX   XXXXXX XXXXX   XXXXXX';

I have my own mb_trim() function to support Unicode strings, but I found that it performs really badly on this particular string.

After debugging, I realized that it's just the "end-of-string" bit that doesn't perform, while "beginning-of-string" is fine.

So, just doing this (minimal code):

$trim = preg_replace('/\s+$/u', '', $content);

This takes 2 to 3 seconds.
Even without the u modifier, it still takes about 1.60s.

If I replace the spaces in the middle with some letter, the preg_replace will take 0s.

Is there a way to fix this performance issue?

It's funny that if I run this:

$trim = preg_replace('/\s{2,}/u', ' ', $content);
$trim = preg_replace('/\s+$/u', '', $trim);

This will run fast.
But I don't understand why the spaces in the middle of the string are a problem for an "end-of-string" regex. I would expect it to be optimized so that it only looks at the end of the string and not at the middle.

--

UPDATE - This takes about 2s on the server running AlmaLinux (even though it has a very good CPU and plenty of RAM) and in a Docker container running CentOS 7 on Windows. But if I run the script on Windows itself, it runs instantly. It also runs fast on 3v4l.

I tried on another Linux host running PHP 7.4, and it took 5.4s.

I wonder what could be causing the hang on the Linux systems above?

Nuno
  • First preg works fine for me `time php8.1 test.php` -> `Executed in 76.09 millis, usr time 35.12 millis, sys time 22.40 millis` – Marcin Orlowski Sep 18 '22 at 12:37
  • Strange.. It takes 3s on my Linux server (huge CPU & RAM) and on my Windows locally. – Nuno Sep 18 '22 at 12:39
  • FYI: PHP 8.1.10 (cli) (built: Sep 18 2022 10:26:02) (NTS) – Marcin Orlowski Sep 18 '22 at 12:43
  • PHP 8.1.7 here. Seems to be pretty fast also on 3v4l, even if I use my version. I'll continue to investigate... – Nuno Sep 18 '22 at 12:47
  • Ok. When I say I run on Windows above, I meant a Docker container which runs CentOS 7. The server runs AlmaLinux. Both those take 2+ seconds, whether it's through PHP-FPM or CLI. If I run the same script on my Windows (not in docker), it takes 0s... – Nuno Sep 18 '22 at 12:52
  • Have you tried with a [possessive quantifier](https://www.regular-expressions.info/possessive.html): `\s++$` – bobble bubble Sep 18 '22 at 13:34
  • Thank you. That's very interesting... `/\s++$/u` seems to run instantly! Never heard about "possessive". I'll have a read. But do you have any idea why the original regex would not perform well on the various Linux systems, but does on Windows and 3v4l? Thank you! – Nuno Sep 18 '22 at 14:03
  • @Nuno I don't know why this runs well on windows or 3v4l because there is a whole lot of backtracking even [in much shorter strings already (demo)](https://regex101.com/r/ZpAOGv/1). See the steps counter or click on left side the debugger. Guess these environments where it "performs well" have just set a pcre.backtrack_limit at a low value. – bobble bubble Sep 18 '22 at 14:09
  • @bobblebubble - In 3v4l, backtrack_limit = 1000000, same as mine. If I put a backtrack_limit that is too low, the regex fails and I get NULL. Thank you - really appreciate the time. I can see in that demo how using `++` reduces the steps a lot! – Nuno Sep 18 '22 at 14:29
  • I'm surprised that there appears to be no PHP [wrapper for RE2](https://github.com/google/re2/wiki/Install) – jhnc Sep 18 '22 at 16:11
  • @jhnc - interesting! Didn't know about it. Thank you. But based on its description, it might not fix the issue discussed here. https://github.com/google/re2/wiki/WhyRE2 – Nuno Sep 19 '22 at 18:34
  • RE2 would definitely help here: https://regex101.com/r/eb4RUo/1 – jhnc Sep 19 '22 at 22:41
  • @Nuno I just stumbled upon [this blogpost](https://mamchenkov.net/wordpress/2016/07/21/the-regex-that-killed-stackoverflow/) that seems related to your question :) Looks like you're not the only one who has struggled with this! – bobble bubble Oct 08 '22 at 18:52
  • Ah, nice! That was exactly the same thing! :) thanks for sharing this. – Nuno Oct 09 '22 at 19:11

1 Answer


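My answer comes late! The poor performance results from heavy backtracking when the input contains long runs of whitespace that are not at the end of the string. The engine attempts a match at every whitespace position, consumes the run greedily, fails at $, and then gives the characters back one at a time before moving on to the next starting position. It looks like even Stack Overflow ran into a similar problem. I recently stumbled upon this old blog post: The RegEx that killed StackOverflow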
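To make the cost of that backtracking concrete, here is a toy model in PHP. It is not PCRE itself and its numbers will not match the step counter at regex101; it only counts how often a naive backtracking matcher would test `$` for `\s+$`. The function name `countEndAnchorChecks` is made up for this illustration.

```php
<?php
// Toy model (an assumption, NOT how PCRE is implemented internally):
// counts how many times a naive backtracking matcher tests `$` for /\s+$/.
function countEndAnchorChecks(string $subject): int
{
    $len = strlen($subject);
    $checks = 0;
    for ($start = 0; $start < $len; $start++) {
        // Greedy \s+ : length of the ASCII whitespace run beginning at $start.
        $run = strspn($subject, " \t\n\r\x0B\f", $start);
        if ($run === 0) {
            continue; // \s+ cannot match here, try the next start position
        }
        $checks++; // `$` is tested once after the full greedy run
        if ($start + $run === $len) {
            return $checks; // the run reaches the end of the string: match
        }
        // `$` failed: give back one character at a time, testing `$` each time.
        $checks += $run - 1;
    }
    return $checks;
}

// Doubling the run of spaces in the middle roughly quadruples the work.
echo countEndAnchorChecks('a' . str_repeat(' ', 1000) . 'b'), "\n"; // 500500
echo countEndAnchorChecks('a' . str_repeat(' ', 2000) . 'b'), "\n"; // 2001000
```

Under this model a run of n spaces in the middle costs n(n+1)/2 checks, while whitespace that really is at the end is found on the first attempt.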

The first and most obvious idea is to use a possessive quantifier to reduce backtracking:

\s++$
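As a minimal sketch, this is how the possessive pattern could be dropped into a right-trim helper. The function name `mb_rtrim_possessive` is hypothetical and is not the asker's original mb_trim(); falling back to the input when preg_replace() returns null is also just one possible choice.

```php
<?php
// Hypothetical helper name, for illustration only.
function mb_rtrim_possessive(string $text): string
{
    // \s++ is possessive: once the whitespace run is consumed,
    // the engine will not give characters back when `$` fails.
    $result = preg_replace('/\s++$/u', '', $text);

    // preg_replace() returns null on error (for example invalid UTF-8 with /u);
    // returning the input unchanged in that case is an assumption of this sketch.
    return $result ?? $text;
}

var_dump(mb_rtrim_possessive("h\u{00E9}llo   wörld \t\n"));
```

Whitespace in the middle of the string is left alone; only the trailing run is removed.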

With this demo input at regex101, it shows almost a hundred times fewer steps than the version without the possessive quantifier.


A little trick works even better: let the regex consume everything up to the last non-whitespace character \S first:

^(?s)(?>.*\S\K)?\s+$

This gets it down to just 20 steps, from more than 200k steps initially, with the demo input.

  • The s flag (single line) makes the dot also match newlines
  • (?> atomic group ) for failing fast if no whitespace at the end
  • \K resets beginning of the reported match after the last \S
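The pattern above can be used from PHP roughly like this. It is a sketch under the same assumptions as before: the function name `mb_rtrim_keep_out` is invented for the example, and returning the input on a null result is a choice made here, not something the pattern requires.

```php
<?php
// Hypothetical helper name, for illustration only.
function mb_rtrim_keep_out(string $text): string
{
    // (?s)      dot also matches newlines
    // (?>.*\S)  atomic group: jump to the last non-whitespace character
    // \K        report the match as starting after that character
    // \s+$      the trailing whitespace that actually gets replaced
    $result = preg_replace('/^(?s)(?>.*\S\K)?\s+$/u', '', $text);

    // null signals a PCRE error; keeping the input is an assumption of this sketch.
    return $result ?? $text;
}

var_dump(mb_rtrim_keep_out("line one\n   line two  \n"));
```

Because the group is optional, a string made only of whitespace is still trimmed to an empty string, and a string without trailing whitespace is returned unchanged.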

Here is a little PHP benchmark at tio.run that compares the performance of all three variants.
An alternative for other regex flavors is to replace ^([\S\s]*\S)?\s*$ with $1 (or \1).
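For readers who want to try this locally, a comparison along these lines could look as follows. This is not the benchmark linked above, only a rough sketch: the test input, the `$middleSpaces` size, and the labels are assumptions, and timings will vary with the PHP build and PCRE settings. It needs PHP 7.3+ for hrtime().

```php
<?php
// Assumed test input: a long run of spaces in the MIDDLE plus a few at the end.
// Raise $middleSpaces to make the difference between the variants more visible.
$middleSpaces = 1000;
$content = 'start' . str_repeat(' ', $middleSpaces) . 'end   ';

// label => [pattern, replacement]
$variants = [
    'greedy'        => ['/\s+$/u', ''],
    'possessive'    => ['/\s++$/u', ''],
    'atomic + \K'   => ['/^(?s)(?>.*\S\K)?\s+$/u', ''],
    'capture group' => ['/^([\S\s]*\S)?\s*$/u', '$1'],
];

$results = [];
foreach ($variants as $label => [$pattern, $replacement]) {
    $t0 = hrtime(true); // nanoseconds
    $results[$label] = preg_replace($pattern, $replacement, $content);
    $ms = (hrtime(true) - $t0) / 1e6;
    printf("%-14s %10.3f ms\n", $label, $ms);
}
```

All four variants should produce the same trimmed string; only the time spent differs.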

bobble bubble