Consider this string (notice the horizontal scroll - the string is long):
$content = 'Xxxxxx xx xxxx xxxxxx/xxxx xxxxxxx xx xxxxx xx xxx XXXXXXX/XXXXX XXXX XXXXXXX XXXX XXXXXX XXXXX XXXXXX XXXXXX XXXXXX XXXXX XXXXXX';
I have my own mb_trim()
function to support Unicode strings, but I found it performs really badly on this string in particular.
After debugging, I realized that only the "end-of-string" part performs badly; the "beginning-of-string" part is fine.
So, just doing this (minimal code):
$trim = preg_replace('/\s+$/u', '', $content);
This takes 2 to 3 seconds. Even without the u
modifier, it still takes ~1.6s.
If I replace the spaces in the middle of the string with some letter, the preg_replace
takes essentially 0s.
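For anyone wanting to reproduce the measurements above without my real data: here is a self-contained sketch I'd use, with a synthetic stand-in string (`str_repeat` builds long runs of internal spaces, which seem to be the trigger; the exact word/space counts are arbitrary):

```php
<?php
// Synthetic stand-in for the real content: words separated by long
// runs of internal spaces, ending in a non-whitespace character,
// so the trailing-trim pattern ultimately matches nothing.
$content = str_repeat('word' . str_repeat(' ', 50), 500) . 'end';

$start   = microtime(true);
$trim    = preg_replace('/\s+$/u', '', $content);
$elapsed = microtime(true) - $start;

// The timing (not the result) is what differs between hosts.
printf("len=%d elapsed=%.3fs\n", strlen($content), $elapsed);
```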
Is there a way to fix this performance issue?
It's funny that if I run this:
$trim = preg_replace('/\s{2,}/u', ' ', $content);
$trim = preg_replace('/\s+$/u', '', $trim);
This will run fast.
But I don't understand why spaces in the middle of the string are a problem for an "end-of-string" regex. I'd have thought it would be optimized to look only at the end of the string, not at the middle.
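To illustrate the "only look at the end" idea: it can be written directly, without letting the regex engine scan the whole string for match start positions. This is only a sketch of a hypothetical mb_rtrim (not my actual mb_trim), and it assumes the string is valid UTF-8 with mb_internal_encoding set to UTF-8:

```php
<?php
// Hypothetical regex-free trailing trim: inspect one character at a
// time from the end, so internal whitespace is never even looked at.
function mb_rtrim(string $s): string
{
    // /\s$/u tests a single character; with the u modifier PHP's
    // \s also covers Unicode whitespace such as NBSP.
    while ($s !== '' && preg_match('/\s$/u', mb_substr($s, -1))) {
        $s = mb_substr($s, 0, -1); // drop the last character
    }
    return $s;
}
```

This does more total work per trailing-whitespace character than a single well-behaved regex would, but its cost depends only on how much trailing whitespace there is, not on what's in the middle of the string.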
--
UPDATE - This takes the ~2s on a server running AlmaLinux (even though it has a very good CPU and plenty of RAM) and in a Docker container running CentOS 7 on a Windows machine. But if I run the script on Windows itself, it finishes instantly. It also runs fast on 3v4l.
I tried on another Linux host running PHP 7.4, and it took 5.4s.
I wonder what could be causing the hang on the Linux systems above?
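In case it helps with reproducing: my guess (unverified) is that the hosts differ in their PCRE build or JIT setting, so here is how I'd dump both on each machine:

```php
<?php
// Report the PCRE library version PHP was built against and the
// pcre.jit ini setting; either could differ between the hosts above.
echo 'PCRE ' . PCRE_VERSION . "\n";
var_dump(ini_get('pcre.jit'));
```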