Different results for unicode/multibyte modifier and mb_ereg_replace

Question

This regex seems to be very problematic:

(((?!a).)*)*a\{

(I know the regex is terrible. That is not the question here.)

when tested with this string:

AAAAAAAAAAAAAA{AA

The letters A and a could be replaced with pretty much anything and result in the same problem.

This regex and test string pair is condensed. The full example can be found here.

This is the code that I used to test:

<?php
$regex = '(((?!a).)*)*a\\{';
$test_string = 'AAAAAAAAAAAAAA{AA';
echo "1:".mb_ereg_replace('/'.$regex.'/','WORKED',$test_string)."\n";
echo "2:".preg_replace('/'.$regex.'/u','WORKED',$test_string)."\n";
echo "3:".preg_replace('/'.$regex.'/','WORKED',$test_string)."\n";

The results can be viewed here:

http://3v4l.org/Yh6FU

The ideal result would be that the same test string is returned because the regex does not match.

When using preg_replace with the u modifier, it should have the same results as mb_ereg_replace according to this comment:

php multi byte strings regex

mb_ereg_replace works exactly as it should. It returns the test string because the regex does not match.

However, preg_replace does not seem to work in PHP versions other than 4.3.4 - 4.4.5 and 4.4.9 - 5.1.6.

For some PHP versions, the result is an error:

Process exited with code 139.

For some other PHP versions, the result is NULL.

For the rest, mb_ereg_replace did not exist yet.

Also, removing just a single letter from either the string or the regex seems to completely alter which versions of PHP have which results.

Judging from this comment:

php multi byte strings regex

ereg* should be avoided, which makes sense since it is slower and supports less than preg* does. This makes using mb_ereg_replace undesirable. However, there is no mb_preg_replace function, so mb_ereg_replace seems to be the only option that works.

So, my question is:
Is there any alternative to mb_ereg_replace that would work correctly for the given string and regex pair?

I don't think the problem has anything to do with the fact that you are using `mb_ereg_replace()` or whatever. I believe it has to do with [catastrophic backtracking](http://www.regular-expressions.info/catastrophic.html), since RegEx101's debugger shows 119,991 steps before giving up...I'm sure each version of PHP implements it slightly different and you get unexpected results. My point is: *what do you* ***want*** *to match?* — Sam, May 28 '14 at 01:43
@Sam This is part of a *much* larger regex that has the same error. I just narrowed it down to the part causing the error. There is no "catastrophic backtracking" in the actual. — Anonymous, May 28 '14 at 01:45
I really beg to disagree (but you can show the whole regex to me). Remove the backtracking ([`((?!a).*)a\{`](http://regex101.com/r/pP2iW8) brings the number of steps to 79, .06% of the steps), and the [results are all normalized](http://3v4l.org/umOa4). — Sam, May 28 '14 at 01:50
I'm still having the catastrophic backtracking issues. The `((...)*)*` is a prime example of what can cause these problems, and it gets even worse when your expression hits an `a` because of the negative lookahead in this pattern. Can you please update the question with what you want to match and I can do my best to show you that a different expression will accomplish what you want with consistency across functions/versions of PHP. — Sam, May 28 '14 at 02:08
@Sam Sorry, I used the wrong regex in the test example. I will update that when I have a free moment. — Anonymous, May 28 '14 at 10:40
@Sam The actual regex and string pair has been added to the question. Sorry about that. — Anonymous, May 28 '14 at 19:07
There's still catastrophic backtracking (after a few matches) on this example. I give up on this, but if you want to explain what you are trying to match I will create a cleaner regex. And I will make sure it works with the `u` modifier exactly the same as `mb_ereg_replace()`, like you've asked. — Sam, May 28 '14 at 19:37
Basically, it tries to match something such as `if (test(testing) == 'test') {` or `while (test) {` but not `<br>(test) {` nor `while ((test) {`. — Anonymous, May 28 '14 at 19:46
@Jerry It was wrong because I would imagine that it would match the second `(` and first `)`, but I wanted it to match the full parentheses group. So, I guess it isn't necessarily wrong, but it wouldn't be easy to account for in the regex. Maybe I should have instead said not `while (test)) {`. — Anonymous, May 29 '14 at 19:28
Hmm, what about a regex like [this one](http://regex101.com/r/qL3oV2)? — Jerry, May 29 '14 at 19:37
@Jerry Yes, that works perfectly. Thank you. I guess I was just over complicating it. — Anonymous, May 29 '14 at 19:43
Well, that's not a simple regex (it uses recursion) and I don't know how it works with your whole data. Usually, regex is not really appropriate for html, but if you're ready to take whatever error you might get and the headache that might also accompany those, then go ahead :) — Jerry, May 29 '14 at 19:51
@Jerry I'm completely aware of all those "Don't use regex for HTML" arguments, but I find it to be a challenge, so I am trying it. It's not meant to be perfect, but works correctly most of the time. If it were for something more important, then yes, I would avoid regex. — Anonymous, May 29 '14 at 19:54

Mofi · Accepted Answer · 2014-05-30T12:56:07.007

Do you know the difference between (...) and (?:...)?

(...) defines a marking (capturing) group. The string matched by the expression within the round brackets is stored internally so that it can be back-referenced.

(?:...) defines a non-marking (non-capturing) group. The string matched by the expression within the parentheses is not stored. Such a group is often used to apply an expression several times to a string.

Now let us take a look at your expression (((?!a).)*)*a\{. Used in a Perl regular expression find in the text editor UltraEdit, it results in the error message: The complexity of matching expression has exceeded available resources.

(?!a). matches a single character, provided that this character is not the letter 'a': the lookahead checks the current position and the dot then consumes the character. Okay. But you want to find a string of 0 or more characters up to the letter 'a'. Your solution is: ((?!a).)*

That is not a good solution. For every character the engine now has to look ahead for the letter 'a', match the character if it is not an 'a', store it as a string for back-referencing, and then continue with the next character. I don't know exactly what happens internally when a multiplier is used on a marking group as done here; as far as the result is concerned, PCRE keeps only the text matched by the last iteration. A multiplier should generally not be used directly on a marking group. So better would be (?:(?!a).)*.

Next you extend the expression to (((?!a).)*)*. One more marking group with a multiplier?

It looks like you want to mark the entire string not containing the letter 'a'. In that case it would be better to use ((?:(?!a).)*), as this defines one and only one marking group for the string found by the inner expression.

So the better expression would be ((?:(?!a).)*)a\{ as there is now only 1 marking group without a multiplier on the marking group. Now the engine knows exactly which string to store in a variable.
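To make the difference between the two groupings visible, here is a minimal sketch. The input `xyza` is a made-up sample, not taken from the question, and the trailing `\{` is left out to keep the check short.

```php
<?php
// Hypothetical sample input; any text ending in "a" would do.
$subject = 'xyza';

// Multiplier directly on the marking group: group 1 is overwritten
// on every iteration, so only the last character survives.
preg_match('/((?!a).)*a/', $subject, $repeated);

// One marking group around a repeated non-marking group:
// group 1 holds the whole run of characters before the "a".
preg_match('/((?:(?!a).)*)a/', $subject, $wrapped);

echo $repeated[1], "\n"; // z
echo $wrapped[1], "\n";  // xyz
```

Both patterns match the same overall text; only the content of group 1 differs.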

Much faster would be ([^a]*?)a\{, as this non-greedy negated character class also matches a string of 0 or more characters left of a{ that does not contain the letter 'a'. (Unlike the dot, [^a] also matches line breaks.) Lookahead should be avoided where it is not necessary, because each lookahead is an extra check the engine has to perform on every character.
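As a quick check of the simplified expression, the following sketch runs it with preg_replace on the test string from the question, with and without the u modifier. The second input, `bbba{cc`, is an invented example that does contain `a{`.

```php
<?php
$testString = 'AAAAAAAAAAAAAA{AA';

// The test string contains no lowercase "a", so nothing can match
// and preg_replace returns the subject unchanged.
$plain   = preg_replace('/([^a]*?)a\{/', 'WORKED', $testString);
$unicode = preg_replace('/([^a]*?)a\{/u', 'WORKED', $testString);

// Hypothetical input that does contain "a{".
$replaced = preg_replace('/([^a]*?)a\{/', 'WORKED', 'bbba{cc');

var_dump($plain === $testString);   // bool(true)
var_dump($unicode === $testString); // bool(true)
echo $replaced, "\n";               // WORKEDcc
```

This only shows that the simplified expression behaves the same with and without the u modifier on this input; it says nothing about the full expression from the sandbox.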

I don't know the source code of the PHP functions mb_ereg_replace and preg_replace. It would have to be examined step by step with this expression to find out exactly why the results differ.

However, the expression (((?!a).)*)*a\{ definitely results in heavy recursion, because it is not defined when to stop matching data and what to store temporarily. So both functions (most likely) allocate more and more memory from the stack, and perhaps also from the heap, until either a stack overflow or a "not enough free memory" condition occurs.

Exit code 139 indicates a segmentation fault (a memory access violation). It is caused either by an uncaught stack overflow, or by malloc() returning NULL on an attempt to allocate more heap memory and that return value being ignored. I suppose that malloc() returning NULL is the reason for exit code 139.
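The number itself follows the usual shell convention that a process killed by signal N is reported with status 128 + N, and signal 11 is SIGSEGV on Linux. A tiny sketch of that arithmetic, where the function name is made up for illustration:

```php
<?php
/**
 * Hypothetical helper: translate a shell exit status into a signal number.
 * Returns null when the status does not follow the 128 + N convention.
 */
function signal_from_exit_status(int $status): ?int
{
    return ($status > 128 && $status < 256) ? $status - 128 : null;
}

echo signal_from_exit_status(139), "\n"; // 11, i.e. SIGSEGV on Linux
```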

So the difference most likely lies in the error or exception handling of the two functions. Catching a memory exception, or counting the recursive iterations and exiting when there are too many of them to prevent a memory exception before it really occurs, could be the reason for the different behavior on this expression.
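On the preg side, such a counter is exposed to the script: the ini settings pcre.backtrack_limit and pcre.recursion_limit cap the work, preg_replace returns NULL when matching is aborted, and preg_last_error() reports why. The sketch below assumes PHP 7 or later. The helper name and the deliberately low limits are my own choices for illustration, and the actual outcome depends on the PHP and PCRE version, so no output is claimed here.

```php
<?php
// Assumption: deliberately low limits so the engine gives up early.
ini_set('pcre.backtrack_limit', '10000');
ini_set('pcre.recursion_limit', '1000');

/** Hypothetical helper: run preg_replace and report the PCRE error, if any. */
function replace_with_status(string $pattern, string $subject): array
{
    $names = [
        PREG_NO_ERROR              => 'PREG_NO_ERROR',
        PREG_INTERNAL_ERROR        => 'PREG_INTERNAL_ERROR',
        PREG_BACKTRACK_LIMIT_ERROR => 'PREG_BACKTRACK_LIMIT_ERROR',
        PREG_RECURSION_LIMIT_ERROR => 'PREG_RECURSION_LIMIT_ERROR',
        PREG_BAD_UTF8_ERROR        => 'PREG_BAD_UTF8_ERROR',
        PREG_BAD_UTF8_OFFSET_ERROR => 'PREG_BAD_UTF8_OFFSET_ERROR',
        PREG_JIT_STACKLIMIT_ERROR  => 'PREG_JIT_STACKLIMIT_ERROR',
    ];

    // NULL signals that matching was aborted; a string is a normal result.
    $result = preg_replace($pattern, 'WORKED', $subject);
    $code   = preg_last_error();

    return ['result' => $result, 'error' => $names[$code] ?? "unknown ($code)"];
}

// The expression and test string from the question.
var_export(replace_with_status('/(((?!a).)*)*a\{/u', 'AAAAAAAAAAAAAA{AA'));
```

Checking preg_last_error() after a NULL result at least distinguishes "gave up" from "no match", which the bare return value of preg_replace does not.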

It is hard to give a definite answer on what makes the difference without knowing the source code of the functions mb_ereg_replace and preg_replace, but in my view it does not really matter.

The expression (((?!a).)*)*a\{ always results in heavy recursion, as Sam already reported in his first comment. More than 119,000 steps (= function calls) during a replace on a string of just 17 characters is a strong sign that something is wrong with the expression. The expression can be used to drive the function, or the entire application (the PHP interpreter), into abnormal error handling, but not for a real replace. So it is good for the developers of the PHP functions for testing error handling on an endless recursion, but not for a real replace operation.

The full regular expression as used in the referenced PHP sandbox:

(?<!<br>)(?<!\s)\s*(\((?:(?:(?!<br>|\(|\)).)*(?:\((?:(?!<br>|\(|\)).)*\))?)*?\))\s*(\{)

It is hard to analyze this search string in this form.

So let us look at the search string as if it were a code snippet, with indentation, to better understand the conditions and loops in this expression.

(?<!<br>)(?<!\s)\s*
(
   \(
   (?:
      (?:
         (?!<br>|\(|\)).
      )*
      (?:
         \(
         (?:
            (?!<br>|\(|\)).
         )*
         \)
      )?
   )*?
   \)
)
\s*
(\{)
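PCRE accepts a pattern in this kind of indented layout when the x modifier is used, because unescaped whitespace inside the pattern is then ignored. A minimal sketch, assuming `~` as the delimiter and using `if (this(that)) {` as sample input (the kind of text the expression is meant to find according to the comments):

```php
<?php
// The full expression, laid out with indentation; the x modifier makes
// PCRE ignore the whitespace used for layout.
$pattern = <<<'REGEX'
~
(?<!<br>)(?<!\s)\s*
(
   \(
   (?:
      (?: (?!<br>|\(|\)). )*
      (?: \( (?: (?!<br>|\(|\)). )* \) )?
   )*?
   \)
)
\s*
(\{)
~x
REGEX;

preg_match($pattern, 'if (this(that)) {', $m);

echo $m[1], "\n"; // (this(that))
echo $m[2], "\n"; // {
```

This only confirms that the layout is equivalent on an input that matches. The heavy backtracking discussed here shows up on inputs where the match fails.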

I hope it is now easier to see the recursion in this search string. The same block appears twice, not in sequence but nested, a classic recursion.

In addition, all the expressions before the final (\{), including the nested ones forming the recursion, match nearly any character and carry the multipliers * or ?, which mean that something may exist but need not. Apart from the literal parentheses of the outer group, the presence of { is the only real condition for the entire search string. Everything else is optional, and this is not good because of the recursion in this search string.

It is very bad for a recursive search expression if it is completely unclear where to start and where to stop selecting characters, as it results in an endless recursion until an abnormal exit.

Let me explain this problem with a simple expression like [A-Za-z]+([a-z]+)

1 or more letters in upper or lower case, followed by 1 or more lowercase letters (with case-sensitive search enabled). Simple, isn't it?

But the second character class defines a set of characters which is a subset of the set defined by the first character class. And this is not good.

What should be tagged by the expression in parentheses on a string like York?

ork, or rk, or just k, or even nothing at all, because the first character class can already match the entire word and nothing is left for the second character class?

The Perl regular expression library solved this common problem by declaring the multipliers * and + greedy by default, unless ? is used after a multiplier, which results in the opposite (lazy) matching behavior. These 2 additional rules already help with this problem.

Therefore the expression as used here marks only k. With [A-Za-z]+?([a-z]+) the string ork is marked, and with [A-Za-z]+?([a-z]+?) just the first o is marked.
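The three variants can be checked directly with preg_match; this is just a small verification sketch of the York example:

```php
<?php
$patterns = [
    '/[A-Za-z]+([a-z]+)/',   // greedy, then greedy
    '/[A-Za-z]+?([a-z]+)/',  // lazy, then greedy
    '/[A-Za-z]+?([a-z]+?)/', // lazy, then lazy
];

$captured = [];
foreach ($patterns as $pattern) {
    preg_match($pattern, 'York', $m);
    $captured[] = $m[1]; // text marked by the group in parentheses
}

echo implode(', ', $captured), "\n"; // k, ork, o
```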

And there is one more rule: favor a positive result over a negative result. This additional rule prevents the first character class from selecting the entire word York, which would leave nothing for the second one.

So the main problem with partly or completely overlapping sets of characters is solved.

But what happens if such an expression is put into a recursion and made even more complex by using lookahead / lookbehind and backtracking, with backtracking done not only by 1 character but by multiple characters?

Is it still clearly defined where to start and stop selecting characters for every expression part of the entire search string?

No, it is not.

With a search string for which there is no clear rule about which part of the searched string is selected by which part of the search expression, every result is more or less valid, including the unexpected ones.

In addition, because of the missing start/stop conditions, it can easily happen that the functions fail completely to apply the expression to a string and exit abnormally.

An abnormal exit on applying a search string is surely always an unexpected result for the person who used the search expression.

Different versions of a search function may return different results for an expression which drives the function into an abnormal exit. The developers of the search functions continuously change the code to better detect and handle search expressions that result in an endless recursion, as this is simply a security problem. A regular expression search that allocates more and more memory from the application's stack or from the machine's RAM is very problematic for the security, stability and availability of the entire machine on which the application is running. And PHP is used mainly on servers, which should not stop working because a recursive memory allocation occupies more and more of the server's RAM, as this would finally kill the entire server.

This is the reason why you get different results depending on the used PHP version.

I looked at your complete search expression for a long time and ran it several times on the example string. But honestly I could not find out what should be found and what should be ignored by the expression left of (\{).

I understand parts of the expression, but why is there a recursion in the search string at all?

In front of \s*, what is the purpose of the negative lookbehind (?<!\s)?

\s* matches 0 or more whitespace characters, so the purpose of the condition "previous character is not a whitespace" is not comprehensible to me. The negative lookbehind is simply useless in my view and just increases the complexity of the entire expression. And this is just the beginning.
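For reference, a quick check suggests that the second lookbehind does change the result once it is combined with (?<!<br>), which is the reason given in the comments below. The sketch uses a made-up input (a `<br>` tag, two spaces, a brace) and only the tail of the expression:

```php
<?php
// Hypothetical input: a line break tag, two spaces, an opening brace.
$subject = '<br>  {';

// Only the <br> lookbehind: the match simply starts one space later,
// at the second space, which is not directly preceded by "<br>".
preg_match('/(?<!<br>)\s*\{/', $subject, $m, PREG_OFFSET_CAPTURE);
$offsetWithout = $m[0][1];

// Both lookbehinds: every remaining start position is preceded by
// whitespace, so there is no match at all.
$matchedWith = preg_match('/(?<!<br>)(?<!\s)\s*\{/', $subject);

echo $offsetWithout, "\n"; // 5
echo $matchedWith, "\n";   // 0
```

So (?<!\s) is redundant in front of \s* on its own, but not when another lookbehind can reject the first whitespace position.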

I am quite sure that what you really want can be achieved with a much simpler expression, one without a recursion that results in abnormal function exits depending on the searched string, and with all or nearly all backtracking steps removed.

My mistake. It seems I used the wrong full example regex. I will update that when I can. However, the question is still about why mb_ereg_replace is able to work on some things that preg_replace with the u modifier does not. — Anonymous, May 28 '14 at 10:37
The actual regex and string pair has been added to the question as a reference. — Anonymous, May 28 '14 at 19:07
Thank you. This was very helpful. Just since you asked, I was trying to match something like `if (this) {`, `if (this(that)) {`, etc. And, the reason for `(?<!<br>)(?<!\s)\s*` was because `(?<!<br>)\s*` was matching something such as `<br>  ` considering the last space to be valid because it was a space and was not preceded by a `<br>`. Thus, it was not getting all the spaces beforehand. There were reasons for it to be so complicated, but I see why that wouldn't work too well now. — Anonymous, May 29 '14 at 19:01
The proper term by the way is not recursion, it's backtracking. And in this specific case, heavy backtracking == catastrophic backtracking. Recursion in regex is about `(?R)` and the like. — Jerry, May 29 '14 at 19:14
@Jerry, you are right with term `backtracking` versus term `recursion`. But backtracking is implemented in programming languages using recursive procedures. Take a look on pseudo code example at Wikipedia article about [backtracking](http://en.wikipedia.org/wiki/Backtracking). Parsing an expression is done always with recursive procedure calls independent on expression is a mathematical formula like (5+3)*10^2 or a regular search expression. — Mofi, May 30 '14 at 12:20
@Anonymous, I better understand now your intention. What about using the expression `(?<!<br>)(?<!\s)\s*(\((?:(?:[^);{](?!<br>))*)\))\s*(\{)` which is not perfect, but I think works quite well. — Mofi, May 30 '14 at 12:54
@Mofi Thanks, but that does not match `if (this(that)) {`. Don't worry though. Jerry gave a regex that does the trick: http://stackoverflow.com/questions/23861425/different-results-for-unicode-multibyte-modifier-and-mb-ereg-replace/23903723?noredirect=1#comment36877420_23861425 — Anonymous, May 30 '14 at 18:55
