
I am using preg_match to find and remove eval'ed, base64-encoded viruses within files.

The regex below:

/\s*eval\s*\(\s*base64_decode\s*\(\s*('[a-zA-Z0-9\+\/]*={0,2}'|"[a-zA-Z0-9\+\/]*={0,2}")\s*\)\s*\s*\)\s*(;)?\s*/

matches the following code:

eval(base64_decode("BASE64+ENCODED+VIRUS+HERE")); 

The above regex works fine.

I also wanted to match base64 strings that are word-wrapped via concatenation, so it should match the following as well: "BASE64+EN" . "CODED+VIRUS+HERE".

So I changed the regex into:

/\s*eval\s*\(\s*base64_decode\s*\(\s*\'([a-zA-Z0-9\+\/]*(\'\s*\.\s*\')?[a-zA-Z0-9\+\/]*)*={0,2}\'|"([a-zA-Z0-9\+\/]*("\s*\.\s*")?[a-zA-Z0-9\+\/]*)*={0,2}"\s*\)\s*\s*\)\s*(;)?\s*/

This finds only a partial match:

"BASE64+ENCODED+VIRUS+HERE"));

But when I try to apply the match to this entire file (http://pastebin.com/ED8sFUP0), the page dies with the browser message "The connection to the server was reset while the page was loading.".

I have error reporting activated:

error_reporting(E_ALL);
ini_set('display_errors', TRUE);
ini_set('scream.enabled', TRUE);

But nothing shows up, neither on the page nor in Apache's error logs.

The very same regex works as expected on files that do not contain the offending string: preg_match returns 0 rather than boolean false, meaning that there was no regex error and that it simply found no matches.
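The distinction above between `false` and `0` can be checked explicitly. The following is a minimal sketch, not something from the original post: `checked_preg_match` is a hypothetical helper name, and the tiny `pcre.backtrack_limit` plus the disabled JIT are assumptions chosen only so the demonstration fails quickly and predictably. When the PCRE engine gives up on its own, `preg_match` returns `false` and `preg_last_error()` reports the reason. If the PHP process itself crashes (which the "connection reset" message may hint at, though nothing in this post confirms it), no PHP code runs afterwards, so this check cannot catch that case.

```php
<?php
// Hypothetical helper (not from the original post): runs preg_match and
// reports whether the outcome was a match, a clean non-match, or an engine error.
function checked_preg_match(string $pattern, string $subject): string
{
    $result = preg_match($pattern, $subject);
    if ($result === false) {
        // false means the engine gave up; preg_last_error() says why.
        $names = [
            PREG_INTERNAL_ERROR        => 'PREG_INTERNAL_ERROR',
            PREG_BACKTRACK_LIMIT_ERROR => 'PREG_BACKTRACK_LIMIT_ERROR',
            PREG_RECURSION_LIMIT_ERROR => 'PREG_RECURSION_LIMIT_ERROR',
            PREG_BAD_UTF8_ERROR        => 'PREG_BAD_UTF8_ERROR',
            PREG_BAD_UTF8_OFFSET_ERROR => 'PREG_BAD_UTF8_OFFSET_ERROR',
        ];
        $code = preg_last_error();
        return 'error: ' . ($names[$code] ?? "code $code");
    }
    return $result === 1 ? 'match' : 'no match';
}

// Assumptions for a quick demonstration: interpreter matcher, very small limit.
ini_set('pcre.jit', '0');
ini_set('pcre.backtrack_limit', '1000');

// Nested quantifiers on a subject that cannot match: should hit the limit.
echo checked_preg_match('/^(a+)+$/', str_repeat('a', 30) . 'b'), "\n";
// An ordinary non-match: preg_match returns 0, not false.
echo checked_preg_match('/eval/', 'echo 1;'), "\n";
```

The comparison uses `=== false` on purpose, because `0` and `false` are both falsy and a loose check would merge the two cases.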

My concern is not necessarily why the regex finds only a partial match. That's probably some typo I made that happens to work.

I want to know when and how the regex compiler can fail and break the entire process chain:

apache > php > regex_compiler

I understand that it may very well be "because" of my regex, which just happens to compile correctly but not match correctly, and that this might cause something bad down the road. But my interest is in why the regex compiler fails with no error and how I can get the error message that it should be yielding.

Something similar is discussed but unresolved here: php preg_match_all kills page for unknown reason

Mihai Stancu
  • I [answered the question you linked](http://stackoverflow.com/a/10643701/626273). I think you have a similar problem, but I am still trying to understand your regex. – stema May 17 '12 at 21:23

2 Answers


edit:

 \s*
 eval \s*
 \( \s*
    base64_decode \s* 
    \( \s* 
        (?:
            (?>
               '
                 [a-zA-Z0-9+/]*
                 (?:
                    '
                      \s* \. \s*
                    '
                    [a-zA-Z0-9+/]*
                 )*
                 ={0,2}
               '
            )
          |
            (?>
               "
                 [a-zA-Z0-9+/]*
                 (?:
                    "
                      \s* \. \s*
                    "
                    [a-zA-Z0-9+/]*
                 )*
                 ={0,2}
               "
            )
        )
        \s*

    \)\s*

 \)\s* ;? \s*
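The expanded pattern above wraps the single-quoted and double-quoted forms in one `(?: ... | ... )` group. In the question's second regex the `|` sits at the top level, so it splits the whole pattern into two alternatives: everything up to the closing single quote, or everything from the opening double quote onward. That is why only `"BASE64+ENCODED+VIRUS+HERE"));` was matched. A minimal sketch of the difference follows; `$grouped` is a compacted variant written for this illustration (without the atomic groups), not the exact pattern shown above.

```php
<?php
$code = 'eval(base64_decode("BASE64+ENCODED+VIRUS+HERE"));';

// The question's second regex, verbatim: the | is at the top level.
$ungrouped = <<<'RE'
/\s*eval\s*\(\s*base64_decode\s*\(\s*\'([a-zA-Z0-9\+\/]*(\'\s*\.\s*\')?[a-zA-Z0-9\+\/]*)*={0,2}\'|"([a-zA-Z0-9\+\/]*("\s*\.\s*")?[a-zA-Z0-9\+\/]*)*={0,2}"\s*\)\s*\s*\)\s*(;)?\s*/
RE;

// Illustrative compact variant: both quoted forms sit inside one (?: ... | ... ).
$grouped = <<<'RE'
~\s*eval\s*\(\s*base64_decode\s*\(\s*(?:'[a-zA-Z0-9+/]*(?:'\s*\.\s*'[a-zA-Z0-9+/]*)*={0,2}'|"[a-zA-Z0-9+/]*(?:"\s*\.\s*"[a-zA-Z0-9+/]*)*={0,2}")\s*\)\s*\)\s*;?\s*~
RE;

preg_match($ungrouped, $code, $partial);
preg_match($grouped, $code, $full);

echo $partial[0], "\n"; // only the double-quoted tail of the statement
echo $full[0], "\n";    // the whole eval(...) statement
```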

How to handle "".'' catenation

You're not trying to parse the language (you couldn't do that with this), so you can
handle the "".'' catenation case with this very fast regex...

~
 \s*
 eval \s*
 \( \s*
    base64_decode
    \s* 
    \(
       \s* 
        ["']
        (?> [a-zA-Z0-9+/]* (?: ["']\s*\.\s*["'] [a-zA-Z0-9+/]* )* )
        ={0,2}
        ["']
       \s*
    \)
    \s*
 \)
 \s* ;? \s*

~x
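As a usage sketch, the pattern above can be dropped into a nowdoc string and handed to `preg_match` or `preg_replace`. The sample source in `$infected` and the choice to replace the match with a single newline are assumptions made for this illustration, and the pattern's whitespace has been compacted slightly (it is ignored under the `x` flag anyway).

```php
<?php
// The mixed-quote pattern from above, kept in extended (x) mode.
$pattern = <<<'RE'
~
 \s* eval \s* \( \s*
    base64_decode \s* \( \s*
        ["']
        (?> [a-zA-Z0-9+/]* (?: ["']\s*\.\s*["'] [a-zA-Z0-9+/]* )* )
        ={0,2}
        ["']
    \s* \) \s*
 \) \s* ;? \s*
~x
RE;

// Hypothetical infected source: a double-quoted chunk joined to a single-quoted one.
$infected = <<<'SRC'
echo 'ok';
eval(base64_decode("BASE64+EN" . 'CODED+VIRUS+HERE'));
echo 'still ok';
SRC;

$found = preg_match($pattern, $infected, $m);

// The pattern also consumes the whitespace around the statement,
// so put one newline back to keep the neighbouring lines apart.
$cleaned = preg_replace($pattern, "\n", $infected);

echo $cleaned, "\n";
```

Because `["']` is used for every quote position, the pattern also accepts mismatched pairs such as `"abc'`. That is not valid PHP, so for detection purposes the looseness is harmless.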
  • This is going to be fun, @stema fixed the gaping hole in performance, you probably fixed the one in accuracy. I haven't tested your solution yet. My code generates the regex based on a number of combinations of eval+base64+gzinflate/gzuncompress/bzdecompress+str_rot13 and it also takes into account strings hidden under ascii-hexcodes/unicode-hexcodes. All of which makes incorporating your solution difficult. Hence I'll be doing it in the morning. – Mihai Stancu May 17 '12 at 23:25
  • Since your code retains the performance issue and my question was "why is my regex crashing", I think I'll accept @stema's answer for "Catastrophic Backtracking". – Mihai Stancu May 17 '12 at 23:35
  • @Mihai Stancu - No problem glad you got it working. I would have thought going from `([a-zA-Z0-9\+\/]*(\'\s*\.\s*\')?[a-zA-Z0-9\+\/]*)*` to `[a-zA-Z0-9+/]*(?:'\s*\.\s*'[a-zA-Z0-9+/]*)*` would not cause too much backtracking on FAIL. I loaded up your file to 2.4 megabytes, inserted a '=' sign near the end (but invalid). It took 1/2 second to fail, yours just hung. So I've added atomic groupings and now it takes 1/4 second to fail. The code is in my edit. I've also posted your regex expanded and a problem. - Good luck! –  May 18 '12 at 01:54
  • One other thing, what's to stop catenation from using both forms `"".''` ? –  May 19 '12 at 00:03
  • We have two separate forms of quoted (and quote-split) base64 code. This one **"([a-zA-Z0-9\+\/]*("\s*\.\s*")?[a-zA-Z0-9\+\/]*)*={0,2}"** is for double quoted base64 code that can be split with double quotes. And there's another one that is single quoted and can be split with single quotes. – Mihai Stancu May 19 '12 at 10:56
  • I just re-read your question. Yeah, you're right it can be done. But a lot of things that can be done are not taken into consideration in my code yet. If I really wanted to do this right I'd write a PHP parser to find the "offending code" in the abstract syntax tree. But until then I'll incorporate what my imagination can. – Mihai Stancu May 19 '12 at 11:01
  • @Mihai Stancu - I imagine allowing for both catenations does not have to be in the scope of a language parser. I've included a how-to regex above. In reality, anything that returns a string to the catenation operator is allowed. This includes function calls. –  May 19 '12 at 17:53
  • What I meant is that some clever techniques to hide offending code cannot be found in the sourcecode regex, only at run-time. Such as creating 3 variables $a = 'eval'; $b = 'base64_decode'; $code = array(43, 44, 58, 93, ...); and then decoding the array like chr($code[$i]+53); and executing it like preg_replace('//e', "$a($b($code))", ''); – Mihai Stancu May 19 '12 at 22:14

I think your regex has too many possibilities to match ==> Catastrophic Backtracking.

/\s*eval\s*\(\s*base64_decode\s*\(\s*\'([a-zA-Z0-9\+\/]*(\'\s*\.\s*\')?[a-zA-Z0-9\+\/]*)*={0,2}\'|"([a-zA-Z0-9\+\/]*("\s*\.\s*")?[a-zA-Z0-9\+\/]*)*={0,2}"\s*\)\s*\s*\)\s*(;)?\s*/
                                       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

The regex needs a lot of steps to match the parts I marked ==> you have a performance problem: the regex just doesn't finish in time!

Since (\'\s*\.\s*\')? is optional, the regex needs a lot of steps to figure out what to match with the [a-zA-Z0-9\+\/]* before the optional part and the identical one after it.

What you can do is use possessive quantifiers (you make a quantifier possessive by adding a + after it). They prevent backtracking: a possessive quantifier does not give back a character once it has matched it. So, try this:

/\s*eval\s*\(\s*base64_decode\s*\(\s*\'([a-zA-Z0-9\+\/]*+(\'\s*\.\s*\')?[a-zA-Z0-9\+\/]*+)*={0,2}\'|"([a-zA-Z0-9\+\/]*+("\s*\.\s*")?[a-zA-Z0-9\+\/]*+)*={0,2}"\s*\)\s*\s*\)\s*(;)?\s*/
                                                       ^^                               ^^                           ^^                            ^^
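The effect of the possessive quantifiers can be observed on a small input. This is a sketch under stated assumptions: only the single-quoted branch of the regex is used, the subject is a made-up string that almost matches, and the JIT is switched off with a modest `pcre.backtrack_limit` so that the run is short and repeatable. With these settings the greedy version should exhaust the limit, while the possessive one should fail cleanly.

```php
<?php
// Assumptions for a quick, repeatable run: interpreter matcher, modest limit.
ini_set('pcre.jit', '0');
ini_set('pcre.backtrack_limit', '100000');

// Single-quoted branch of the question's regex, then the same with possessive *+.
$greedy     = "~'([a-zA-Z0-9+/]*('\s*\.\s*')?[a-zA-Z0-9+/]*)*={0,2}'~";
$possessive = "~'([a-zA-Z0-9+/]*+('\s*\.\s*')?[a-zA-Z0-9+/]*+)*={0,2}'~";

// A run of base64 characters followed by a character the pattern cannot accept.
$subject = "'" . str_repeat('A', 40) . "!'";

$greedyResult = preg_match($greedy, $subject);
$greedyError  = preg_last_error();

$possessiveResult = preg_match($possessive, $subject);
$possessiveError  = preg_last_error();

// true if the greedy pattern gave up because of the backtrack limit
var_dump($greedyResult === false && $greedyError === PREG_BACKTRACK_LIMIT_ERROR);
// true if the possessive pattern reported a clean "no match"
var_dump($possessiveResult === 0 && $possessiveError === PREG_NO_ERROR);
```

The greedy pattern can split the run of `A` characters between its two character classes and across loop iterations in a huge number of ways, and it tries them all before giving up. The possessive version takes the whole run once and never revisits it.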
stema
  • Brilliant mate, it solved the performance problem indeed. I guessed it had something to do with the 260k file of base64 encoded virus I was feeding it. I just didn't think about it in the "time" domain, I thought of it in the memory domain. – Mihai Stancu May 17 '12 at 23:18
  • I had thought of using [a-zA-z]*? for a lazy match (as a performance enhancement) but after I tested it and saw no change I forgot backtracking takes time! – Mihai Stancu May 17 '12 at 23:19