
I'm trying to match file contents against anti-virus signatures using PHP regular expressions, but I keep running into this error:

preg_match(): Compilation failed: regular expression is too large at offset 107

Patterns that fail typically look like this:

75633d617313134(?:..){0,27615}75626f756e687228756328692929

I've tried various modifications with the help of https://regex101.com/, but without success. I still get the same error when I reduce the pattern to simply:

(?:.){0,4000}

Can someone explain why? From what I've read on this forum, the limit should be ~65000. And why does it work if I change the quantifier to {0,}?

My server is running Apache with PHP 7.2.7. PCRE library version is 8.42 (pcre.backtrack_limit: 1000000, pcre.recursion_limit: 100000).
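For anyone who wants to reproduce this, here is a minimal sketch that prints the PCRE details and probes the two quantifier variants mentioned above. The helper name `pattern_compiles` is my own (hypothetical), and the result for each pattern depends on how the PCRE library on the host was built, so no output is claimed here.

```php
<?php
// Hypothetical helper: true if PCRE on this host can compile $pattern.
function pattern_compiles(string $pattern): bool
{
    // preg_match() returns false and raises a warning when compilation fails;
    // the @ operator only silences that warning for the probe.
    return @preg_match($pattern, '') !== false;
}

echo 'PCRE version: ', PCRE_VERSION, PHP_EOL;
echo 'pcre.backtrack_limit: ', ini_get('pcre.backtrack_limit'), PHP_EOL;
echo 'pcre.recursion_limit: ', ini_get('pcre.recursion_limit'), PHP_EOL;

// The simplified failing pattern and the unbounded variant from the question.
foreach (['/(?:.){0,4000}/', '/(?:.){0,}/'] as $p) {
    echo $p, ' => ', pattern_compiles($p) ? 'compiles' : 'fails to compile', PHP_EOL;
}
```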

The original patterns come from ClamAV's anti-virus database and are supposedly designed for the regex.c library. Getting them to work with PHP/PCRE requires a conversion, so manually rewriting each pattern is not possible. Recompiling PHP to increase the PCRE LINK_SIZE is not an option because this is shared web hosting.

Currently preg_replace is used with ~\{([0-9]+)-([0-9]+)\}~, replacing the match with (?:..){\1,\2}.
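As a sketch, the conversion looks roughly like this. The function name `convert_clamav_ranges` and the shortened sample signature are made up for illustration; the search and replacement patterns are the ones quoted above.

```php
<?php
// Hypothetical wrapper around the conversion described above.
function convert_clamav_ranges(string $signature): string
{
    // Turn a "{min-max}" byte range into a bounded repeat of two-character
    // groups, since each byte is written as two hex digits.
    $converted = preg_replace('~\{([0-9]+)-([0-9]+)\}~', '(?:..){\1,\2}', $signature);

    // preg_replace() returns null on error; fall back to the input in that case.
    return $converted ?? $signature;
}

// Illustrative (shortened) signature, not a real database entry.
echo convert_clamav_ranges('75633d61{0-27615}75626f75'), PHP_EOL;
// prints: 75633d61(?:..){0,27615}75626f75
```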

My original question was how PCRE could conclude that even the simplified pattern above is too large. Ultimately, though, the goal is to get the patterns changed or fixed so that they work for their intended purpose.

The post "Why am I being warned that my regular expression is too large?" explains parts of this, but does not fully identify the root cause or a solution.

  • Are you trying to capture (up to) 27615 named groups? – steffen Oct 06 '18 at 22:47
  • @steffen: `(?:...)` is a non-capturing group, not a named group. – Casimir et Hippolyte Oct 06 '18 at 22:54
  • @CasimiretHippolyte oh right. anyways.. – steffen Oct 06 '18 at 22:55
  • Please show us the exact pattern. – Casimir et Hippolyte Oct 06 '18 at 22:55
  • I should also mention that the original pattern comes from ClamAV, which I believe are made for the regex.c library. To get it working with pcre there is this conversion: `~\{([0-9]+)-([0-9]+)\}~` to: `(?:..){\1,\2}` – Joakim Tallinger Oct 06 '18 at 22:56
  • Exact pattern to test with: `if (!preg_match('/(?:75633d61727261792831332c313134(?:..){0,27615}75626f756e64287563293a7563733d7563732663687228756328692929)/i', 'abc')) { echo 1; die; }` – Joakim Tallinger Oct 06 '18 at 22:58
  • I think Blackhole linked question has the good explanation. To avoid the problem, you can play with power of 2 like this: `preg_match('/(?:75633d61727261792831332c313134(?:.{32}){0,1725}(?:..){0,15}75626f756e64287563293a7563733d7563732663687228756328692929)/i', 'abc')` *(`16*1725+15=27615`)* – Casimir et Hippolyte Oct 06 '18 at 23:56
  • @CasimiretHippolyte I've updated the main post with more details why this solution would be hard to implement. Mainly due to the huge amount of patterns (tens of thousands and new ones added on daily basis). What I don't understand is the difference between PCRE and ClamAV's engine (regex.c?) and why even quite simple statement like `(?:.){0,4000}` is failing. Is PCRE somehow internally mapping and re-structuring it and that allocates >65k characters? What is the difference if I simply replace it with {0,}, would I still catch the same findings? – Joakim Tallinger Oct 07 '18 at 00:05
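The chunking idea from Casimir et Hippolyte's comment could in principle be automated inside the conversion step instead of being applied by hand. Below is a rough sketch for the `{0,max}` case only. The function name `chunked_gap` and the extra alternation branch for maxima that do not end on a chunk boundary are my own assumptions, and I have not verified that the generated patterns stay under the compiled-size limit on every PCRE build.

```php
<?php
// Hypothetical helper: build a pattern matching 0..$max two-character pairs,
// using repeats of a wide chunk plus a short remainder instead of one large
// repeat of (?:..). $chunk is the number of pairs per chunk.
function chunked_gap(int $max, int $chunk = 16): string
{
    if ($max < $chunk) {
        return '(?:..){0,' . $max . '}';
    }

    $width = 2 * $chunk;            // characters per chunk
    $q = intdiv($max + 1, $chunk);  // full groups of $chunk consecutive counts
    $r = ($max + 1) % $chunk;       // counts left over after those groups

    // Covers 0 .. $chunk * $q - 1 pairs.
    $main = '(?:.{' . $width . '}){0,' . ($q - 1) . '}(?:..){0,' . ($chunk - 1) . '}';
    if ($r === 0) {
        return $main;
    }

    // Covers $chunk * $q .. $max pairs.
    $tail = '(?:.{' . $width . '}){' . $q . '}(?:..){0,' . ($r - 1) . '}';

    return '(?:' . $main . '|' . $tail . ')';
}

echo chunked_gap(27615), PHP_EOL;
// prints: (?:.{32}){0,1725}(?:..){0,15}
```

For 27615 this reproduces the pattern from the comment. A `preg_replace_callback` over the `{min-max}` ranges could call such a helper, but a non-zero minimum would need extra handling that is not shown here.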
