1

Can someone help me create a regular expression in PHP to parse the individual fields of an Akamai access log? The first line below lists the field names. Thanks!

#Fields: date time cs-ip cs-method cs-uri sc-status sc-bytes time-taken cs(Referer) cs(User-Agent) cs(Cookie) x-custom
2011-08-08  23:59:52    63.555.254.85   GET /somedomain/images/banner_320x50.jpg    200 10801   0   "http://somerefered.com"    "Mozilla/5.0 (iPhone; U; CPU iPhone OS 4_3_1 like Mac OS X; en-us) AppleWebKit/533.17.9 (KHTML, like Gecko) Mobile/8G4" "-" "-"
VinnyD

2 Answers

1

It looks like the fields are tab-delimited. If so, you don't need a regex; you can just do:

$fieldnames = array('date', 'time', 'cs-ip', 'cs-method', 'cs-uri', 'sc-status', 'sc-bytes', 'time-taken', 'cs(Referer)', 'cs(User-Agent)', 'cs(Cookie)', 'x-custom');

// $lines is assumed to hold the raw log lines, e.g. read with file() and FILE_IGNORE_NEW_LINES
$parsed = array();
foreach ($lines as $line) {
    $tmp = array(); // reset once per line, not once per field, so every field is kept
    $fields = explode("\t", $line);
    foreach ($fields as $index => $field) {
        $tmp[$fieldnames[$index]] = $field;
    }

    $parsed[] = $tmp;
}

$parsed is now a list of arrays, one per log line, each keyed by field name.
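
If a quoted field might ever contain a tab, a CSV parser is one way to stay regex-free while still respecting the quotes. This is only a minimal sketch: it assumes the log really is tab-delimited with double-quoted fields, it has not been checked against how Akamai escapes quotes inside a field, `parseLogLine` is a hypothetical helper name, and the sample line (with a tab inside the user agent) is made up for illustration.

```php
<?php
// Hypothetical helper: split one tab-delimited log line, honouring double quotes.
function parseLogLine(string $line, array $fieldnames): ?array
{
    // str_getcsv() keeps a tab that sits inside "..." as part of the field.
    // The empty escape argument disables PHP's own escape handling (PHP 7.4+).
    $fields = str_getcsv(rtrim($line, "\r\n"), "\t", '"', '');
    if (count($fields) !== count($fieldnames)) {
        return null; // unexpected column count; the caller decides what to do
    }
    return array_combine($fieldnames, $fields);
}

$fieldnames = array('date', 'time', 'cs-ip', 'cs-method', 'cs-uri', 'sc-status',
    'sc-bytes', 'time-taken', 'cs(Referer)', 'cs(User-Agent)', 'cs(Cookie)', 'x-custom');

// Made-up test line: note the tab inside the quoted user agent.
$line = "2011-08-08\t23:59:52\t63.555.254.85\tGET\t/somedomain/images/banner_320x50.jpg"
    . "\t200\t10801\t0\t\"http://somerefered.com\"\t\"Mozilla/5.0\t(made-up agent)\"\t\"-\"\t\"-\"";

$row = parseLogLine($line, $fieldnames);
print_r($row);
```

Returning null on a column-count mismatch makes malformed lines visible instead of producing a half-filled array.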

PeeHaa
  • The OP asked for a regex-based solution. Also, this will break if one of the quoted fields contains a tab character. – Asaph Aug 12 '11 at 20:59
  • @Asaph: you're right that OP asked for regex. However considering your rep you will also know that many people here on SO will ask for a regex pattern for something when it's not needed / wanted. Also there won't be any tab characters in the quoted fields. Although if the file isn't tab delimited I will bow my head in shame, delete my answer, apologize for wasting everyone's time and open another beer :-) – PeeHaa Aug 12 '11 at 21:09
  • It's true that regexes are often abused. This case is borderline. I could go either way on it. The log line does look tab delimited. I'm not convinced that the quoted fields will never contain tabs. Many of them are supplied by the client via HTTP headers. What if someone makes a request to the web server with a user agent string that has a tab in it? What will appear in the log file? – Asaph Aug 12 '11 at 21:49
  • thanks, I agree that this is a borderline case and would feel more confident with a properly written regex than using explode. – VinnyD Aug 12 '11 at 22:16
  • @Asaph: Hmmm... you may have a valid point (interesting at least). Although I wonder if that is even allowed / will work (adding a tab character in the user-agent string). rfc1945 isn't really clear on the topic. Will do some tests tomorrow. – PeeHaa Aug 12 '11 at 22:20
1

Here is a quick little test program I just wrote:

<?php
// Fields: date time cs-ip cs-method cs-uri sc-status sc-bytes time-taken cs(Referer) cs(User-Agent) cs(Cookie) x-custom
$logLine = '2011-08-08  23:59:52    63.555.254.85   GET /somedomain/images/banner_320x50.jpg    200 10801   0   "http://somerefered.com"    "Mozilla/5.0 (iPhone; U; CPU iPhone OS 4_3_1 like Mac OS X; en-us) AppleWebKit/533.17.9 (KHTML, like Gecko) Mobile/8G4" "-" "-"';
$regex = '/^(\d{4}-\d{2}-\d{2})\s+(\d{2}:\d{2}:\d{2})\s+(\d{1,3}(?:\.\d{1,3}){3})\s+([A-Za-z]+)\s+(\S+)\s+(\d{3})\s+(\d+)\s+(\d+)\s+"([^"]*)"\s+"([^"]*)"\s+"([^"]*)"\s+"([^"]*)"$/';

$matches = array();
if (preg_match($regex, $logLine, $matches)) {
    $logParts = array(
        'date' => $matches[1],
        'time' => $matches[2],
        'cs-ip' => $matches[3],
        'cs-method' => $matches[4],
        'cs-uri' => $matches[5],
        'sc-status' => $matches[6],
        'sc-bytes' => $matches[7],
        'time-taken' => $matches[8],
        'cs(Referer)' => $matches[9],
        'cs(User-Agent)' => $matches[10],
        'cs(Cookie)' => $matches[11],
        'x-custom' => $matches[12]
    );
    print_r($logParts);
}
?>

This outputs:

Array
(
    [date] => 2011-08-08
    [time] => 23:59:52
    [cs-ip] => 63.555.254.85
    [cs-method] => GET
    [cs-uri] => /somedomain/images/banner_320x50.jpg
    [sc-status] => 200
    [sc-bytes] => 10801
    [time-taken] => 0
    [cs(Referer)] => http://somerefered.com
    [cs(User-Agent)] => Mozilla/5.0 (iPhone; U; CPU iPhone OS 4_3_1 like Mac OS X; en-us) AppleWebKit/533.17.9 (KHTML, like Gecko) Mobile/8G4
    [cs(Cookie)] => -
    [x-custom] => -
)
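
To run the same pattern over a whole log rather than a single line, one option is to skip the `#Fields:` header and record the lines that fail to match instead of dropping them silently. A rough sketch under those assumptions; `parseAkamaiLog` is a hypothetical name, and the "garbage" line is invented to show the failure path:

```php
<?php
// Hypothetical helper: apply the pattern to every line of a log.
function parseAkamaiLog(array $lines): array
{
    $fieldnames = array('date', 'time', 'cs-ip', 'cs-method', 'cs-uri', 'sc-status',
        'sc-bytes', 'time-taken', 'cs(Referer)', 'cs(User-Agent)', 'cs(Cookie)', 'x-custom');
    $regex = '/^(\d{4}-\d{2}-\d{2})\s+(\d{2}:\d{2}:\d{2})\s+(\d{1,3}(?:\.\d{1,3}){3})\s+([A-Za-z]+)\s+(\S+)\s+(\d{3})\s+(\d+)\s+(\d+)\s+"([^"]*)"\s+"([^"]*)"\s+"([^"]*)"\s+"([^"]*)"$/';

    $parsed = array();
    $failed = array();
    foreach ($lines as $number => $line) {
        $line = rtrim($line, "\r\n");
        if ($line === '' || $line[0] === '#') {
            continue; // blank line or a "#Fields:" style header
        }
        if (preg_match($regex, $line, $matches)) {
            // $matches[0] is the whole line; the 12 captures follow in field order.
            $parsed[] = array_combine($fieldnames, array_slice($matches, 1));
        } else {
            $failed[] = $number; // remember which lines did not fit the pattern
        }
    }
    return array('parsed' => $parsed, 'failed' => $failed);
}

$lines = array(
    '#Fields: date time cs-ip cs-method cs-uri sc-status sc-bytes time-taken cs(Referer) cs(User-Agent) cs(Cookie) x-custom',
    '2011-08-08  23:59:52    63.555.254.85   GET /somedomain/images/banner_320x50.jpg    200 10801   0   "http://somerefered.com"    "Mozilla/5.0 (iPhone; U; CPU iPhone OS 4_3_1 like Mac OS X; en-us) AppleWebKit/533.17.9 (KHTML, like Gecko) Mobile/8G4" "-" "-"',
    'garbage line that does not match',
);

$result = parseAkamaiLog($lines);
print_r($result);
```

For a real file, the array could come from `file()` on the log path. The indexes in `failed` are zero-based positions in the input array, so they can be mapped back to the offending lines.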
Asaph