I have about 30,000 Apache access logs, some of which list multiple client IP addresses. This happened because Apache was logging the X-Forwarded-For header instead of the client's IP address, a change we made when we recently added haproxy in front of the web servers.

Going forward, we will be using rpaf for Apache to log only 1 IP address, i.e. that of the incoming connection to haproxy, so this will not be an ongoing problem.

Which brings me to the actual question:

How can I process the existing logs with multiple IP addresses to extract only the one that I want? I am assuming I'd need sed or something similar, but I'm more of a Windows guy, so I'm not 100% sure.

The rules are:

  • If there's only 1 IP, the line is not modified.
  • If there are 2 or more IPs, I only want to keep the second-to-last IP. They are comma-separated.

Example 1, 1 IP

Input: 10.1.1.1 - - [29/Jan/2010:11:00:00] .... (rest of log line)

Output: 10.1.1.1 - - [29/Jan/2010:11:00:00] .... (rest of log line)

Example 2, 2 IPs

Input: 10.1.1.1, 10.2.2.2 - - [29/Jan/2010:11:00:00] .... (rest of log line)

Output: 10.1.1.1 - - [29/Jan/2010:11:00:00] .... (rest of log line)

Example 3, 3 IPs

Input: 10.1.1.1, 10.2.2.2, 10.3.3.3 - - [29/Jan/2010:11:00:00] .... (rest of log line)

Output: 10.2.2.2 - - [29/Jan/2010:11:00:00] .... (rest of log line)

ThatGraemeGuy

2 Answers


This could be achieved by running this sed command on your logs (add the log file names at the end of the command):

sed -i "s/^\([0-9]\+\.[0-9]\+\.[0-9]\+\.[0-9]\+, \)*\([0-9]\+\.[0-9]\+\.[0-9]\+\.[0-9]\+\), [0-9]\+\.[0-9]\+\.[0-9]\+\.[0-9]\+ - -/\2 - -/"

Some explanations:

  • The general format is s/MATCH PATTERN/REPLACE PATTERN/
  • The match is done on the string "some IP, " (0 to many times) followed by "some IP, " (this is the one we want to keep) and finally "some IP - - " (the last IP to discard)
  • There is no need to match the first format of line (only one IP) since it doesn't need changing.
  • The last section contains \2, which refers to the second bracketed group in the match pattern (the IP we want to keep).
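  • In sed's default (basic) regular expression syntax, several characters only get their special meaning when escaped with a backslash (\): the grouping brackets \( and \), and the plus \+ (which means "at least once", and is a GNU sed extension). The literal period is also escaped, as \. (otherwise it is treated as a wildcard).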
  • The -i option to sed means to change the files in place. Make sure you work on copies!
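Before touching any files, it may help to preview the substitution. The following is only a sketch: it wraps the same expression in a shell function (the name keep_second_to_last_ip is made up for illustration) that reads standard input and prints the result, and feeds it the three example lines from the question. It assumes GNU sed, since \+ is not part of POSIX basic regular expressions.

```shell
#!/bin/sh
# Hypothetical helper, not part of the original answer: applies the same
# substitution to standard input and prints the result, so no file is changed.
# Assumes GNU sed (\+ is a GNU extension in basic regular expressions).
keep_second_to_last_ip() {
    sed 's/^\([0-9]\+\.[0-9]\+\.[0-9]\+\.[0-9]\+, \)*\([0-9]\+\.[0-9]\+\.[0-9]\+\.[0-9]\+\), [0-9]\+\.[0-9]\+\.[0-9]\+\.[0-9]\+ - -/\2 - -/'
}

# Feed the three example lines from the question through the filter.
keep_second_to_last_ip <<'EOF'
10.1.1.1 - - [29/Jan/2010:11:00:00] .... (rest of log line)
10.1.1.1, 10.2.2.2 - - [29/Jan/2010:11:00:00] .... (rest of log line)
10.1.1.1, 10.2.2.2, 10.3.3.3 - - [29/Jan/2010:11:00:00] .... (rest of log line)
EOF
```

The three output lines should start with 10.1.1.1, 10.1.1.1 and 10.2.2.2 respectively, matching the expected output in the question. Once the preview looks right, GNU sed's -i.bak form (in-place edit that keeps a .bak copy of each file) is one way to follow the "work on copies" advice.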
Jonathan Clarke
  • you have a small typo: `sed -i "s/^\([0-9]\+\.[0-9]\+\.[0-9]\+\.[0-9]\+, )*\` should be `sed -i "s/^\([0-9]\+\.[0-9]\+\.[0-9]\+\.[0-9]\+, \)*\`. it is a missing backslash at the last bracket. please correct it, i can't edit your post. – Christian Jan 29 '10 at 10:24
  • Huh. Don't know how that happened, I just copied and pasted it from my terminal. It's fixed now. Well spotted Christian, thanks! – Jonathan Clarke Jan 29 '10 at 10:33
  • It makes my eyes bleed almost as much as Perl, but it works. ;-) Thanks! – ThatGraemeGuy Jan 29 '10 at 12:36

"It makes my eyes bleed almost as much as Perl, but it works."

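use strict;
use warnings;
use Regexp::Common qw /net/;

while (my $line = <>) {
    # Split the line into the leading IP list and the rest, starting at " - -".
    my ($ipList, $restOfLine) = $line =~ /^(.*?)( - -.*)$/s;

    # Collect the IPv4 addresses from the leading field only, so addresses
    # that appear later in the line (e.g. in a URL) are ignored.
    my @ips = defined $ipList ? $ipList =~ /($RE{net}{IPv4})/g : ();

    # Lines without a recognisable IP list are printed unchanged.
    unless (@ips) {
        print $line;
        next;
    }

    # With two or more IPs keep the second-to-last one, otherwise the only one.
    my $ip = @ips > 1 ? $ips[-2] : $ips[0];
    print $ip . $restOfLine;
}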

Makes my eyes bleed less, but maybe that is just me :-)

Kyle Brandt