I have HTTP request and reply header data in tab-delimited form, with each GET/POST and each reply on its own line. There can be multiple GET, POST and REPLY lines for one TCP flow. I need to choose only the first valid GET - REPLY pair in these cases. An example (simplified) is:

ID       Source    Dest    Bytes   Type   Content-Length  host               lines.... 
1         A         B       10     GET        NA          yahoo.com            2
1         A         B       10     REPLY      10          NA                   2 
2         C         D       40     GET        NA          google.com           4
2         C         D       40     REPLY      20          NA                   4
2         C         D       40     GET        NA          google.com           4
2         C         D       40     REPLY      30          NA                   4
3         A         B       250    POST       NA          mail.yahoo.com       5
3         A         B       250    REPLY      NA          NA                   5
3         A         B       250    REPLY      15          NA                   5
3         A         B       250    GET        NA          yimg.com             5
3         A         B       250    REPLY      35          NA                   5
4         G         H       415    REPLY      10          NA                   6
4         G         H       415    POST       NA          facebook.com         6
4         G         H       415    REPLY      NA          NA                   6
4         G         H       415    REPLY      NA          NA                   6
4         G         H       415    GET        NA          photos.facebook.com  6
4         G         H       415    REPLY      50          NA                   6

....

So, basically I need to get one request-reply pair for each ID and write them to a new file.

For '1' it is just one pair, so it is easy. But there are also invalid cases where both lines are requests (GET or POST) or both are REPLY; such cases are ignored.

For '2', I would choose the first GET - REPLY pair.

For '3', I would choose the first request (the POST) but the second REPLY, as the Content-Length is absent in the first (making the subsequent REPLY a better candidate).

For '4', I would choose the first POST (or GET), as the first header cannot be a REPLY. I would not choose the REPLY after the second request (the GET), even though the Content-Length is missing in the replies after the POST, because that REPLY belongs to the GET. So I would just choose the first REPLY after the POST.

So, after choosing the best request and reply pair, I need to pair them up in a single line. For the example, the output would be:

 ID       Source    Dest    Bytes   Type   Content-Length  host         .... 
   1         A         B       10     GET      10          yahoo.com
   2         C         D       40     GET      20          google.com
   3         A         B       250    POST     15          mail.yahoo.com
   4         G         H       415    POST     NA          facebook.com

There are a lot of other headers in the actual data but this example pretty much shows what I need. How would one do this in Perl? I pretty much am stuck in the beginning so I have only been able to read the file one line at a time.

open F, "<", "file.txt" or die "Cannot open file.txt: $!";

  while (<F>) {
    chomp;
    my @line = split /\t/;


      # get the valid pairs for cases with multiple request - replies


      # get the paired up data together

  }
  close (F);

*Edit: I have added an additional column giving the number of HTTP header lines for each ID. This may help to know how many subsequent lines to check. Also, I modified ID '4' so that the first header line is a REPLY. *

sfactor
  • +1 for the detailed explanation of what's needed. Thank you! – Jonathan Leffler Apr 29 '12 at 15:49
  • Is the ID sufficient to identify the group of lines to be processed? If so, then within the ID, can we assume that source and destination are the same? – Jonathan Leffler Apr 29 '12 at 15:51
  • @JonathanLeffler Yes, that is sufficient as it represents one TCP flow with same source and destination, ports etc. So, I need to make one request-reply pair for each ID as shown. – sfactor Apr 29 '12 at 15:54
  • Can we assume that the lines for a single ID are consecutive? i.e. when the ID changes from 1 to 2, does that guarantee there are no more lines with ID 1? – cjm Apr 29 '12 at 16:45
  • @cjm yes, all the lines with the same ID are grouped together. – sfactor Apr 29 '12 at 16:52
  • @sfactor: please explain *"I would not choose the REPLY after the second GET ... as the REPLY comes after that"*. It seems to me that a REPLY always comes after a GET! – Borodin Apr 29 '12 at 17:54
  • @sfactor: to summarize the REPLY to choose. *A reply with a content length always beats one without. After that, an earlier reply beats a later one.* Is that right? If so I still don't understand choosing the first reply instead of the one after the GET. – Borodin Apr 29 '12 at 18:02
  • @Borodin The thinking behind that statement is that I try to choose a REPLY with content length as far as possible. But the REPLY after the second GET would belong to that second GET not the first one, wouldn't it? The two consecutive REPLIES might be for the same GET? Unfortunately there is no status code to tell if the request was fulfilled or not. – sfactor Apr 29 '12 at 18:31
  • @sfactor: yes, I guess the last REPLY belongs to the GET and not the POST. But I would expect you to want that last GET/REPLY pair because the REPLY has a length, rather than the first POST request together with a REPLY without a length. Do you *always* want the *first* request for an ID together with its 'best' REPLY? – Borodin Apr 29 '12 at 22:47
  • @Borodin yes the idea is to get the first GET and the corresponding reply. – sfactor May 01 '12 at 21:55
  • @sfactor: please explain what you mean by improving *the HTTP request-reply logic* – Borodin May 02 '12 at 14:35

2 Answers

The program below does what I think you need.

It is commented and I think it is fairly legible. Please ask if anything is unclear.

use strict;
use warnings;

use List::Util 'max';

my $file = $ARGV[0] // 'file.txt';
open my $fh, '<', $file or die qq(Unable to open "$file" for reading: $!);

# Read the field names from the first line to index the hashes
# Remember where the data in the file starts so we can get back here
#
my @fields = split ' ', <$fh>;
my $start = tell $fh;

# Build a format to print the accumulated data
# Create a hash that relates column headers to their widths
#
my @headers = qw/ ID Source Dest Bytes Type Content-Length host /;
my %len = map { $_ => length } @headers;

# Read through the file to find the maximum data width for each column
#
while (<$fh>) {
  my %data;
  @data{@fields} = split;
  next unless $data{ID} =~ /^\d/;
  $len{$_} = max($len{$_}, length $data{$_}) for @headers;
}

# Build a format string using the values calculated
#
my $format = join '   ', map sprintf('%%%ds', $_), @len{@headers};
$format .= "\n";

# Go back to the start of the data
# Print the column headers
#
seek $fh, $start, 0;
printf $format, @headers;

# Build transaction data hashes into $record and print them
# Ignore any events before the first request
# Ignore the second request and anything after it
# Update the stored Content-Length field if a value other than NA appears
#
my $record;
my $nreq = 0;

while (<$fh>) {

  my %data;
  @data{@fields} = split;
  my ($id, $type) = @data{ qw/ ID Type / };
  next unless $id =~ /^\d/;

  if ($record and $id ne $record->{ID}) {
    printf $format, @{$record}{@headers};
    undef $record;
    $nreq = 0;
  }

  if ($type eq 'GET' or $type eq 'POST') {
    $record = \%data if $nreq == 0;
    $nreq++;
  }
  elsif ($nreq == 1) {
    if ($record->{'Content-Length'} eq 'NA' and $data{'Content-Length'} ne 'NA') {
      $record->{'Content-Length'} = $data{'Content-Length'};
    }
  }
}

printf $format, @{$record}{@headers} if $record;

output

With the data given in the question, this program produces

ID   Source   Dest   Bytes    Type   Content-Length                  host
 1        A      B      10     GET               10             yahoo.com
 2        C      D      40     GET               20            google.com
 3        A      B     250    POST               15        mail.yahoo.com
 4        G      H     415    POST               NA          facebook.com
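
The selection rule the program applies (take the first request for an ID, ignore anything before it, ignore everything from the second request onwards, and let the first reply with a real Content-Length fill in the value) can also be separated from the file handling. The following is only a minimal sketch of that rule, not part of the program above; the name `pick_pair` and the array-of-hashes input are assumptions made for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: given the records of ONE ID (array ref of hash refs,
# in file order), return a copy of the first request carrying the best
# Content-Length found among its replies, or undef if there is no request.
sub pick_pair {
    my ($records) = @_;
    my $picked;

    for my $rec (@$records) {
        if ($rec->{Type} eq 'GET' or $rec->{Type} eq 'POST') {
            last if $picked;        # a second request ends the first exchange
            $picked = { %$rec };    # shallow copy, so the input is not modified
        }
        elsif ($picked
               and $picked->{'Content-Length'} eq 'NA'
               and $rec->{'Content-Length'} ne 'NA') {
            # The first reply that carries a length supplies it
            $picked->{'Content-Length'} = $rec->{'Content-Length'};
        }
    }
    return $picked;
}

# ID 3 from the question: POST, REPLY (NA), REPLY (15), GET, REPLY (35)
my @id3 = map +{ Type => $_->[0], 'Content-Length' => $_->[1], host => $_->[2] },
    [ POST  => 'NA', 'mail.yahoo.com' ],
    [ REPLY => 'NA', 'NA' ],
    [ REPLY => '15', 'NA' ],
    [ GET   => 'NA', 'yimg.com' ],
    [ REPLY => '35', 'NA' ];

my $best = pick_pair(\@id3);
printf "%s %s %s\n", @{$best}{ qw/ Type Content-Length host / };
# prints "POST 15 mail.yahoo.com"
```

Keeping the rule in one function like this makes it straightforward to check each of the four cases from the question on its own.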
Borodin

This seems to work on the given data:

#!/usr/bin/env perl
use strict;
use warnings;

# Shape of input records
use constant ID       => 0;
use constant Source   => 1;
use constant Dest     => 2;
use constant Bytes    => 3;
use constant Type     => 4;
use constant Length   => 5;
use constant Host     => 6;

use constant fmt_head => "%-6s  %-6s  %-6s  %-6s  %-6s  %-6s  %s\n";
use constant fmt_data => "%-6d  %-6s  %-6s  % 6d  %-6s  % 6s  %s\n";

printf fmt_head, "ID", "Source", "Dest", "Bytes", "Type", "Length", "Host";

my @post_get;
my @reply;
my $lastid = -1;
my $pg_count = 0;

sub print_data
{
    # Final validity checking
    if ($lastid != -1)
    {
        printf fmt_data, $post_get[ID], $post_get[Source],
               $post_get[Dest], $post_get[Bytes], $post_get[Type], $reply[Length], $post_get[Host];
        # Reset arrays;
        @post_get = ();
        @reply = ();
        $pg_count = 0;
    }
}

while (<>)
{
    chomp;
    my @record = split;
    # Validate record here (number of fields, etc)
    # Detect change in ID
    print_data if ($record[ID] != $lastid);
    $lastid = $record[ID];

    if ($record[Type] eq "REPLY")
    {
        # Discard REPLY if there wasn't already a POST/GET
        next unless defined $post_get[ID];
        # Discard REPLY if there was a second POST/GET
        next if $pg_count > 1;
        @reply = @record if !defined $reply[ID];
        $reply[Length] = $record[Length]
                         if $reply[Length] eq "NA" && $record[Length] ne "NA";
    }
    else
    {
        $pg_count++;
        @post_get = @record if !defined $post_get[ID];
        $post_get[Length] = $record[Length]
                            if $post_get[Length] eq "NA" && $record[Length] ne "NA";
    }
}
print_data;

It produces:

ID   Source   Dest   Bytes   Type   Content-Length             host
 1        A      B      10    GET               10        yahoo.com
 2        C      D      40    GET               20       google.com
 3        A      B     250   POST               15   mail.yahoo.com
 4        G      H     415   POST               NA     facebook.com

The main deviation from the question is the substitution of 'Length' for 'Content-Length'; the fix is easy enough if desired: change the 6th length in the fmt_data and fmt_head to length 14, and change "Length" to "Content-Length".
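
As a sketch of that change, the two format constants and the header call might become the following. Only the sixth field width and the column title differ from the program above; the values passed to the second `printf` are simply the first row of the question's expected output, used here for illustration.

```perl
use strict;
use warnings;

# Sixth column widened from 6 to 14 so that "Content-Length" fits
use constant fmt_head => "%-6s  %-6s  %-6s  %-6s  %-6s  %-14s  %s\n";
use constant fmt_data => "%-6d  %-6s  %-6s  % 6d  %-6s  % 14s  %s\n";

# Header line with the full column title
printf fmt_head, "ID", "Source", "Dest", "Bytes", "Type", "Content-Length", "Host";

# One sample data row (ID 1 from the question)
printf fmt_data, 1, "A", "B", 10, "GET", 10, "yahoo.com";
```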

Jonathan Leffler
  • Using global variables in `print_data` and relying on it to reset those globals might not be the best idea. Use references instead, and clear the arrays in the main loop. Also, `chomp` is not required with a split on whitespace. However, respecting the tab-delimited format and using `chomp` + `split /\t/` would be the better option, IMO. – TLP Apr 29 '12 at 17:14
  • Also, using an array slice `printf fmt_data, @post_get[ID, Source, Dest, Bytes, Type], $reply[Length], $post_get[Host]` is a bit more readable. – TLP Apr 29 '12 at 17:19
  • @Jonathan Leffler: it seems perverse to use an array indexed by what is effectively an `enum` instead of a simple hash. – Borodin Apr 29 '12 at 18:11
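
The suggestions in these comments (`chomp` plus `split /\t/`, slices, and a hash keyed by column name rather than constant indexes) could be combined roughly as follows. This is a sketch only; the helper name `parse_line` is an assumption, and the column list is the simplified one from the question rather than the full set of headers in the real data.

```perl
use strict;
use warnings;

# Column names as in the question's simplified example
my @fields = qw/ ID Source Dest Bytes Type Content-Length host /;

# Hypothetical helper: turn one tab-delimited line into a hash ref
sub parse_line {
    my ($line) = @_;
    chomp $line;
    my %rec;
    @rec{@fields} = split /\t/, $line;    # hash slice assignment
    return \%rec;
}

my $rec = parse_line("1\tA\tB\t10\tGET\tNA\tyahoo.com\n");

# A hash slice picks out several columns at once
print join(' ', @{$rec}{ qw/ ID Type host / }), "\n";    # prints "1 GET yahoo.com"
```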