I have HTTP request and reply header data in tab-delimited form, with each GET/POST and each reply on a separate line. A single TCP flow can contain multiple GET, POST and REPLY lines. I need to choose only the first valid request - REPLY pair for each flow. A simplified example:
ID Source Dest Bytes Type Content-Length host lines....
1 A B 10 GET NA yahoo.com 2
1 A B 10 REPLY 10 NA 2
2 C D 40 GET NA google.com 4
2 C D 40 REPLY 20 NA 4
2 C D 40 GET NA google.com 4
2 C D 40 REPLY 30 NA 4
3 A B 250 POST NA mail.yahoo.com 5
3 A B 250 REPLY NA NA 5
3 A B 250 REPLY 15 NA 5
3 A B 250 GET NA yimg.com 5
3 A B 250 REPLY 35 NA 5
4 G H 415 REPLY 10 NA 6
4 G H 415 POST NA facebook.com 6
4 G H 415 REPLY NA NA 6
4 G H 415 REPLY NA NA 6
4 G H 415 GET NA photos.facebook.com 6
4 G H 415 REPLY 50 NA 6
....
So, basically I need to get one request-reply pair for each ID and write them to a new file.
For '1' there is just one pair, so it is easy. But there are also false cases where both lines of an ID are requests (GET or POST) or both are REPLY. Such cases should be ignored.
For '2', I would choose the first GET - REPLY pair.
For '3', I would choose the first request (the POST) but the second REPLY, as the Content-Length is absent in the first (making the subsequent REPLY a better candidate).
For '4', I would skip the leading REPLY, as the first header cannot be a REPLY, and choose the first request (the POST). I would not choose the REPLY after the second request (the GET), even though the Content-Length is missing in the ones after the POST, because that REPLY belongs to the later request. So I would just choose the first REPLY after the POST.
So, after choosing the best request and reply pair, I need to pair them up in a single line. For the example, the output would be:
ID Source Dest Bytes Type Content-Length host ....
1 A B 10 GET 10 yahoo.com
2 C D 40 GET 20 google.com
3 A B 250 POST 15 mail.yahoo.com
4 G H 415 POST NA facebook.com
There are a lot of other headers in the actual data, but this example pretty much shows what I need. How would one do this in Perl? I am pretty much stuck at the beginning, so I have only been able to read the file one line at a time:
open(my $fh, "<", "file.txt") or die "Cannot open file.txt: $!";
while (<$fh>) {
    chomp;
    my @line = split /\t/;
    # get the valid pairs for cases with multiple request - replies
    # get the paired up data together
}
close($fh);
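To make the selection rules above concrete, here is a minimal sketch of how I imagine the pairing could work. It is not a tested solution. It assumes the column order from the example (index 4 is Type, 5 is Content-Length, 6 is host), that "NA" marks a missing value, and that all lines are first collected as array refs. The name `pick_pairs` is hypothetical.

```perl
use strict;
use warnings;

# Hypothetical helper: takes tab-split rows (array refs) in file order and
# returns one merged "request + reply" line per ID.
sub pick_pairs {
    my @rows = @_;

    # Group rows by ID, remembering the order in which IDs first appear.
    my (%by_id, @order);
    for my $row (@rows) {
        my $id = $row->[0];
        push @order, $id unless exists $by_id{$id};
        push @{ $by_id{$id} }, $row;
    }

    my @out;
    for my $id (@order) {
        my ($request, @replies);
        for my $row (@{ $by_id{$id} }) {
            if ($row->[4] eq 'REPLY') {
                # Replies seen before any request are ignored.
                push @replies, $row if $request;
            }
            elsif ($request) {
                last;    # a second request ends the first exchange
            }
            else {
                $request = $row;    # first GET or POST for this ID
            }
        }

        # Skip "false" flows that have no request or no reply after it.
        next unless $request && @replies;

        # Prefer the first reply with a Content-Length, else the first reply.
        my ($reply) = grep { $_->[5] ne 'NA' } @replies;
        $reply = $replies[0] unless defined $reply;

        # Request columns, with Content-Length taken from the chosen reply.
        push @out, join "\t", @{$request}[0 .. 4], $reply->[5], $request->[6];
    }
    return @out;
}
```

In the reading loop, each data line would be stored with something like `push @rows, [ split /\t/ ]`, and `pick_pairs(@rows)` would be called after the loop. Holding every row in memory is a simplification. If all lines of an ID are always contiguous, each ID could be processed as soon as the ID changes.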
*Edit: I have added an additional column giving the number of HTTP header lines for each ID. This may help to know how many subsequent lines to check. Also, I modified ID '4' so that the first header line is a REPLY.*