
I am using AWK to read through a custom log file I have. The format is something like this:

[12:08:00 +0000] 192.168.2.3 98374 "CONNECT 192.168.2.4:8091 HTTP/1.0" 200

Right now I have awk (called from bash) reading the whole log and grabbing every line that contains "CONNECT". That works, but it does not help me discover unique clients.
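What I have so far is essentially just a pattern match, something along these lines:

awk '/CONNECT/ { print }' logfile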

The way to do this would be to filter on just this part of each line: "CONNECT 192.168.2.4:8091 HTTP/1.0"

Is there a way to grab all those lines from the log file, compare them, and count duplicate lines only once? So let's say, for example:

 [12:08:00 +0000] 192.168.2.3 98374 "CONNECT 192.168.2.6:8091 HTTP/2.0" 200
 [12:08:00 +0000] 192.168.2.3 98374 "CONNECT 192.168.2.9:8091 HTTP/2.0" 200
 [12:08:00 +0000] 192.168.2.3 98374 "CONNECT 192.168.2.2:8091 HTTP/2.0" 200
 [12:08:00 +0000] 192.168.2.3 98374 "CONNECT 192.168.2.9:8091 HTTP/2.0" 200

In this case, the answer I need would be 3, not 4: two of the lines are identical, so there are only 3 unique lines. What I need is an automated way to accomplish this with awk.

If anybody can lend a hand that would be great.

3 Answers


sed -re 's/.*"([^"]*)".*/\1/' <logfile> |sort |uniq

Awk variant: awk -F'"' '{print $2}' <logfile> |sort |uniq

Add -c to uniq to get a count of each matching line, or |wc -l to get a count of the number of matching lines.
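For example, run against the four sample lines from the question, the uniq -c variant gives one counted row per distinct request (count-column alignment varies between uniq implementations):

sed -re 's/.*"([^"]*)".*/\1/' logfile |sort |uniq -c
      1 CONNECT 192.168.2.2:8091 HTTP/2.0
      1 CONNECT 192.168.2.6:8091 HTTP/2.0
      2 CONNECT 192.168.2.9:8091 HTTP/2.0

and with |sort |uniq |wc -l instead it simply prints 3.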

SmallClanger

You could let awk count unique instances like this:

awk -F\" '/CONNECT/ && !seen[$2] { seen[$2]++ } END { print length(seen) }' logfile

Output:

3

This collects the first double-quoted string from each line containing CONNECT as a key in the seen associative array. When the end of input is reached, the number of elements in seen is printed.
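One portability note: applying length() to an array is a GNU awk extension. On other awks you can keep an explicit counter instead; a minimal equivalent sketch:

awk -F\" '/CONNECT/ && !seen[$2]++ { n++ } END { print n }' logfile

Here !seen[$2]++ is true only the first time a given quoted string appears, so n ends up as the number of unique requests.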

Thor

Running the log file through sort | uniq should filter out the duplicate lines, but I would question why you have those lines in there. Are they really duplicates?

If they are legitimate log entries and all you want is a unique list of clients (the client IP, which is $3 with awk's default field splitting) for lines which are non-duplicates, then a simple modification of @Thor's script should get you what you want:

awk '
/CONNECT/ {
  if (seen[$0] == 0) {   # first occurrence of this exact line
    clients[$3]++        # record the client IP
  }
  seen[$0]++
}
END {
  for (i in clients) {
    print i
  }
}' logfile

For the sample you've given, this results in:

192.168.2.3

This isn't as compact as Thor's script, but I usually find that as soon as I've written something like this, I want to do more with the lines themselves, so I've left the seen array (tracking the count of each distinct line) in there.
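For instance (a sketch building on the script above), printing each client together with how many distinct CONNECT lines it produced only takes a small change to the END loop; for the sample it would print 192.168.2.3 3:

awk '
/CONNECT/ {
  if (seen[$0] == 0) {
    clients[$3]++        # distinct-line count per client
  }
  seen[$0]++
}
END {
  for (i in clients) {
    print i, clients[i]  # client IP and its number of distinct lines
  }
}' logfile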

Paul Gear