Your pipeline is actually pretty good; it really just needs to scale for large counts. I replaced your tail -1000 access_log | awk '{ print $1 }' with an unsorted file of IP numbers from one of my web servers, and added head -20 to print just the 20 most active IP addresses.
$ sort ip.txt | uniq -c | sort -nr | \
> awk 'NR==1 { scale = $1/50 }
>      { printf("\n%-23s ", $0)
>        for (i = 0; i < ($1/scale); i++) printf("*")
>      }' | head -20
The important parts are NR==1 { scale = $1/50 }, which calculates the scaling factor that fits the maximum count into 50 characters, and printf("\n%-23s ", $0), which uses the width specifier %-23s to left-align the count and IP address within a 23-character field. In the output below, the top count of 824 gives scale = 824/50 = 16.48, so the next count of 149 works out to 149/16.48 ≈ 9.04, which the loop rounds up to 10 asterisks.
My output looks like this (I've masked the IP addresses):
824 xx.xxx.xx.39        **************************************************
149 xx.xxx.xxx.176      **********
138 xx.xxx.xxx.191      *********
137 xx.xxx.xxx.41       *********
105 xx.xxx.xxx.8        *******
 97 xx.xxx.xxx.21       ******
 96 xx.xxx.xx.220       ******
 91 xx.xx.xxx.198       ******
 87 xx.xxx.xxx.195      ******
 85 xx.xxx.xx.221       ******
 79 xxx.xxx.xxx.86      *****
 69 xx.xx.xx.12         *****
 68 xxx.xxx.xxx.159     *****
 65 xx.xxx.xxx.66       ****
 63 xx.xxx.xx.28        ****
 60 xx.xxx.xxx.104      ****
 59 xxx.xxx.xxx.242     ****
 59 xxx.xx.xxx.66       ****
 56 xx.xxx.xxx.202      ****
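Plugged back into your original pipeline (assuming the client address is still the first field of access_log, as in your tail command), the whole thing would look something like this:

$ tail -1000 access_log | awk '{ print $1 }' | \
> sort | uniq -c | sort -nr | \
> awk 'NR==1 { scale = $1/50 }
>      { printf("\n%-23s ", $0)
>        for (i = 0; i < ($1/scale); i++) printf("*")
>      }' | head -20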
This kind of output has a human-factors problem, though. People judge graphs like these by the length and area of the bars (the rows of asterisks); I'm not sure where I learned this, maybe from Tufte's books, or from studying statistics. Since this display scales itself to the magnitude of the numbers, you can't reliably compare two of these graphs by eye. The scaling might mean that the longest line on one graph represents 800, while an identical line on another graph represents only 100. Your eyes and brain want to believe those two are roughly equal, even though one is eight times as big as the other, and even though you can see the raw numbers.
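If you need graphs you can compare, one workaround is to pin the scale instead of deriving it from the data. Here's a minimal sketch that passes a fixed scale in with awk -v; the value 20 is an arbitrary choice of mine, and every graph you want to compare has to use the same value:

$ sort ip.txt | uniq -c | sort -nr | \
> awk -v scale=20 '{ printf("\n%-23s ", $0)
>                    for (i = 0; i < ($1/scale); i++) printf("*")
>                  }' | head -20

Now a count of 824 is always 42 asterisks, on this graph or any other, so bars of equal length really do represent roughly equal counts. The trade-off is that you have to pick a scale large enough that your biggest counts still fit across the screen.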