
I have this logfile:

20180917084726:-
20180917085418:[111783178, 111557953, 111646835, 111413356, 111412662, 105618372, 111413557]
20180917115418:[111413432, 111633904, 111783198, 111792767, 111557948, 111413225, 111413281]
20180917105419:[111413432, 111633904, 111783198, 111792767, 111557948, 111413225, 111413281]
20180917085522:[111344871, 111394583, 111295547, 111379566, 111352520]
20180917090022:[111344871, 111394583, 111295547, 111379566, 111352520]

The format of the input log is:

timestamp:[comma-separated list of IDs]

where the timestamp is in the format YYYYMMDDhhmmss and a `-` marks an entry with no IDs.

I would like to know how to write a script that outputs, for each ten-minute slice of the day, one line with the count of unique IDs that were returned.

The desired result looks like this:

20180917084:0
20180917085:12
20180917115:7
20180917105:7
choroba
Learner

4 Answers


This awk solution uses colon or comma as the field separator:

awk -F '[,:]' '
    {
        key = substr($1,1,11)"0"
        count[key] += ($2 == "-" ? 0 : NF-1)
    } 
    END {
        PROCINFO["sorted_in"] = "@ind_num_asc"
        for (key in count) print key, count[key]
    }
' file
201809170840 0
201809170850 12
201809170900 5
201809171050 7
201809171150 7
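Note that the `PROCINFO["sorted_in"]` line is GNU awk specific; with a POSIX awk you could get the same ordering by piping the unsorted output through `sort` instead. A minimal sketch of that variant (the two sample lines here are invented for illustration):

```shell
# Portable variant: same counting logic, ordering delegated to sort.
printf '%s\n' '20180917084726:-' '20180917085418:[1, 2, 3]' |
awk -F '[,:]' '
    {
        key = substr($1,1,11)"0"
        count[key] += ($2 == "-" ? 0 : NF-1)
    }
    END { for (key in count) print key, count[key] }
' | sort -n
# prints:
# 201809170840 0
# 201809170850 3
```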

To filter on today's date, you could say:

gawk -F '[,:]' '
    BEGIN {today = strftime("%Y%m%d", systime())}
    $0 ~ "^"today { key = ...

or

awk -F '[,:]' -v "today=$(date "+%Y%m%d")" '
    $0 ~ "^"today { key = ...

or pipe the existing awk code to | grep "^$(date +%Y%m%d)"

glenn jackman
  • It looks very good, what if I want to print only info for today? 20180917 (based on date) and no other days? – Learner Sep 18 '18 at 20:06

Could you please try the following; it will give you output in the same order in which the timestamps occur in the Input_file.

awk '
{
  val=substr($0,1,11)
}
!a[val]++{
  b[++count]=val
}
match($0,/\[.*\]/){
  num=split(substr($0,RSTART,RLENGTH),array,",")
  c[val]+=num
}
END{
  for(i=1;i<=count;i++){
    print b[i],c[b[i]]+0
  }
}'   Input_file

Output will be as follows.

20180917084 0
20180917085 12
20180917115 7
20180917105 7
20180917090 5
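The core of the counting is the `match()`/`split()` pair: `match()` isolates the bracketed list and `split()` returns how many comma-separated fields it contains. A standalone sketch with an invented input line:

```shell
# split() returns the number of fields, i.e. the number of IDs in the list.
echo '20180917085418:[111, 222, 333]' | awk '
match($0, /\[.*\]/) {
    n = split(substr($0, RSTART, RLENGTH), arr, ",")
    print n
}'
# prints: 3
```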

EDIT: Adding a solution in case any of the fields has a NULL value, with a check added to the above code.

awk '
{
  val=substr($0,1,11)
}
!a[val]++{
  b[++count]=val
}
match($0,/\[.*\]/){
  count1=""
  num=split(substr($0,RSTART,RLENGTH),array,",")
  for(j=1;j<=num;j++){
    if(array[j]){
      count1++
    }
  }
  c[val]+=count1
}
END{
  for(i=1;i<=count;i++){
    print b[i],c[b[i]]+0
  }
}'  Input_file
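The extra loop only counts non-empty fields, so an empty slot between commas is skipped. A minimal sketch of that check, with an invented line containing a NULL field:

```shell
# arr[2] is the empty string here, so only 2 of the 3 fields are counted.
echo '[111,,333]' | awk '
match($0, /\[.*\]/) {
    n = split(substr($0, RSTART, RLENGTH), arr, ",")
    c = 0
    for (j = 1; j <= n; j++) if (arr[j] != "") c++
    print c
}'
# prints: 2
```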
RavinderSingh13

Your input and output are not consistent, but I guess you want something like this:

 $ awk -F: '{k=sprintf("%10d",$1/1000); n=gsub(",",",",$2); a[k]+=(n?n+1:n)} 
        END {for(k in a) print k":"a[k] | "sort" }' file 

20180917084:0
20180917085:12
20180917090:5
20180917105:7
20180917115:7
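The `gsub(",", ",", $2)` call replaces every comma with itself, leaving the string unchanged but returning the number of matches, so the ID count is the comma count plus one (or zero when there are no commas, as on the `-` lines). A sketch of just that step, with invented input:

```shell
# gsub() returns the match count: 2 commas -> 3 IDs; "-" has none -> 0.
printf '%s\n' '[111, 222, 333]' '-' |
awk '{ n = gsub(",", ",", $0); print (n ? n+1 : n) }'
# prints:
# 3
# 0
```

Note that this mapping gives 0 for a single-ID line as well, since it also contains no commas.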
karakfa

Perl to the rescue!

perl -ne '
    ($timestamp, @ids) = /([0-9]+)/g;
    substr $timestamp, -3, 3, "";
    @{ $seen{$timestamp} }{@ids} = ();
    END {
        for my $timestamp (sort keys %seen) {
            print "$timestamp:", scalar keys %{ $seen{$timestamp} }, "\n";
        }
    }' < file.log
  • -n reads the input line by line
  • substr here replaces the last three characters of the timestamp with an empty string
  • %seen is a hash of hashes, for each timestamp the inner hash records what ids were seen
  • keys in scalar context returns the count of the keys, in this case the number of unique ids per timestamp.
choroba
    I'm curious why you didn't use a hash of arrays: `push @{$seen{$timestamp}}, @ids;`, then `print $timestamp, ":", scalar @{$seen{$timestamp}}, "\n";` – glenn jackman Sep 18 '18 at 19:30