I am trying to count the number of distinct values in field 12 of a file using gawk 4.1.4, and also count the number of times each of those values occurs. I have two short programs which are giving me different answers for the first question, and I am at a loss to explain why.

{if(a[$12]++==1){count++}} END {print count}

...gives a result of 435,176, whereas

{a[$12]++} END {for (i in a){count++};print count}

...gives a result of 599,845.

Can you explain this behaviour, and tell me which value is correct? I am running under Windows (ezwinport) and the field separator is tab.

Matti Wens

2 Answers

The second one is right. The per-value counts are already stored in the array, so you don't need a separate variable for them.

In both snippets, count is a single scalar. It is not tracked per distinct value, so it cannot tell you how often each value occurs.

For the per-value counts, use the values stored in the array itself.

The logic deriving count in

{if(a[$12]++==1){count++}} END {print count}

is wrong. Because of the post-increment operator, a[$12] is compared with 1 before it is incremented, so count is increased only when a value of $12 occurs for the second time. Values that occur only once are never counted, hence the lower count in your output.
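A minimal sketch of that difference, on made-up one-column data (the values a, b, c and the names data, seen, n, repeated and distinct are illustrative assumptions, not from the question). Comparing the post-incremented value with 1 fires on a value's second occurrence, while comparing it with 0 fires on its first:

```shell
# Hypothetical sample data: one value per line (a, a, a, b, b, c).
data=$(printf 'a\na\na\nb\nb\nc\n')

# Post-increment compared with 1: true only on a value's second occurrence,
# so this counts the values that occur at least twice (a and b).
repeated=$(printf '%s\n' "$data" | awk '{ if (seen[$1]++ == 1) n++ } END { print n + 0 }')

# Post-increment compared with 0: true only on a value's first occurrence,
# so this counts the distinct values (a, b and c).
distinct=$(printf '%s\n' "$data" | awk '{ if (seen[$1]++ == 0) n++ } END { print n + 0 }')

echo "repeated=$repeated distinct=$distinct"
```

On this data the first program reports 2 and the second reports 3, which mirrors the gap between the two results in the question.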

On the other hand,

{a[$12]++} END {for (i in a){count++};print count}

is right for the number of distinct values, and the number of occurrences of each value is already stored in the array a, indexed by the value of $12. To print those per-value counts, use

{a[$12]++; next} END {for (i in a) print a[i]}

A small example to demonstrate it,

cat file
1 2 3
1 2 3
1 2 1
1 1 1
2 3 1
3 4 1

Suppose I am interested in the distinct values in $2 and how often each occurs. Running your first example,

awk '{if(a[$2]++==1){count++}}END {for (i in a) print i,a[i],count}' file
1 1 1
2 3 1
3 1 1
4 1 1

Note the value of count printed in the last column: it is 1 on every line, because count is a single variable shared by all values rather than a count per value, and it was incremented only once, when 2 occurred for the second time.

The second approach looks fine, but it prints count as 4, which is only the number of distinct values and says nothing about how often each one occurs. To get the per-value counts, do,

awk '{a[$2]++; next}END {for (i in a) print i,a[i]}' file
1 1
2 3
3 1
4 1

Here, instead of count, a[i] holds the number of occurrences of each distinct value in column 2.
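If both results are wanted from a single pass, a sketch along these lines should work. It reuses the six-line sample from the listing above; the names sample, report and distinct are assumptions for illustration, and the sort is only there because the order of for (i in a) is unspecified:

```shell
# Same six-line sample as the "cat file" listing, fed via printf.
sample=$(printf '1 2 3\n1 2 3\n1 2 1\n1 1 1\n2 3 1\n3 4 1\n')

# One pass: tally each value of $2, then print every value with its count
# and finally the number of distinct values.
# LC_ALL=C sort makes the output order deterministic.
report=$(printf '%s\n' "$sample" | awk '
    { a[$2]++ }
    END {
        for (i in a) { print i, a[i]; distinct++ }
        print "distinct", distinct + 0
    }' | LC_ALL=C sort)

echo "$report"
```

For the tab-separated file in the question, the same program would presumably be run with -F'\t' and $12 in place of $2.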

Inian
  • The purpose of both the snippets was to simply count the number of unique values, whilst summing the counts of each value but not printing them. – Matti Wens Apr 10 '17 at 09:53
  • @MattWenham: Fair enough, don't print them, just use the array value. – Inian Apr 10 '17 at 09:54

The first one is wrong (logically, not syntactically; thank you for emphasizing that, @GeorgeVasiliou), because you need to apply ++ before comparing with ==, i.e. ++a[$1]==1:

$ awk '{if(++a[$1]==1){count++}} END {print count}' foo
3

Oh yeah, my test foo:

$ cat foo
1
1
1
2
2
3
James Brown
  • Yes, or can use `{if(a[$12]++==0){count++}} END {print count}` but I feel that's slightly harder-to-grok code. – Matti Wens Apr 10 '17 at 09:59
  • Good stuff! My only problem with this is, `count` is not needed, you have the array value itself to make it work! – Inian Apr 10 '17 at 10:00
  • @Inian Tru dat. Since gawk (lol) was mentioned in the OP, one could get the number of distinct values with `length()`: `{a[$1]++} END {print length(a)}`, right? – James Brown Apr 10 '17 at 10:03
  • It seems like that works too. I wasn't aware you could use `length` on an array in this way, thanks. – Matti Wens Apr 10 '17 at 10:08
  • The first one used by OP is wrong in terms of specific programming logic but not wrong in terms of syntax. The syntax `if(a[$12]++==1){count++}` will increase `count` by 1 every time that `$12` is found exactly two times. This is because of the post increment - `a[$12]` is first evaluated by if and then increased with `++` operation. – George Vasiliou Apr 10 '17 at 10:15
  • Yep, this correctly describes the bug in my first example snippet, I needed to use pre- and not post-fix `++`. The `length` example above is more succinct, however. – Matti Wens Apr 10 '17 at 10:28
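A quick side-by-side check of the two fixes discussed in these comments, using the foo data from the answer. This is a sketch that assumes an awk whose length() accepts an array argument (gawk does; very old awks may not); the shell variable names foo, pre and len are illustrative:

```shell
# The six-line foo sample from the answer, fed via printf.
foo=$(printf '1\n1\n1\n2\n2\n3\n')

# Pre-increment: the comparison sees 1 on a value's first occurrence.
pre=$(printf '%s\n' "$foo" | awk '{ if (++a[$1] == 1) count++ } END { print count + 0 }')

# length(a): the number of elements in the array, i.e. distinct values.
len=$(printf '%s\n' "$foo" | awk '{ a[$1]++ } END { print length(a) }')

echo "pre-increment: $pre, length(a): $len"
```

Both forms report 3 for this input. The length(a) form avoids the extra counter, while the pre-increment form does not depend on array support in length().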