
I have a tab-separated file looking like this:

A 1234
A 123245
A 4546
A 1234
B 24234
B 4545
C 1234
C 1234

Output: 
A 3
B 2
C 1

Basically I need, for each value in the first column, the count of unique values in the second column, all in one command with pipelines. As you may see, there can be some duplicates like "A 1234". I had some ideas with awk or cut, but neither of them seems to work. They just print out all unique pairs, while I need the count of unique second-column values for each value in the first column.

awk -F " "'{print $1}' file.tsv | uniq -c
cut -d' ' -f1,2 file.tsv | sort | uniq -ci

I'd really appreciate your help! Thank you in advance.

ta4le
  • sort + uniq and then [this](https://stackoverflow.com/questions/27986425/using-awk-to-count-the-number-of-occurrences-of-a-word-in-a-column) – KamilCuk Jun 19 '20 at 09:45
    Why are you using `cut -d' '` (i.e. telling `cut` to use a blank instead of a tab as the separator) when your file is tab-separated? – Ed Morton Jun 19 '20 at 12:18

5 Answers


For a complete awk solution, could you please try the following.

awk 'BEGIN{FS=OFS="\t"} !found[$0]++{val[$1]++} END{for(i in val){print i,val[i]}}' Input_file

Explanation: adding a detailed explanation for the above.

awk '                    ##Starting awk program from here.
BEGIN{
  FS=OFS="\t"            ##Setting input and output field separators to a tab here.
}
!found[$0]++{            ##Checking if this whole line (1st and 2nd columns together) is NOT yet present in found array; if new, do following.
  val[$1]++              ##Creating val with 1st column as index and keep increasing its value here.
}
END{                     ##Starting END block of this program from here.
  for(i in val){         ##Traversing through array val here.
    print i,val[i]       ##Printing i and value of val with index i here.
  }
}
' Input_file             ##Mentioning Input_file name here.
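
Note that `for(i in val)` visits indices in an unspecified order, so the groups may not come out alphabetically; if you want them sorted like the expected output in the question, you could simply pipe the result through `sort`:

awk 'BEGIN{FS=OFS="\t"} !found[$0]++{val[$1]++} END{for(i in val){print i,val[i]}}' Input_file | sort

With the sample data this should print A 3, B 2 and C 1.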
RavinderSingh13

Using GNU awk:

$ gawk -F\\t '{a[$1][$2]}END{for(i in a)print i,length(a[i])}' file

Output:

A 3
B 2
C 1

Explained:

 $ gawk -F\\t '{               # using GNU awk and tab as delimiter
    a[$1][$2]                  # hash to 2D array
 }
 END {                         
     for(i in a)               # for all values in first field
         print i,length(a[i])  # output value and the size of related array
 }' file
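
A small caveat: `a[$1][$2]` (a true array of arrays) and calling `length()` on a sub-array are GNU awk features, so this won't run under mawk or BSD awk. A rough, untested sketch of the same idea in portable awk would use a composite key instead, much like the first answer:

awk -F'\t' '!seen[$1,$2]++{cnt[$1]++} END{for(k in cnt) print k, cnt[k]}' file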
James Brown
$ sort -u file | cut -f1 | uniq -c
   3 A
   2 B
   1 C
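
In short: `sort -u` drops duplicate lines, `cut -f1` keeps only the first column (tab is cut's default delimiter), and `uniq -c` counts the now-adjacent repeats. If you prefer the key before the count, exactly as in the question's expected output, one extra `awk` swap should do it:

sort -u file | cut -f1 | uniq -c | awk '{print $2, $1}'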
Ed Morton

Another way, using the handy GNU datamash utility:

$ datamash -g1 countunique 2 < input.txt
A   3
B   2
C   1

This requires the input file to be sorted on the first column, like your sample. If the real file isn't, add `-s` to the options.
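
For example, with unsorted input the call would look something like this (untested sketch; `-s` makes datamash sort on the grouping field first):

datamash -s -g1 countunique 2 < input.txt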

Shawn

You could try this:

cat file.tsv | sort | uniq | awk '{print $1}' | uniq -c | awk '{print $2 " " $1}'

It works for your example. (But I'm not sure if it works for other cases. Let me know if it doesn't work!)
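
As a side note, `sort | uniq` can be shortened to `sort -u`, the `cat` isn't needed because `sort` can read the file itself, and the final `awk` only swaps `uniq -c`'s count-first output back to key-first. A slightly shorter equivalent of the same pipeline (untested here) would be:

sort -u file.tsv | awk '{print $1}' | uniq -c | awk '{print $2 " " $1}'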