Counting the number of unique values based on more than two columns in bash

Question

I need to modify the below code to work on more than one column.

Counting the number of unique values based on two columns in bash

awk '                  ##Starting awk program from here.
BEGIN{
  FS=OFS="\t"
}
!found[$0]++{       ##Checking whether the current line ($0) is NOT yet present in found array; if so, do the following.
  val[$1]++            ##Creating val with 1st column as index and keep increasing its value here.
}
END{                   ##Starting END block of this program from here.
  for(i in val){       ##Traversing through array val here.
    print i,val[i]     ##Printing i and value of val with index i here.
  }
}
'  Input_file          ##Mentioning Input_file name here.

Table in which to count how many times each value occurs (all DIS columns):

patient  sex    DISa  DISb  DISc  DISd   DISe   DISf  DISg  DISh  DISi
patient1 male   550.1 550.5 594.1 594.3  594.8  591   1019  960.1 550.1
patient2 female 041   208   250.2 276.14 426.32 550.1 550.5 558   041
patient3 female NA    NA    NA    NA     NA     NA    NA    041   NA

The output I need is:

550.1    3
550.5    2
594.1    1
594.3    1
594.8    1
591    1
1019    1
960.1    1
550.1    1
041    3
208    1
250.2    1
276.14    1
426.32    1
558    1

score 3 · Accepted Answer · answered Jun 15 '21 at 19:48

Consider this awk:

awk -v OFS='\t' 'NR > 1 {for (i=3; i<=NF; ++i) if ($i+0 == $i) ++fq[$i]} END {for (i in fq) print i, fq[i]}' file

276.14  1
960.1   1
594.3   1
426.32  1
208     1
041     3
594.8   1
550.1   3
591     1
1019    1
558     1
550.5   2
250.2   1
594.1   1

A more readable form:

awk -v OFS='\t' '
NR > 1 {
   for (i=3; i<=NF; ++i)
      if ($i+0 == $i)
         ++fq[$i]
}
END {
   for (i in fq)
      print i, fq[i]
}' file

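$i+0 == $i checks that the column value is numeric. For a non-numeric field such as NA, awk compares the two sides as strings ("0" versus "NA"), so the test fails and the field is skipped.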
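
As a rough illustration of that test (a sketch only, not part of the original answer), the snippet below runs a few sample values through the same comparison. The function name classify_values is hypothetical, and the values 1foo and 0.5 are made-up inputs; 041, 550.1 and NA come from the question's table.

```shell
# Hypothetical helper: label each whitespace-separated input field as
# numeric or not, using the same ($i+0 == $i) test as the answer above.
classify_values() {
    awk '{
        for (i = 1; i <= NF; ++i)
            print $i, ($i+0 == $i ? "numeric" : "not numeric")
    }'
}

# Sample run: 041, 550.1 and 0.5 should pass; NA and 1foo should not.
printf '041 550.1 NA 1foo 0.5\n' | classify_values
```

Note that 041 passes the test but is still stored in fq under its original text, which is why the output shows 041 rather than 41.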

David C. Rankin · Answer 2 · 2021-06-15T20:01:50.160

If the ordering must be preserved, then you need an additional array b[] to record the order in which each number is first encountered, e.g.

awk '
    BEGIN { OFS = "\t" }
    FNR > 1 { 
        for (i=3;i<=NF;i++)
            if ($i~/^[0-9]/) { 
                if (!($i in a))
                    b[++n] = $i;
                a[$i]++
            }
    }
    END {
        for (i=1;i<=n;i++)
            print b[i], a[b[i]]
}' file

Example Use/Output

$ awk '
>     BEGIN { OFS = "\t" }
>     FNR > 1 {
>         for (i=3;i<=NF;i++)
>             if ($i~/^[0-9]/) {
>                 if (!($i in a))
>                     b[++n] = $i;
>                 a[$i]++
>             }
>     }
>     END {
>         for (i=1;i<=n;i++)
>             print b[i], a[b[i]]
> }' patients
550.1   3
550.5   2
594.1   1
594.3   1
594.8   1
591     1
1019    1
960.1   1
041     3
208     1
250.2   1
276.14  1
426.32  1
558     1

Let me know if you have further questions.

I wasn't sure whether it was needed or not. I guess I should update for `OFS` as well. — David C. Rankin, Jun 15 '21 at 20:01

RavinderSingh13 · Answer 3 · 2021-06-15T23:03:42.990

With all respect to the two answers above (@anubhava and @David), I am taking their complete solutions and adding a small tweak (a check for an integer value, as per the OP's shown samples), which gives the two solutions below. Written and tested with the shown samples only.

1st solution: If order doesn't matter in the output, try:

awk -v OFS='\t' '
NR > 1 {
   for (i=3; i<=NF; ++i)
      if (int($i))
         ++fq[$i]
}
END {
   for (i in fq)
      print i, fq[i]
}' Input_file

2nd solution: If order matters, then based on David's answer try:

awk '
    BEGIN { OFS = "\t" }
    FNR > 1 { 
        for (i=3;i<=NF;i++)
            if (int($i)) { 
                if (!($i in a))
                    b[++n] = $i;
                a[$i]++
            }
    }
    END {
        for (i=1;i<=n;i++)
            print b[i], a[b[i]]
}' Input_file

edited Jun 15 '21 at 23:03

answered Jun 15 '21 at 22:31

Most of the OP's numeric values aren't integers and some of his non-numeric values contain digits, so converting them to integers and testing the result would fail for a number like `0.5` or a string like `1foo`. The test for a number is `($i+0 == $i)`, as in @anubhava's code. – Ed Morton Jun 16 '21 at 04:19

@EdMorton, sure thank you sir, I will try to edit them if I get some more thoughts on this one(because anubhava already covered that), thank you. – RavinderSingh13 Jun 16 '21 at 04:20
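
To make the objection raised in the comments concrete, here is a small sketch that puts the two tests side by side. The function name compare_tests and the sample inputs are illustrative only.

```shell
# Hypothetical helper: for each field, report whether int($i) and
# ($i+0 == $i) would accept it as a value to count.
compare_tests() {
    awk '{
        for (i = 1; i <= NF; ++i)
            print $i, (int($i) ? "yes" : "no"), ($i+0 == $i ? "yes" : "no")
    }'
}

# Columns printed: value, accepted by int($i), accepted by ($i+0 == $i)
printf '0.5 1foo 550.1 NA\n' | compare_tests
```

With these inputs, int($i) rejects 0.5 (it truncates to 0) and accepts 1foo (its leading digit converts to 1), whereas ($i+0 == $i) does the opposite in both cases. Both tests agree on 550.1 and NA.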

score 1 · Answer 4 · answered Jun 16 '21 at 04:22

Using GNU awk for multi-char RS:

$ awk -v RS='[[:space:]]+' '$0+0 == $0' file | sort | uniq -c
      3 041
      1 1019
      1 208
      1 250.2
      1 276.14
      1 426.32
      3 550.1
      2 550.5
      1 558
      1 591
      1 594.1
      1 594.3
      1 594.8
      1 960.1

If the order of fields really matters just pipe the above to awk '{print $2, $1}'.
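
For an awk without multi-character RS support, one possible variant (an assumption on my part, not something from the answer) is to let tr put every token on its own line first. The sketch below also includes the final column swap; the function name count_values is hypothetical.

```shell
# Hypothetical wrapper: count the numeric tokens in the file named by $1.
# tr stands in for GNU awk's regex RS by placing each token on its own line.
count_values() {
    tr -s '[:space:]' '\n' < "$1" |
        awk '$0+0 == $0' |          # keep only lines that are numbers
        sort | uniq -c |            # count identical values
        awk '{print $2, $1}'        # swap to "value count" order
}

# Usage (file is the input table from the question):
#   count_values file
```

The output is ordered by sort, not by first appearance, so this suits the case where the original order does not matter.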
