3

I am processing output from a file in bash and need to group values by their keys.

For example, I have the following data

13,47099
13,54024
13,1
13,39956
13,0
17,126223
17,52782
17,4
17,62617
17,0
23,1022724
23,79958
23,80590
23,230
23,1
23,118224
23,0
23,1049
42,72470
42,80185
42,2
42,89199
42,0
54,70344
54,72824
54,1
54,62969
54,1

in a file, and I want to group all values for a particular key onto a single line, as in

13,47099,54024,1,39956,0
17,126223,52782,4,62617,0
23,1022724,79958,80590,230,1,118224,0,1049
42,72470,80185,2,89199,0
54,70344,72824,1,62969,1

There are about 10000 entries in my input file. How do I transform this data in shell?

Anoop

3 Answers

5

awk to the rescue!

assuming keys are contiguous...

$ awk -F, 'p!=$1 {if(a) print a; a=p=$1} 
                 {a=a FS $2} 
           END   {print a}' file

13,47099,54024,1,39956,0                                                                                                                  
17,126223,52782,4,62617,0                                                                                                                 
23,1022724,79958,80590,230,1,118224,0,1049                                                                                                
42,72470,80185,2,89199,0                                                                                                                  
54,70344,72824,1,62969,1    
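
If the keys are not contiguous, one possible variant collects the values in an awk array instead of relying on the previous line. This is only a sketch under that assumption, not part of the one-liner above: `group_by_key`, `vals`, `order` and `n` are hypothetical names, and the groups are printed in the order each key first appears.

```shell
# Sketch: group values by key when equal keys may be scattered through the input.
# group_by_key is a hypothetical helper; it reads the named files or stdin.
group_by_key() {
  awk -F, '
    !($1 in vals) { order[++n] = $1; vals[$1] = $1 }  # first time this key is seen
                  { vals[$1] = vals[$1] FS $2 }       # append the value to its group
    END           { for (i = 1; i <= n; i++) print vals[order[i]] }
  ' "$@"
}

# Interleaved keys, made up for illustration
printf '%s\n' 13,1 17,4 13,2 17,5 | group_by_key
```

For the four sample lines this prints `13,1,2` and then `17,4,5`. The whole input is held in memory, which should be no problem for about 10000 entries.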
karakfa
  • Perfect answer. Just what I wanted. – Anoop Jun 07 '17 at 18:29
  • If the keys are not contiguous, you can `sort` them first and then pipe the result into the above `awk` code, e.g. `sort -n -k 1 -t "," [file] | awk ...` – Josh Jan 22 '20 at 18:57
  • @karakfa, I'm a bit new to `awk` and trying to understand your code. It appears to check if `p` is not equal to the first field and, if not, set `p` and `a` equal to the first field, and then print `a`. However, the order of the steps I just described is opposite the order of the operations in the first line of your code. Am I understanding your code correctly? – Josh Jan 22 '20 at 19:24
  • [This tutorial](https://www.grymoire.com/Unix/Awk.html) mentions that variable definitions can be set inline with the commands that use them using this example `awk '{print $c}' c="${1:-1}"`, but in that case the variable `c` is set outside the `'{...}'` awk command – Josh Jan 22 '20 at 19:38
  • @Josh, if the key changes, print the existing record (if there is one) and start building the new one. The second statement is executed regardless of the condition. At the end, print the leftover record. – karakfa Jan 22 '20 at 20:11
  • @karakfa, thanks! I was writing out what I think the code is doing in prose when you posted. I'll post my breakdown of your code as an answer for newbs like me. – Josh Jan 22 '20 at 20:41
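
The sort-then-group idea from the comments could look roughly like this. It is a sketch, assuming a `sort` that supports the stable flag `-s` (GNU and BSD sort do), so that values keep their original order within each key; `group_sorted` is a hypothetical name.

```shell
# Sketch: sort numerically on the first field only, keep input order within a key,
# then feed the result to the awk one-liner from the answer.
group_sorted() {
  sort -t, -k1,1n -s "$@" |
    awk -F, 'p!=$1 {if(a) print a; a=p=$1} {a=a FS $2} END {print a}'
}

# Interleaved keys, made up for illustration
printf '%s\n' 17,4 13,1 17,5 13,2 | group_sorted
```

This prints `13,1,2` and then `17,4,5`. Restricting the key with `-k1,1` and adding `-s` matters: with `-k 1` alone the sort key runs to the end of the line, and ties may be broken by comparing whole lines, which can reorder the values.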
1

Here is a breakdown of what @karakfa's code is doing, for us awk beginners. I've written this based on a toy dataset file:

1,X
1,Y
3,Z
  • p!=$1: check if the pattern p!=$1 is true
    • checks if variable p is not equal to the first field of the current (first) line of file (1 in this case)
    • since p is undefined at this point it cannot be equal to 1, so p!=$1 is true and we continue with this line of code
  • if(a) print a: check if variable a is non-empty and print a if it is
    • since a is undefined at this point the print a command is not executed
  • a=p=$1: set variables a and p equal to the value of the first field of the current (first) line (1 in this case)
  • a=a FS $2: this rule has no pattern, so it runs on every line; set variable a equal to a combined with the value of the second field of the current (first) line, separated by the field separator (1,X in this case)
  • END: since we haven't reached the end of file yet, we skip the rest of this line of code
  • move to the next (second) line of file and restart the awk code on that line

  • p!=$1: check if the pattern p!=$1 is true
    • since p is 1 and the first field of the current (second) line is 1, p!=$1 is false and we skip the rest of this line of code
  • a=a FS $2: set a equal to the value of a and the value of the second field of the current (second) line, separated by the field separator (1,X,Y in this case)
  • END: since we haven't reached the end of file yet, we skip the rest of this line of code
  • move to the next (third) line of file and restart the awk code

  • p!=$1: check if the pattern p!=$1 is true
    • since p is 1 and $1 of the third line is 3, p!=$1 is true and we continue with this line of code
  • if(a) print a: check if variable a is non-empty and print a if it is
    • since a is 1,X,Y at this point, 1,X,Y is printed to the output
  • a=p=$1: set variables a and p equal to the value of the first field of the current (third) line (3 in this case)
  • a=a FS $2: set variable a equal to a combined with the value of the second field of the current (third) line, separated by the field separator (3,Z in this case)
  • END {print a}: since we have reached the end of file, execute this code
    • print a: print the last group a (3,Z in this case)

The resulting output is

1,X,Y
3,Z
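
To try the walkthrough yourself, here is the same awk program spread over several lines with one comment per step, fed with the toy dataset. The wrapper name `group_lines` is hypothetical; the awk logic is unchanged from @karakfa's answer.

```shell
# Same program as in the answer, reformatted and commented.
group_lines() {
  awk -F, '
    p != $1 {            # key changed (p is still empty on the first line)
      if (a) print a     # flush the previous group, if there is one
      a = p = $1         # start a new group that begins with the key
    }
    { a = a FS $2 }      # no pattern: runs on every line, appends the value
    END { print a }      # after the last line, flush the final group
  ' "$@"
}

printf '%s\n' 1,X 1,Y 3,Z | group_lines
```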

Please let me know if there are any errors in this description.

Josh
0

Slight tweak to @karakfa's answer. If you want the separator between the key and the values to be different than the separator between the values, you can use this code:

awk -F, 'p==$1 {a=a "; " $2} p!=$1 {if(a) print a; a=$0; p=$1} END {print a}' file
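
As a quick check of what this produces, here is the same command run on a three-line toy input. The wrapper name `group_semicolon` is hypothetical and only there to make the command reusable.

```shell
# The tweak above wrapped in a function; it reads the named files or stdin.
group_semicolon() {
  awk -F, 'p==$1 {a=a "; " $2} p!=$1 {if(a) print a; a=$0; p=$1} END {print a}' "$@"
}

printf '%s\n' 1,X 1,Y 3,Z | group_semicolon
```

This prints `1,X; Y` and then `3,Z`: the key and the first value stay joined by the original comma because the group is started from `$0`, while later values are appended with `; `.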
Josh