Using awk to sort fields and arrange

Question

Im trying to learn awk at the moment and I want to do a specific task. My question is similar in scope to one previously posted(Using awk to transpose column to row), but wouldn't quite work for my data. I have been trying to work out why and im sure its quite simple.

I have large data in a tab delimited table with only two fields (example below):

and I want to end up appending the second field onto a row when the field matches. The desired output would be:

1101\t7778\t7755\t8889\t6789
2300\t1220
4000\t2333\t7555\t9000\t1111

If possible, Id like to get an explaination of all the parts within the command so I can understand it in the future. Thanks in advance.

well +1 for anyone who gives a pure bash solution :) – abasu May 25 '13 at 08:07 — abasu, May 25 '13 at 08:07
@gammyknee: And the Oscar goes to...? – TrueY May 30 '13 at 09:23 — TrueY, May 30 '13 at 09:23

Jonathan Leffler · Answer 1 · 2013-05-25T04:35:57.800

5

awk '    { list[$1] = list[$1] "\t" $2 }
     END { for (i in list) printf "%s%s\n", i, list[i] }' data

The first line adds a tab and the second field to the list element indexed by $1. The second line prints out the key and the accumulated list of values.

Sample output:

1101    7778    7755    8889    6789
4000    2333    7555    9000    1111
2300    1220

If you want the first column sorted, you can pipe the output through sort -n. If you have GNU awk, you can investigate the built-in sort function too:

/usr/gnu/bin/awk '    { list[$1] = list[$1] "\t" $2 }
                  END { n = asorti(list, indexes);
                        for (i = 1; i <= n; i++)
                            printf "%s%s\n", indexes[i], list[indexes[i]]
                      }' data

Sorted output:

1101    7778    7755    8889    6789
2300    1220
4000    2333    7555    9000    1111

edited May 25 '13 at 04:35

answered May 25 '13 at 04:27

Jonathan Leffler

730,956
141
904
1,278

Excellent thanks! Would there be any conflict with different characters? Using the numerical data this answer worked perfectly, but with some other characters the solution doesn't work. for eg: PAK_74_MK_1st_fwd\tPBK_85_MK_2nd_RC PAK_74_MK_1st_fwd\tPBK_124_MK_2nd_RC PAK_74_MK_1st_fwd\tPBK_94_MK_2nd_RC PAK_74_MK_1st_fwd\tPBK_150_MK_2nd_RC PAK_75_MK_1st_fwd\tPBK_128_MK_2nd_RC PAK_75_MK_1st_fwd\tPBK_134_MK_2nd_RC PAK_75_MK_1st_fwd\tPBK_139_MK_2nd_RC PAK_75_MK_1st_fwd\tPBK_69_MK_2nd_RC Is there any reason why it won't work? – gammyknee May 25 '13 at 23:48
Actually that works anyway. Must be a dodgey character later on in the file. Thanks for the explanation about the awk command :) – gammyknee May 25 '13 at 23:54
It should be fine with a fairly varied range of inputs. It splits on white space (arbitrary sequences of blanks and tabs); it does not require the key values in any specific order, nor all the entries for a given key to appear on adjacent lines. If something contained two values for one key, then you'd need to iterate on `NF` (the number of fields) in the accumulate line. If you needed to sort the values for a key, that would require extra processing. About the only thing that would stop it 'working' is an actual backslash lower-case 't' sequence in the data. – Jonathan Leffler May 26 '13 at 01:40

score 3 · Answer 2 · answered May 25 '13 at 13:02

3

For abasu's request a pure bash version:

#!/bin/bash

declare -A hash
while read x y; do
  hash[$x]=${hash[$x]}"\t"$y
done <<XXX
1101    7778
1101    7755
1101    8889
1101    6789
2300    1220
4000    2333
4000    7555
4000    9000
4000    1111
XXX

for i in ${!hash[*]}; { echo -e $i${hash[$i]};}

Output:

2300    1220
1101    7778    7755    8889    6789
4000    2333    7555    9000    1111

In the here-is-the-document there is a tab character between the columns, as well as in between the output columns. If the -e is removed from the last line after echo the output is:

2300\t1220
1101\t7778\t7755\t8889\t6789
4000\t2333\t7555\t9000\t1111

answered May 25 '13 at 13:02

TrueY

7,360
1
41
46

awesome :) nice solution, without any external tool – abasu May 25 '13 at 14:22
thats nothing, if you make the last line as `for i in ${!hash[*]}; { echo $i${hash[$i]};} | sort` it will give a sorted output – abasu May 25 '13 at 15:41
1

@abasu: Unfortunately `sort` is not a `bash` internal command so using it would not be a pure `bash` solution! The `sort` functionality should be programmed as a `bash` function, but I skip it now. :) – TrueY May 25 '13 at 15:47
true, I missed the point, `sort` makes it a non-pure bash solution. – abasu May 25 '13 at 15:51

glenn jackman · Answer 3 · 2013-05-25T18:16:10.383

3

This version does not end up storing the whole file in memory. Neither does it rearrange the order of the keys.

awk -F '\t' '
    $1 != prev {
        if (prev) print ""
        printf "%s", $1
        prev=$1
    }
    {printf "%s%s", FS, $2}
    END {print ""}
' f

1101    7778    7755    8889    6789
2300    1220
4000    2333    7555    9000    1111

edited May 25 '13 at 18:16

answered May 25 '13 at 18:07

glenn jackman

238,783
38
220
352

score 1 · Answer 4 · edited May 23 '17 at 12:05

1

Inspired from Kent's answer.

awk '{
         a[$1]=a[$1] ? a[$1] FS $2 : $2
    } 
END {
        for (key in a) print key,a[key]
    }' FS='\t' OFS='\t' f

edited May 23 '17 at 12:05

Community

1
1

answered May 25 '13 at 04:30

jaypal singh

74,723
23
102
147

score 0 · Answer 5 · answered May 25 '13 at 19:15

Another pure bash implementation just for fun, uses string slicing instead of regexps assuming 4 digit numbers only, so it's not robust, but the version of bash I'm working with doesn't have built-in regexp so I was at a loss as to what else I could do!

#!/bin/bash

while read line; do
        array[${line:0:4}]="${array[${line:0:4}]}${line:4:8}"
        indicies[${line:0:4}]=${line:0:4}

done < $1

for i in ${indicies[@]}; do
        echo "$i${array[$i]}"
done

Using awk to sort fields and arrange

5 Answers5

Linked