4

Im trying to learn awk at the moment and I want to do a specific task. My question is similar in scope to one previously posted(Using awk to transpose column to row), but wouldn't quite work for my data. I have been trying to work out why and im sure its quite simple.

I have large data in a tab delimited table with only two fields (example below):

1101\t7778
1101\t7755
1101\t8889
1101\t6789
2300\t1220
4000\t2333
4000\t7555
4000\t9000
4000\t1111

and I want to end up appending the second field onto a row when the field matches. The desired output would be:

1101\t7778\t7755\t8889\t6789
2300\t1220
4000\t2333\t7555\t9000\t1111

If possible, Id like to get an explaination of all the parts within the command so I can understand it in the future. Thanks in advance.

Community
  • 1
  • 1
gammyknee
  • 59
  • 1
  • 3

5 Answers5

5
awk '    { list[$1] = list[$1] "\t" $2 }
     END { for (i in list) printf "%s%s\n", i, list[i] }' data

The first line adds a tab and the second field to the list element indexed by $1. The second line prints out the key and the accumulated list of values.

Sample output:

1101    7778    7755    8889    6789
4000    2333    7555    9000    1111
2300    1220

If you want the first column sorted, you can pipe the output through sort -n. If you have GNU awk, you can investigate the built-in sort function too:

/usr/gnu/bin/awk '    { list[$1] = list[$1] "\t" $2 }
                  END { n = asorti(list, indexes);
                        for (i = 1; i <= n; i++)
                            printf "%s%s\n", indexes[i], list[indexes[i]]
                      }' data

Sorted output:

1101    7778    7755    8889    6789
2300    1220
4000    2333    7555    9000    1111
Jonathan Leffler
  • 730,956
  • 141
  • 904
  • 1,278
  • Excellent thanks! Would there be any conflict with different characters? Using the numerical data this answer worked perfectly, but with some other characters the solution doesn't work. for eg: PAK_74_MK_1st_fwd\tPBK_85_MK_2nd_RC PAK_74_MK_1st_fwd\tPBK_124_MK_2nd_RC PAK_74_MK_1st_fwd\tPBK_94_MK_2nd_RC PAK_74_MK_1st_fwd\tPBK_150_MK_2nd_RC PAK_75_MK_1st_fwd\tPBK_128_MK_2nd_RC PAK_75_MK_1st_fwd\tPBK_134_MK_2nd_RC PAK_75_MK_1st_fwd\tPBK_139_MK_2nd_RC PAK_75_MK_1st_fwd\tPBK_69_MK_2nd_RC Is there any reason why it won't work? – gammyknee May 25 '13 at 23:48
  • Actually that works anyway. Must be a dodgey character later on in the file. Thanks for the explanation about the awk command :) – gammyknee May 25 '13 at 23:54
  • It should be fine with a fairly varied range of inputs. It splits on white space (arbitrary sequences of blanks and tabs); it does not require the key values in any specific order, nor all the entries for a given key to appear on adjacent lines. If something contained two values for one key, then you'd need to iterate on `NF` (the number of fields) in the accumulate line. If you needed to sort the values for a key, that would require extra processing. About the only thing that would stop it 'working' is an actual backslash lower-case 't' sequence in the data. – Jonathan Leffler May 26 '13 at 01:40
3

For abasu's request a pure bash version:

#!/bin/bash

declare -A hash
while read x y; do
  hash[$x]=${hash[$x]}"\t"$y
done <<XXX
1101    7778
1101    7755
1101    8889
1101    6789
2300    1220
4000    2333
4000    7555
4000    9000
4000    1111
XXX

for i in ${!hash[*]}; { echo -e $i${hash[$i]};}

Output:

2300    1220
1101    7778    7755    8889    6789
4000    2333    7555    9000    1111

In the here-is-the-document there is a tab character between the columns, as well as in between the output columns. If the -e is removed from the last line after echo the output is:

2300\t1220
1101\t7778\t7755\t8889\t6789
4000\t2333\t7555\t9000\t1111
TrueY
  • 7,360
  • 1
  • 41
  • 46
  • awesome :) nice solution, without any external tool – abasu May 25 '13 at 14:22
  • thats nothing, if you make the last line as `for i in ${!hash[*]}; { echo $i${hash[$i]};} | sort` it will give a sorted output – abasu May 25 '13 at 15:41
  • 1
    @abasu: Unfortunately `sort` is not a `bash` internal command so using it would not be a pure `bash` solution! The `sort` functionality should be programmed as a `bash` function, but I skip it now. :) – TrueY May 25 '13 at 15:47
  • true, I missed the point, `sort` makes it a non-pure bash solution. – abasu May 25 '13 at 15:51
3

This version does not end up storing the whole file in memory. Neither does it rearrange the order of the keys.

awk -F '\t' '
    $1 != prev {
        if (prev) print ""
        printf "%s", $1
        prev=$1
    }
    {printf "%s%s", FS, $2}
    END {print ""}
' f
1101    7778    7755    8889    6789
2300    1220
4000    2333    7555    9000    1111
glenn jackman
  • 238,783
  • 38
  • 220
  • 352
1

Inspired from Kent's answer.

awk '{
         a[$1]=a[$1] ? a[$1] FS $2 : $2
    } 
END {
        for (key in a) print key,a[key]
    }' FS='\t' OFS='\t' f
Community
  • 1
  • 1
jaypal singh
  • 74,723
  • 23
  • 102
  • 147
0

Another pure bash implementation just for fun, uses string slicing instead of regexps assuming 4 digit numbers only, so it's not robust, but the version of bash I'm working with doesn't have built-in regexp so I was at a loss as to what else I could do!

#!/bin/bash

while read line; do
        array[${line:0:4}]="${array[${line:0:4}]}${line:4:8}"
        indicies[${line:0:4}]=${line:0:4}

done < $1

for i in ${indicies[@]}; do
        echo "$i${array[$i]}"
done
qwwqwwq
  • 6,999
  • 2
  • 26
  • 49