
Given this sample input:

ID     Sample1     Sample2      Sample3
One      10          0            5
Two      3           6            8
Three    3           4            7

I needed to produce this output using AWK:

ID    Sample1 Sample2 Sample3
One   62.50   0.00    25.00
Two   18.75   60.00   40.00
Three 18.75   40.00   35.00

This is how I solved it:

function percent(value, total) {
    return sprintf("%.2f", 100 * value / total)
}
{
    label[NR] = $1
    for (i = 2; i <= NF; ++i) {
        sum[i] += col[i][NR] = $i
    }
}
END {
    title = label[1]
    for (i = 2; i <= length(col) + 1; ++i) {
        title = title "\t" col[i][1]
    }
    print title
    for (j = 2; j <= NR; ++j) {
        line = label[j]
        for (i = 2; i <= length(col) + 1; ++i) {
            line = line "\t" percent(col[i][j], sum[i])
        }
        print line
    }
}

This works fine in GNU AWK (awk on Linux, gawk on BSD), but not in BSD AWK, where I get this error:

$ awk -f script.awk sample.txt
awk: syntax error at source line 7 source file script.awk
 context is
          sum[i] += >>>  col[i][ <<<
awk: illegal statement at source line 7 source file script.awk
awk: illegal statement at source line 7 source file script.awk

It seems the problem is with the multidimensional arrays. I'd like to make this script work in BSD AWK too, so it's more portable.

Is there a way to change this to make it work in BSD AWK?

janos

2 Answers


Try using the pseudo-2-dimensional form. Instead of

col[i][NR]

use

col[i,NR]

That is a 1-dimensional array whose key is the concatenated string i SUBSEP NR.
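To illustrate the point about keys, here is a minimal sketch, wrapped in a shell command so it can be run directly. The array name col follows the question; the variables same, present, key and part are names made up for this example. It stores one value with a comma subscript, looks it up again through the explicit i SUBSEP j concatenation, and splits the key back into its two parts.

```shell
# Sketch: a comma subscript is a single string key joined with SUBSEP.
out=$(awk 'BEGIN {
    col[2, 3] = 10                     # stored under the key "2" SUBSEP "3"
    same = (col[2 SUBSEP 3] == 10)     # the explicit concatenation finds the same element
    present = ((2, 3) in col)          # membership test on a comma subscript
    for (key in col) {
        split(key, part, SUBSEP)       # recover the two subscripts from the key
        print part[1], part[2], same, present
    }
}')
echo "$out"
```

This should print 2 3 1 1: the two recovered subscripts, followed by the results of the two checks.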

glenn jackman

@glenn's answer got me on the right path. It took a bit more work though:

  • Using col[i, NR] made dealing with the column titles troublesome. It helped a lot to stop buffering the column titles and print them immediately after reading them.
  • length(col) + 1 was no longer usable in the final loop condition. With a flat array, length(col) counts every stored cell rather than the number of columns, and each lookup of a missing col[i, j] creates that element, so the bound kept growing and the loops became infinite. As a workaround, I replaced length(col) + 1 with simply NF.
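The growing-array effect behind the second point can be reproduced in isolation. This is only a sketch: before, mid, after, found and x are names invented for the demonstration, and the elements are counted with a for-in loop so the example does not depend on length() accepting an array.

```shell
# Sketch: reading a missing element creates it; an "in" test does not.
out=$(awk 'BEGIN {
    col[2, 2] = 10
    before = 0; for (k in col) before++   # one element so far
    if ((3, 2) in col) found = 1          # membership test leaves the array alone
    mid = 0; for (k in col) mid++         # still one element
    x = col[3, 2]                         # plain read creates an empty col[3, 2]
    after = 0; for (k in col) after++     # now two elements
    print before, mid, after
}')
echo "$out"
```

The counts come out as 1 1 2, which is why a loop bound derived from the size of col can run away when the loop body reads col[i, j] past the real columns.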

Here's the final implementation, which now works in both the GNU and BSD versions of AWK:

function percent(value, total) {
    return sprintf("%.2f", 100 * value / total)
}
BEGIN { OFS = "\t" }
NR == 1 { gsub(/ +/, OFS); print }
NR != 1 {
    label[NR] = $1
    for (i = 2; i <= NF; ++i) {
        sum[i] += col[i, NR] = $i
    }
}
END {
    for (j = 2; j <= NR; ++j) {
        line = label[j]
        for (i = 2; i <= NF; ++i) {
            line = line OFS percent(col[i, j], sum[i])
        }
        print line
    }
}
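
One caveat with using NF in the END block: it reflects only the last record read, so a trailing blank line in the input would leave it at 0. A possible variant, sketched below under that assumption, remembers the column count from the header line and counts data rows itself. The names ncols and rows are introduced here for illustration and are not part of the script above; the input is the sample from the question plus a trailing empty line.

```shell
# Sketch: same calculation, but the column count is saved from the header
# line instead of being read from NF inside END.
out=$(printf 'ID     Sample1     Sample2      Sample3\nOne      10          0            5\nTwo      3           6            8\nThree    3           4            7\n\n' | awk '
function percent(value, total) {
    return sprintf("%.2f", 100 * value / total)
}
BEGIN { OFS = "\t" }
# Header: remember how many columns there are, then print it tab-separated.
NR == 1 { ncols = NF; gsub(/ +/, OFS); print; next }
# Data rows: skip empty lines, keep a separate row counter.
NF > 0 {
    label[++rows] = $1
    for (i = 2; i <= ncols; ++i) {
        col[i, rows] = $i
        sum[i] += $i
    }
}
END {
    for (j = 1; j <= rows; ++j) {
        line = label[j]
        for (i = 2; i <= ncols; ++i) {
            line = line OFS percent(col[i, j], sum[i])
        }
        print line
    }
}')
echo "$out"
```

Like the original, this still divides by the column total, so a column that sums to zero would need separate handling.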
janos