-2

I have a bunch of csv files, each is an 84 x 84 matrix of numbers. I'm trying to use awk to sum all of the cells (to generate a single number). So far all I've been able to come up with is the following, which can sum a single column at a time (for example, column 75), but not all columns together:

awk -F ',' '{sum += $75} END {print sum}' file_name.csv

Then, I would like to create a new csv file in the same directory, where each column is the sum of that column from the previous csv divided by the total sum generated by the previous awk command. So in other words, a csv with only 1 row, where each column has a number which is column sum/total sum.

Any help would be massively appreciated!

BennyD
  • 55
  • 7
  • What output do you want? One number, total sum? Or one row with sums of each columns and total sum? There are plenty of examples online of how to do this, of how to sum all numbers in a column and/or how to sum all numbers in a row. – thanasisp Jul 19 '20 at 12:56
  • Hi thank you! I'm hoping for the output to be a csv file, with 1 row and 84 columns, where each column is a single number, calculated as column sum/total sum. – BennyD Jul 19 '20 at 13:03

1 Answers1

2

If the required final output is simply a single line CSV file with the column sum divided by the total sum for each of the columns in the input, then this should do the job with a single awk command.

{
    for (i = 1; i <= NF; i++)
    {
        colsum[i] += $i
        totsum += $i
        if (NF > maxfld)
            maxfld = NF
    }
}
END {
    pad = ""
    for (i = 1; i <= maxfld; i++)
    {
        printf("%s%.2f", pad, colsum[i] / totsum)
        pad = ","
    }
    print ""
        
}

I'd store that in a file such as script.awk and run:

awk -F, -f script.awk data

Given sample input (8 rows and 8 columns) — a set of random numbers between 10 and 99:

34,98,18,16,62,86,21,37
39,10,62,33,81,16,70,36
23,23,56,16,86,82,30,74
18,10,42,46,99,93,83,76
90,10,76,50,12,24,13,96
11,40,89,92,31,71,25,90
87,82,33,24,32,25,98,17
86,50,70,33,93,30,98,67

the output is:

0.12,0.10,0.13,0.09,0.15,0.13,0.13,0.15

Clearly, you can tweak the format used to present the final values; I chose 2 decimal places, but you can choose any format you prefer.

The question asks about an 84x84 matrix; this code works with such matrices too. The matrices don't have to be square. The input data doesn't even have to have the same number of fields in each line. You could add validation to insist on either condition, or both. If you need the intermediate results files, you could arrange for this code to generate them too.

Jonathan Leffler
  • 730,956
  • 141
  • 904
  • 1,278