
I would like to split CSV files in Unix and run aggregates on some columns. If possible, I want to group by several columns on each of the split-up files using awk.

Does anyone know some unix magic that can do this?

Here is a sample file:

customer_id,location,house_hold_type,employed,income
123,Florida,Head,true,100000
124,NJ,NoHead,false,0
125,Florida,NoHead,true,120000
126,Florida,Head,true,72000
127,NJ,Head,false,0

I want to get counts grouped by location and by house_hold_type, as well as AVG(income) for the same group-by conditions.

How can I split a file and run awk with this?

This is the output I expect. The format could be different; this is just the overall data structure I am expecting, and I will humbly accept other ways of presenting the information:

location:[counts:['Florida':3, 'NJ':2], income_avgs:['Florida':97333, 'NJ':0]]
house_hold_type:[counts:['Head':3, 'NoHead':2], income_avgs:['Head':57333, 'NoHead':60000]]

Thank you in advance.

Horse Voice

  • Can you show what output you expect please? – Mark Setchell Apr 22 '14 at 21:16
  • Another thing to think about - if you want SQL behavior, why not try loading your data into a temporary table and really use SQL? For example, Postgres is decent about auto-typing columns. I don't see any dates in your input to massage and your booleans are already formatted as `true`/`false`. All you'd really need to do is quote the strings then provide the enclosing SQL to make the temporary table. (A minimal sketch of this route follows the comment thread below.) – n0741337 Apr 22 '14 at 21:40
  • You may also want to try the `R` programming language. The one use case at which it really excels (no pun intended) is at loading tables of data and manipulating them in every way imaginable. The `tapply` function, for example, could be used to do exactly what you want here. – dg99 Apr 22 '14 at 23:39
  • Thank you. I have updated the example to provide my expectation of what this script should put out. Loading into a database table would require additional I/O and slow things down; I plan to run this script dynamically while streaming the data. – Horse Voice Apr 23 '14 at 01:37
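
For reference, a minimal sketch of the SQL route suggested by n0741337, assuming a local Postgres with psql available (the table name customers and the file name input.csv are illustrative, not from the question):

psql <<'SQL'
CREATE TEMPORARY TABLE customers (
    customer_id     int,
    location        text,
    house_hold_type text,
    employed        boolean,
    income          numeric
);
-- \copy runs client-side, so the CSV only needs to be readable by psql
\copy customers FROM 'input.csv' WITH (FORMAT csv, HEADER true)
SELECT location, house_hold_type, count(*) AS count, avg(income) AS avg_income
FROM customers
GROUP BY location, house_hold_type;
SQL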

1 Answer


awk deals best with columns of data, so the input format is fine. The nested output format you show could be produced, but it is much simpler to output columns as well:

#set the input and output field separators to comma
BEGIN {
    FS = ",";
    OFS = FS;
}

#skip the header row
NR == 1 {
    next;
}

#for all remaining rows, store counters and sums for each group
{
    count[$2,$3]++;
    sum[$2,$3] += $5;
}

#after all data, display the aggregates
END {
    print "location", "house_hold_type", "count", "avg_income";
    #for every key we encountered
    for(i in count) {
        #split the key back into "location" and "house_hold_type"
        split(i,a,SUBSEP);
        print a[1], a[2], count[i], sum[i] / count[i];
    }
}
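
To run it, save the program to a file (the name aggregate.awk below is just an illustration) and point awk at the CSV:

awk -f aggregate.awk input.csv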

Sample input:

customer_id,location,house_hold_type,employed,income
123,Florida,Head,true,100000
124,NJ,NoHead,false,0
125,Florida,NoHead,true,120000
126,Florida,Head,true,72000
127,NJ,Head,false,0

and output:

location,house_hold_type,count,avg_income
Florida,Head,2,86000
Florida,NoHead,1,120000
NJ,NoHead,1,0
NJ,Head,1,0
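
As for the splitting part of the question, here is one sketch, assuming GNU coreutils and the illustrative file names input.csv and aggregate.awk from above. The header is saved and re-attached to each chunk so the NR == 1 skip rule keeps working, and each chunk is aggregated independently, matching the per-split-file aggregates the question asks for:

# keep the header so every chunk can go through the same awk program
head -n 1 input.csv > header.csv

# split the data rows into chunks of 1,000,000 lines each
# (the chunk size and the chunk_ prefix are arbitrary choices)
tail -n +2 input.csv | split -l 1000000 - chunk_

# aggregate each chunk on its own
for f in chunk_*; do
    cat header.csv "$f" | awk -f aggregate.awk
done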
laindir