0

I'd like to use gawk to group some data and do calculations on a csv file.

Raw sample data:

2600,AEIOU-2600,stack,2,04/01/2015,C C S,S,10.65
2600,AEIOU-2600,stack,3,04/20/2015,C C R,S,100
2600,AEIOU-2600,stack,1,04/28/2015,C C R,S,1.07
2600,AEIOU-2600,stack,4,04/29/2015,C C R,S,200
2601,"over, L.P. - 00001",stack,0,04/01/2015,C C S,s,50
2601,"over, L.P. - 00001",stack,1,04/01/2015,C C S,s,16.43
2601,"over, L.P. - 00001",stack,2,04/10/2015,D C S,s,17.16
2602,UEIA,stack,2,04/19/2015,C C S,s,500
2602,UEIA,stack,2,04/20/2015,C C S,s,50
2602,UEIA,stack,1,04/28/2015,C C S,s,10
2602,UEIA,stack,2,04/28/2015,C C S,s,30
2602,UEIA,stack,2,04/29/2015,C C S,s,40
2603,EDM,stack,1,04/01/2015,A,S,100
2603,EDM,stack,1,04/03/2015,A,S,100
2603,EDM,stack,1,04/04/2015,A,S,300
2603,EDM,stack,1,04/05/2015,A,S,600
2603,EDM,stack,1,04/06/2015,A,S,50

Field $1: numeric

Field $2: name

Field $3: account

Field $4: invoice number

Field $5: date

Field $6: type

Field $7: state

Field $8: amount

Each row in the csv file needs to be grouped by $1, but also all like fields in field $6 need to be added together only if they are NOT C C R or A.

Expected output:

2600,AEIOU-2600,C C R,3
2600,AEIOU-2600,C C S,1,10.65
2601,"over, L.P. - 00001",C C S,2,66.43
2601,"over, L.P. - 000001",D C S,1,17.16
2602,UEIA,C C S,5,630
2603,EDM,A,4

As wee see, all the like types are grouped, counted, and added together. The only exception is if the type is A or C C R, those are just counted.

Field $1: numeric

Field $2: name

Field $3: type

Field $4: count of type

Field $5: addition of types

I'll have to use gawk because some of the name fields contain double quotes.

I have this but it doesn't group by $1 AND $6

#!/usr/local/bin/gawk -f

    BEGIN {
    FPAT = "\"[^\"]*\"|[^,]*"

    OFS = ","
    }
    NR > 1 {
        arr[$1 OFS $2 OFS $6 OFS $8]++
    }
    END {
        for (key in arr)
            print key, arr[key]
    }

Output:

2600,AEIOU-2600,C C R,1.07,1
2600,AEIOU-2600,C C R,100,1
2600,AEIOU-2600,C C R,200,1
2601,"over, L.P. - 00001",C C S,16.43,1
2601,"over, L.P. - 00001",C C S,50,1
2601,"over, L.P. - 00001",D C S,17.16,1
2602,UEIA,C C S,10,1
2602,UEIA,C C S,30,1
2602,UEIA,C C S,40,1
2602,UEIA,C C S,50,1
2602,UEIA,C C S,500,1
2603,EDM,A,100,2
2603,EDM,A,300,1
2603,EDM,A,50,1
2603,EDM,A,600,1
Registered User
  • 255
  • 4
  • 12
  • Hi Ed, okay, I'll do that. – Registered User May 05 '15 at 15:44
  • Hmm, this is (more than) basically a duplicate. (I know Ed suggested to ask a new question, I wouldn't had done so). On the other hand it is far too localized. Too localized means that the question itself + possible answers are usually interesting for you only. In that case it would be a misuse of this site, since we are not a code writing service. I expect you are working in the industry, so you got paid for the results of this question. Why should we do the work? – hek2mgl May 05 '15 at 15:49
  • Hi hek2mgl, I think this may be interesting to others as it asks how to group by more than one column and sum some data together. I'm not getting paid to do this, but I understand if I'm abusing SO and its wonderful contributors. – Registered User May 05 '15 at 15:57
  • RegisteredUser - if you want to feel better, I can send you my PayPal account info :-). @hek2mgl I suggested a new question as I could not figure out the new requirements from the old one after it was edited, the problem did sound quite different to the original, and a new question has a much better chance of other people looking at it than an edited old question. – Ed Morton May 05 '15 at 16:03
  • @EdMorton Basically this is the result of a question which is basically a close candidate. However, I don't care too much on this, it just felt weird to read almost the same question as in the morning again. – hek2mgl May 05 '15 at 16:06
  • 2
    @RegisteredUser Just another tipp. If your question is just about grouping in general, note that guys who answer such questions really like examples containing words out of `foo,bar,hello,world,1,2,3` while something like `AEIOU-2600,C C R,1.07` sucks! .. If you want the question to be more interesting for others I would encourage you to simplify your questions in that way. But be careful! Don't simplify which can't be simplified, meaning don't remove essential information. – hek2mgl May 05 '15 at 16:10
  • 1
    @hek2mgl the problem is the OP followed one piece of the advice I gave, post a new question, but didn't follow the other which was to put his previous question back to the way it started. Had he done that you would not have felt like you were looking at the same question twice in a day as the previous question was originally quite different. – Ed Morton May 05 '15 at 16:13

1 Answers1

3
$ cat tst.awk
BEGIN { FPAT="([^,]*)|(\"[^\"]+\")"; OFS="," }
{
    cnt[$1][$6]++
    sum[$1][$6]+=$NF
    name[$1][$6]=$2
}
END {
    for (numeric in cnt) {
        for (type in cnt[numeric]) {
            print numeric, name[numeric][type], type, cnt[numeric][type] (type ~ /^(C C R|A)$/? "" : OFS sum[numeric][type])
        }
    }
}

$ awk -f tst.awk file
2600,AEIOU-2600,C C R,3
2600,AEIOU-2600,C C S,1,10.65
2601,"over, L.P. - 00001",D C S,1,17.16
2601,"over, L.P. - 00001",C C S,2,66.43
2602,UEIA,C C S,5,630
2603,EDM,A,5
Ed Morton
  • 188,023
  • 17
  • 78
  • 185
  • Hi Ed, could you break it down and tell me what's going on in it? Are those arrays and then printing it out? – Registered User May 05 '15 at 16:06
  • 1
    `cnt`, `sum`, and `name`, are 2-D arrays indexed by $1 then $6 and in the END I'm just looping through every $1 then the associated $6 and printing the array contents for each combination. See http://www.gnu.org/software/gawk/manual/gawk.html#Arrays-of-Arrays – Ed Morton May 05 '15 at 16:08
  • what's your PP address? – Registered User May 05 '15 at 19:18
  • 1
    I was kidding, no payment required. Thanks for the thought though :-). – Ed Morton May 05 '15 at 19:19
  • Turns out I need to have the state included. Do you see any problems with this? http://paste2.org/AYYVf8aF – Registered User May 06 '15 at 04:32
  • It'll work IF there's a space at the end of the `C C R|A` fields, otherwise get rid of the space before `$` when testing that field. Why would you do the outer loop on the first-level indices of the `cnt` array and then do the inner loop on the second-level indices of `state[]` instead of `cnt[]` again though? – Ed Morton May 06 '15 at 12:10
  • Let us [continue this discussion in chat](http://chat.stackoverflow.com/rooms/77100/discussion-between-registered-user-and-ed-morton). – Registered User May 06 '15 at 14:49