2

I have a data set that looks like this:

foo,R
foo,Y
bar,C
foo,R
baz,Y
foo,R
baz,Y
baz,R
...

I'd like to generate a report that sums up the number of 'R', 'Y' and 'C' records for each unique value in the first column. For this data set, it would look like:

foo,3,1,0
bar,0,0,1
baz,1,2,0

Where the 2nd column is the number of 'R' records, the third is the number of 'Y' records and the last is the number of 'C' records.

I know I can first filter by record type, group and aggregate, but that leads to an expensive join of the three sub-reports. I would much rather group once and GENERATE each of the {R, Y, C} columns in my group.

How can I convert the Boolean result of comparing the second column in my data set to 'R', 'Y' or 'C' to a numeric value I can aggregate? Ideally I want 1 for a match and 0 for a non-match for each of the three columns.

AndreiM
  • 4,558
  • 4
  • 36
  • 50

1 Answers1

2

Apache PIG is perfectly adapted for such type of problems. It can be solved with one GROUP BY and one nested FOREACH

inpt = load '~/pig/data/group_pivot.csv' using PigStorage(',') as (val : chararray, cat : chararray);
grp = group inpt by (val);
final = foreach grp {
    rBag = filter inpt by cat == 'R';
    yBag = filter inpt by cat == 'Y';
    cBag = filter inpt by cat == 'C';
    generate flatten(group) as val, SIZE(rBag) as R, SIZE(yBag) as Y, SIZE(cBag) as C;
};

dump final;
--(bar,0,0,1)
--(baz,1,2,0)
--(foo,3,1,0)

bool = foreach final generate val, (R == 0 ? 0 : 1) as R, (Y == 0 ? 0 : 1) as Y, (C == 0 ? 0 : 1) as C;

dump bool;
--(bar,0,0,1)
--(baz,1,1,0)
--(foo,1,1,0)

I have tried it on your example and got the expected result. The idea is that after GROUP BY each value has a BAG that contains all rows with R, Y, C categories. Using FILTER within FOREACH we create 3 separate BAGs (one per R, Y, C) and SIZE(bag) in GENERATE counts the number of rows in each bag.

The only problem you might encounter is when there are too many rows with the same value in val column, as nested FOREACH relies on in memory operations and resulting intermidiate BAGs could get quite large. If you start getting memory related exceptions, then you can inspire from How to handle spill memory in pig. The idea would be to use 2 GROUP BY operations, first one to get counts per (val, cat) and second to pivot R, Y, C around val, thus avoiding expensive JOIN operation (see Pivoting in Pig).

Regarding the question with BOOLEAN: I have used bincond operator. If you do not need the counts, you could use IsEmpty(bag) instead of SIZE(bag), it would be slightly faster and bincond to get your 0 and 1 conversions.

Community
  • 1
  • 1
alexeipab
  • 3,609
  • 14
  • 16