Combining tuples based on a field?

Question

Say I have a structure like

{1001, {{id=1001, count=20, key=a}, {id=1001, count=30, key=b}}}
{1002, {{id=1002, count=40, key=a}, {id=1001, count=50, key=b}}}

And I want to transform it into

{id=1001, a=20, b=30}
{id=1002, a=40, b=50}

What Pig commands can I use to do this?

Could you give a schema for the structure you're trying to transform? I don't think you can nest a bag directly inside another bag unless the inner bag is enclosed in a tuple. — cyang, Jul 31 '12 at 21:59

score 1 · Answer 1 · answered Jul 31 '12 at 22:52

Not sure exactly what the format of your starting relation is, but to me it looks like (int, bag:{tuple:(int,int,chararray)})? If so, and assuming the relation is called x, this should work:

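-- Unnest the bag: one row per inner tuple. The inner id is named idx, which keeps it distinct from the outer id.
flattened = FOREACH x GENERATE $0 AS id, flatten($1) AS (idx:int, count:int, key:chararray);
-- Split the rows by key.
a = FILTER flattened BY key == 'a';
b = FILTER flattened BY key == 'b';
-- Pair each id's 'a' row with its 'b' row, then keep only the counts.
joined = JOIN a BY id, b BY id;
result = FOREACH joined GENERATE a::id AS id, a::count AS a, b::count AS b;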

score 1 · Answer 2 · edited May 23 '17 at 11:48

It looks like you are pivoting, similar to Pivoting in Pig. But you already have a bag of tuples, so an inner join is unnecessarily costly: it adds extra MapReduce jobs. It is faster to filter inside a nested foreach. The modified code looks something like this:

inpt = load '..../pig/bag_pivot.txt' as (id : int, b:bag{tuple:(id : int, count : int, key : chararray)});

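result = foreach inpt {
    -- Filter the bag per key inside the foreach, so no join is needed.
    col1 = filter b by key == 'a';
    col2 = filter b by key == 'b';
    -- Flatten each filtered bag of counts into its own column.
    generate id, flatten(col1.count) as a, flatten(col2.count) as b;
};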

Sample input data:

1001    {(1001,20,a),(1001,30,b)}
1002    {(1002,40,a),(1001,50,b)}

Output:

(1001,20,30)
(1002,40,50)
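
One caveat with the nested foreach: Pig's flatten emits no row when the bag is empty, so an id that lacks an 'a' or a 'b' tuple drops out of the result. Below is a minimal, untested sketch of one way to keep such rows. It assumes the same inpt relation as above; the names bags, a_bag and b_bag and the default value 0 are made up for illustration.

```pig
-- Step 1: keep the filtered counts as bags instead of flattening them right away.
bags = foreach inpt {
    col1 = filter b by key == 'a';
    col2 = filter b by key == 'b';
    generate id, col1.count as a_bag, col2.count as b_bag;
};

-- Step 2: swap an empty bag for a one-tuple bag holding a default (0 is an
-- arbitrary placeholder), so flatten always has something to emit.
result = foreach bags generate id,
    flatten((IsEmpty(a_bag) ? {(0)} : a_bag)) as a,
    flatten((IsEmpty(b_bag) ? {(0)} : b_bag)) as b;
```

This still assumes at most one tuple per key for each id. If a key can occur more than once, flattening both bags yields every combination of the matching counts.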
