Say I have a structure like

{1001, {{id=1001, count=20, key=a}, {id=1001, count=30, key=b}}}
{1002, {{id=1002, count=40, key=a}, {id=1001, count=50, key=b}}}

And I want to transform it into

{id=1001, a=20, b=30}
{id=1002, a=40, b=50}

What Pig commands can I use to do this?

Lucas
  • Could you give a schema for the structure you're trying to transform? I don't think you can nest a bag directly inside another bag unless the inner bag is enclosed in a tuple. – cyang Jul 31 '12 at 21:59

2 Answers

Not sure exactly what the format of your starting relation is, but to me it looks like (int, bag:{tuple:(int,int,chararray)})? If so, this should work:

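-- Unnest the bag: one row per inner tuple, keeping the outer id.
flattened = FOREACH x GENERATE $0 AS id, flatten($1) AS (idx:int, count:int, key:chararray);
-- Split the rows by key, then join the two halves back together on id.
a = FILTER flattened BY key == 'a';
b = FILTER flattened BY key == 'b';
joined = JOIN a BY id, b BY id;
result = FOREACH joined GENERATE a::id AS id, a::count AS a, b::count AS b;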
Joe K

It looks like you are pivoting, similar to Pivoting in Pig, except that you already have a bag of tuples. An inner join will be costly, as it causes extra MapReduce jobs. To do it fast, filter within a nested foreach instead. The modified code will look something like:

inpt = load '..../pig/bag_pivot.txt' as (id : int, b:bag{tuple:(id : int, count : int, key : chararray)});

result = foreach inpt {
    col1 = filter b by key == 'a';
    col2 = filter b by key == 'b';
    generate id, flatten(col1.count) as a, flatten(col2.count) as b;
};

Sample input data:

1001    {(1001,20,a),(1001,30,b)}
1002    {(1002,40,a),(1001,50,b)}

Output:

(1001,20,30)
(1002,40,50)
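
One caveat with the flatten version: flattening an empty bag produces no rows, so an id that lacks either key would drop out of the result. A minimal sketch of a variant follows, assuming at most one tuple per key for each id. It swaps flatten for the built-in MAX, which collapses each filtered bag to a single value and returns null for an empty bag. The load path `bag_pivot.txt` is a hypothetical local file holding the sample rows above, and this variant is an illustration rather than something tested against the original data.

```pig
-- Hypothetical local path; tab-separated, same sample rows as above.
inpt = load 'bag_pivot.txt' as (id:int, b:bag{t:tuple(id:int, count:int, key:chararray)});

result = foreach inpt {
    col1 = filter b by key == 'a';
    col2 = filter b by key == 'b';
    -- MAX reduces each filtered bag to one value (assumes one tuple per key).
    -- An empty bag yields null here instead of discarding the whole row.
    generate id, MAX(col1.count) as a, MAX(col2.count) as b;
};

dump result;
```

For the sample input, where every id has both keys, the result should match the output above. If a key can occur more than once per id, MAX silently keeps only the largest count, so the flatten version may be the better fit in that case.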
alexeipab