
I have a set of integer pairs, where each pair describes a connection from the first integer to the second. For example:

1,2
3,4
5,6
5,7
6,8

I then load my data as follows, and group it:

    data = load 'data.csv' using PigStorage(',') as (integer_1:int, integer_2:int);
    grouped = group data by integer_1;
    grouped_numbers = foreach grouped generate group as node, data.integer_2 as connection;

This yields, for each first integer, a bag of its first-degree connections:

(1,{(2)})
(3,{(4)})
(5,{(6),(7)})
(6,{(8)})

I would then like to do a self-join of the grouped_numbers bag, in order to give the resultant first integer with each of its first- and second-degree connections. In this case, that would be:

(1,{(2)})
(3,{(4)})
(5,{(6),(7),(8)})
(6,{(8)})

because 5 is connected to 6, which is connected to 8, so 8 is a second-degree connection of 5. How would I implement this in Pig?

orange1

1 Answer


First join:

    joined = join data1 by integer_2, data2 by integer_1;

where data1 and data2 are the same set (copies of data in this example).

Then group by the first field. The inner bag will contain all the connections to the 'group', possibly more than once, so you might need a distinct on the inner bags as well if you just want the unique elements.
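Putting the steps together, a minimal sketch might look like the following. It assumes the comma-separated data.csv from the question, loaded twice so the relation can be joined with itself. The aliases edges1, edges2, direct, indirect, all_pairs and uniq are names chosen for illustration. One addition beyond the steps above: an inner join on its own only produces the second-degree pairs, so the sketch unions them with the original first-degree pairs before grouping. It also applies distinct to the flat pairs before grouping rather than to the inner bags afterwards.

```pig
-- Load the same edge list under two aliases for the self-join.
edges1 = LOAD 'data.csv' USING PigStorage(',') AS (integer_1:int, integer_2:int);
edges2 = LOAD 'data.csv' USING PigStorage(',') AS (integer_1:int, integer_2:int);

-- Second-degree pairs: a -> b in edges1 and b -> c in edges2 gives a -> c.
joined = JOIN edges1 BY integer_2, edges2 BY integer_1;
indirect = FOREACH joined GENERATE edges1::integer_1 AS node, edges2::integer_2 AS connection;

-- First-degree pairs, projected to the same schema so they can be unioned.
direct = FOREACH edges1 GENERATE integer_1 AS node, integer_2 AS connection;

-- Combine both degrees and drop duplicate pairs before grouping.
all_pairs = UNION direct, indirect;
uniq = DISTINCT all_pairs;

-- One output tuple per node, with a bag of its first- and second-degree connections.
by_node = GROUP uniq BY node;
result = FOREACH by_node GENERATE group AS node, uniq.connection AS connections;

DUMP result;
```

On the sample data the only joined pair is (5,8), so the result should match the desired output in the question, although the order of elements within each bag is not guaranteed.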

(answered via a Pig mailing list)

orange1