
Say I have two Pig relations: p, with schema (id: int, companies: tuple(name:chararray)), and q, with schema (id: int, company: chararray).

After I join p and q on id, how do I filter out the rows where q::company is not present in p::companies?

PS: I went through the question "Check if an element is present in a bag?", but it does not seem to match my problem exactly.

Example

sample p

1,(c1 c2 c3)

2,(c4 c5 c6)

3,(c2 c3 c5)

sample q

1,c3

2,c8

3,c5

expected output after the joins

1,c3

3,c5

Roy

1 Answer

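First, you need to convert p so that every combination of ID and company name appears in its own row. Since companies.name holds the names as a single space-separated string, TOKENIZE splits it into a bag of names and FLATTEN turns each name into a separate row.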

-- Split the space-separated names and emit one (id, company) row per name.
p_flattened = FOREACH p GENERATE
    id,
    FLATTEN(TOKENIZE(companies.name, ' ')) AS company;
dump p_flattened;
(1,c1)
(1,c2)
(1,c3)
(2,c4)
(2,c5)
(2,c6)
(3,c2)
(3,c3)
(3,c5)

Then join with q on both id and company so that only the pairs present in both relations remain, and use a FOREACH to drop the duplicate fields.

-- Inner join on the composite key (id, company).
pq_joined = JOIN p_flattened BY (id, company), q BY (id, company);
-- Keep a single copy of each key column.
final = FOREACH pq_joined GENERATE
    q::id AS id,
    q::company AS company;

dump final;
(1,c3)
(3,c5)
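
To try the steps above on the sample data, p and q first have to be loaded with the schemas from the question. The following is only a minimal sketch: the file names p.txt and q.txt are hypothetical, and it assumes the sample rows are stored exactly as shown in the question, one per line with a comma between the fields.

```pig
-- Hypothetical input files; each line looks like the sample rows in the question.
-- p.txt:  1,(c1 c2 c3)
-- q.txt:  1,c3

-- PigStorage(',') splits each line on the comma; the parenthesised text is
-- then read as a tuple with one chararray field holding the whole name list.
p = LOAD 'p.txt' USING PigStorage(',')
    AS (id:int, companies:tuple(name:chararray));

q = LOAD 'q.txt' USING PigStorage(',')
    AS (id:int, company:chararray);

-- Check that the schemas match the ones assumed above before continuing.
DESCRIBE p;
DESCRIBE q;
```

This layout only works because the names inside the tuple are separated by spaces. If they were separated by commas, the delimiter would have to be something other than ',' (a tab, for example) so that the tuple is not split apart at load time.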
savagedata