8

I would like to perform the equivalent of "keep all a in A where a.field == b.field for some b in B" in Apache Pig. I am implementing it like so,

AB_joined = JOIN A by field, B by field;
A2 = FOREACH AB_joined GENERATE A::field as field, A::field2 as field2, A::field3 as field3;

Enumerating all of A's entries is quite silly, and I would rather do something like,

A2 = FOREACH AB_joined GENERATE flatten(A);

However, this doesn't seem to work. Is there some other way I can do something equivalent without enumerating A's fields?

duckworthd
  • 14,679
  • 16
  • 53
  • 68

4 Answers4

5

This should work:

A2 = FOREACH AB_joined GENERATE $0..
ЯegDwight
  • 24,821
  • 10
  • 45
  • 52
Sateesh
  • 51
  • 1
  • 2
3

You can use COGROUP to keep the columns of A separate from columns of B. This is especially useful when A's schema is dynamic and you don't want your code to fail when A's schema changes.

AB = COGROUP A BY field, B BY field;

-- schema of AB will be:
-- {group, A:{all fields of A}, B:{all fields of B}}

A2 = FOREACH AB FLATTEN(A);

Hope this helps.

Gaurav Phapale
  • 979
  • 1
  • 8
  • 21
  • This is nice! However it generates field names with the A:: prefix. Do you know if there is a way to get rid of that? – Hari Menon May 05 '17 at 19:24
  • @HariShankar you can always project the elements using AS.... e.g. A2 = FOREACH AB FLATTEN(A) AS (a1:string, a2:long); – Gaurav Phapale May 24 '17 at 22:19
  • Yes I can do that but that involves rewriting all the fields which means the script needs to be changed whenever a new field is added :( – Hari Menon May 24 '17 at 22:38
2

Starting from at least pig 0.9.1 you can use either Star Expressions or Project-Range Expressions to select multiple fields from tuple. Read Pig Latin 0.15.0, Expressions chapter for details.

Here is my example which I made just to give you understanding.

-- A: {id: long, f1: int, f2: int, f3: int, f4: int}
-- B: {id: long, f5: int}

Let's join A & B and select only A's fields

AB = FOREACH (JOIN A BY id, B by id) GENERATE $0..$4;
--AB: {A::id: long, A::f1: int, A::f2: int, A::f3: int, A::f4: int}

or

BA = FOREACH (JOIN B BY id, A by id) GENERATE $2..;
--BA: {A::id: long, A::f1: int, A::f2: int, A::f3: int, A::f4: int}

selecting all fields using Star expression

AB = FOREACH (JOIN A BY id, B by id) GENERATE *;
--AB: {A::id: long, A::f1: int, A::f2: int, A::f3: int, A::f4: int, B::id: long, B::f5: int}

selecting all distinct fields (without B::id field) using Project-range expression

AB = FOREACH (JOIN A BY id, B by id) GENERATE $0..$4, f5;
--AB: {A::id: long, A::f1: int, A::f2: int, A::f3: int, A::f4: int, B::f5: int}

Sometimes it's really useful when you have tens of fields in one relation and only couple in another.

dezhik
  • 990
  • 10
  • 16
  • is there a way to dynamically get the end column count? I am trying to create a generic script that I can pass multiple schemas from shell script. so depending on the schema passed the columns will be different. how can I solve this? – Aaron Sep 28 '16 at 14:49
  • Not sure I understood you right. Do you want to get a number of columns in a relation or want to count all rows having not null value in the last column? – dezhik Sep 28 '16 at 16:15
  • I am passing schema at runtime to pig script based on the parameter that is passed during invoking of my shell script. so, it may be that during one run, that file passed has only 10 columns and in another run file that is passed has 30 columns.. how can I count the number of columns for a given schema? so I can put something like this AB = FOREACH (JOIN A BY id, B by id) GENERATE $0..$10, f5; – Aaron Sep 28 '16 at 18:30
  • I am passing schema at runtime to pig script based on the parameter that is passed during invoking of my shell script. so, it may be that during one run, that file passed has only 10 columns and in another run file that is passed has 30 columns.. how can I count the number of columns for a given schema? so I can put something like this AB = FOREACH (JOIN A BY id, B by id) GENERATE $0..$10, f5; AB = FOREACH (JOIN A BY id, B by id) GENERATE $0..$30, f5; – Aaron Sep 28 '16 at 18:35
  • I don't know the easy way to do this. Because you can't pass pre-calculated parameter into foreach range expression like `BA = FOREACH (JOIN B BY id, A by id) GENERATE $2..$aLength, f5;`, but you can [write your own UDF](https://pig.apache.org/docs/r0.15.0/udf.html) for filtering join result inside it. – dezhik Sep 28 '16 at 20:14
2

With Pig 12 and above, use PluckTuple: https://pig.apache.org/docs/r0.12.0/func.html#plucktuple.

AB_joined = JOIN A by field, B by field;
DEFINE pluck PluckTuple('A::');
A2 = FOREACH AB_joined generate FLATTEN(pluck(*));
Duke Dogs
  • 81
  • 5