1

While implementing solution provided at How to optimize a group by statement in PIG latin? I found that all the rows with one null column is dropped, which is expected behavior in Pig. I was wondering if the below code would work?

A = B join by ( Bcol1 is null?'UNK',Bcol2 is null?'UNK',Bcol2 is null?999),
C join by ( Ccol1 is null?'UNK',Ccol2 is null?'UNK',Ccol2 is null?999)

I am getting some parsing error.

Community
  • 1
  • 1
Vivek
  • 657
  • 11
  • 14

1 Answers1

3

PIG is a data flow scripting language and adding extra FOREACH GENERATE which fix the null will not cause extra map reduce jobs.

B = foreach B generate ....., (Bcol1 is null) ? 'UNK' : Bcol1 as Bcol1, (Bcol2 is null) ? 'UNK' : Bcol2 as Bcol2, (Bcol3 is null) ? 999 : Bcol3;
C = foreach C generate ....., (Ccol1 is null) ? 'UNK' : Ccol1 as Ccol1, (Ccol2 is null) ? 'UNK' : Ccol2 as Ccol2, (Ccol3 is null) ? 999 : Ccol3;
A = join B by (Bcol1, Bcol2, Bcol3), C by (Ccol1, Ccol2, Ccol3);
alexeipab
  • 3,609
  • 14
  • 16
  • Thanks for responding Alexejpab, I did implement in the same way on my side, however the only downside is that it will essentially change the value from null to some deterministic value, which might present as value.. but I think since there is no NVL function in the join.. it is the way to go.. – Vivek Sep 28 '12 at 17:38
  • You can always do "(Bcol1 is null) ? 'UNK' : Bcol1 as JOIN_Bcol1", so you duplicate the column with altered null values and discard them after join – alexeipab Sep 29 '12 at 10:03