
Suppose I have a bag NEW that contains many pairs (A, B):

Pair 1: { "A" : { "long" : someInteger1 }, "B" : { "int" : someInteger2 } }

Pair 2: { "A" : { "long" : someInteger3 }, "B" : { "int" : someInteger4 } }

......

I have another bag OLD, which is almost identical to the first bag (it may have a few missing, different, or extra pairs), and I want to compare OLD and NEW by counting how many pairs are the same in both bags. There may be multiple pairs (A, B) within a bag that have the same A or the same B.
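To pin down what "the same in both bags" means here, the following is a minimal sketch in plain Python (not Pig) that treats each bag as a multiset of (A, B) pairs and counts the multiset intersection. The sample values and the helper name `count_common_pairs` are hypothetical and only serve as an illustration.

```python
from collections import Counter

# Hypothetical sample data: each pair is (A, B).
new_pairs = [(1, 10), (1, 11), (2, 20), (3, 30)]
old_pairs = [(1, 10), (1, 12), (2, 20), (4, 40)]


def count_common_pairs(old, new):
    """Count pairs present in both bags, respecting duplicates."""
    # Counter & Counter keeps, for each pair, the smaller of the two counts,
    # which is the multiset intersection.
    common = Counter(old) & Counter(new)
    return sum(common.values())


common_count = count_common_pairs(old_pairs, new_pairs)
print(common_count)
```

With this definition, a pair that occurs twice in one bag but once in the other is counted once.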

Things I have already tried using Pig:

  1. Joining OLD and NEW on a hash code generated from A and B and counting how many have both A and B matching. The count is only about half what I expect.
  2. Joining OLD and NEW on (A, B) and counting how many results there are. The count is only about half what I expect (same as 1 above).
  3. Joining OLD and NEW on A and counting how many have B matching. For some reason, the joined result seems to contain unexpected duplicates:

    Result 1: { "A_new" : { "long" : someInteger1 }, "B_new" : { "int" : someInteger2 }, "A_old" : { "long" : someInteger1 }, "B_old" : { "int" : someInteger2 } }

    Result 2: { "A_new" : { "long" : someInteger1 }, "B_new" : { "int" : someInteger2 }, "A_old" : { "long" : someInteger1 }, "B_old" : { "int" : someInteger3 } }

    Result 3: { "A_new" : { "long" : someInteger1 }, "B_new" : { "int" : someInteger2 }, "A_old" : { "long" : someInteger1 }, "B_old" : { "int" : someInteger4 } }

    Result 4: { "A_new" : { "long" : someInteger1 }, "B_new" : { "int" : someInteger2 }, "A_old" : { "long" : someInteger1 }, "B_old" : { "int" : someInteger5 } }
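One possible reading of the rows above, assuming OLD contains several pairs that share the same A (which the question allows): a join on A alone pairs every NEW row with every OLD row that has that A, so a single NEW pair can appear several times in the output. The sketch below mimics that behaviour in plain Python with hypothetical values standing in for someInteger1 through someInteger5; `join_on_a` is an illustrative helper, not a Pig function.

```python
def join_on_a(new, old):
    """Inner join on A: one output row per (new row, old row) with equal A."""
    return [
        (a_new, b_new, a_old, b_old)
        for (a_new, b_new) in new
        for (a_old, b_old) in old
        if a_new == a_old
    ]


# Hypothetical values: A = 1 occurs once in NEW and four times in OLD.
new_rows = [(1, 2)]
old_rows = [(1, 2), (1, 3), (1, 4), (1, 5)]

joined = join_on_a(new_rows, old_rows)

# Only the rows where B also agrees are real matches.
matching = [row for row in joined if row[1] == row[3]]

print(len(joined), len(matching))
```

Here the join yields four rows for one NEW pair, but only one of them has matching B values, so the filter on B has to be applied after the join before counting.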

Brian Schmitz

1 Answer


DataFu has an awesome library of Pig UDFs that you could use. I think SetDifference() is what you are looking for.
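As a rough illustration of the set operations involved, here is a sketch in plain Python rather than Pig or DataFu itself, with hypothetical sample data. It assumes duplicates are collapsed, so each distinct (A, B) pair counts once. The difference shows which pairs exist on only one side, while the intersection gives the number of pairs the two bags share.

```python
# Hypothetical sample data: each pair is (A, B).
new_bag = [(1, 10), (1, 11), (2, 20), (3, 30)]
old_bag = [(1, 10), (1, 12), (2, 20), (4, 40)]

new_set = set(new_bag)
old_set = set(old_bag)

only_in_new = new_set - old_set  # pairs added or changed in NEW
only_in_old = old_set - new_set  # pairs missing from or changed in NEW
in_both = new_set & old_set      # pairs that are identical in both bags

print(len(only_in_new), len(only_in_old), len(in_both))
```

If only the count of shared pairs is needed, it can also be derived from a difference: the number of distinct pairs in NEW minus the size of `only_in_new`.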

o-90