Suppose I have a bag NEW that contains many pairs (A, B):
Pair 1: { "A" : { "long" : someInteger1 }, "B" : { "int" : someInteger2 } }
Pair 2: { "A" : { "long" : someInteger3 }, "B" : { "int" : someInteger4 } }
......
I have another bag OLD, which is almost identical to the first bag (it may have a few missing, different, or extra pairs), and I want to compare OLD and NEW by counting how many pairs are the same in both bags. There may be multiple pairs (A, B) within a bag that have the same A or the same B.
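To make the count I am after concrete, here is a small sketch in Python (not Pig) of what I mean by "the same in both bags". The sample values and the function name `count_common_pairs` are made up for illustration. It assumes that duplicate pairs should be matched one-to-one, which may or may not be the right way to treat duplicates.

```python
from collections import Counter

def count_common_pairs(old_pairs, new_pairs):
    """Count pairs (A, B) that occur in both bags, matching duplicates one-to-one."""
    old_counts = Counter(old_pairs)
    new_counts = Counter(new_pairs)
    # Counter & Counter keeps the minimum count per key (multiset intersection).
    common = old_counts & new_counts
    return sum(common.values())

# Hypothetical sample data: each tuple is one (A, B) pair.
new_bag = [(1, 10), (1, 11), (2, 20), (3, 30)]
old_bag = [(1, 10), (1, 12), (2, 20), (4, 40)]

print(count_common_pairs(old_bag, new_bag))  # prints 2: (1, 10) and (2, 20)
```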
Things I have already tried using Pig:
- Joining OLD and NEW on a hash code generated from A and B and counting how many have both A and B matching. The count is only about half what I expect.
- Joining OLD and NEW on (A, B) and counting how many results there are. The count is only about half what I expect (the same result as the first attempt).
- Joining OLD and NEW on A and counting how many have B matching. For some reason, the joined result seems to have weird duplicates:
Result 1: { "A_new" : { "long" : someInteger1 }, "B_new" : { "int" : someInteger2 }, "A_old" : { "long" : someInteger1 }, "B_old" : { "int" : someInteger2 } }
Result 2: { "A_new" : { "long" : someInteger1 }, "B_new" : { "int" : someInteger2 }, "A_old" : { "long" : someInteger1 }, "B_old" : { "int" : someInteger3 } }
Result 3: { "A_new" : { "long" : someInteger1 }, "B_new" : { "int" : someInteger2 }, "A_old" : { "long" : someInteger1 }, "B_old" : { "int" : someInteger4 } }
Result 4: { "A_new" : { "long" : someInteger1 }, "B_new" : { "int" : someInteger2 }, "A_old" : { "long" : someInteger1 }, "B_old" : { "int" : someInteger5 } }