
I have two RDDs of LabeledPoints, Prediction1 and Prediction2. Each LabeledPoint has a value as its first element and a prediction as its second element. I want to check whether the first element of every LabeledPoint in Prediction1 is equal to the first element of the corresponding LabeledPoint in Prediction2. So something like this:

for each pair (p1, p2) in (Prediction1, Prediction2):
    if p1[0] != p2[0]:
        print 'Value unequal'
        break

Example:

Suppose the following is the RDD of LabeledPoints Prediction1:

[(1,2),(3,4),(5,6)]

Prediction2:

[(1,12),(3,13),(5,2)]

In the above example the first element of each LabeledPoint in Prediction1 (1, 3, 5) is equal to the first element of the corresponding LabeledPoint in Prediction2 (1, 3, 5). But if even one of them didn't match, I would want to stop the process, print that they don't match, and end.

How can I do that in PySpark?

  • Could you provide example input and expected output? Your description is rather vague and this pseudo-code doesn't make it better. – zero323 Jan 22 '16 at 09:30

1 Answer

Assuming that both RDDs have the same number of partitions and elements per partition you can simply zip and take:

prediction1 = sc.parallelize([(1, 2), (3, 4), (5, 6)])
prediction2 = sc.parallelize([(1, 12), (3, 13), (5, 2)])
prediction3 = sc.parallelize([(1, 0), (5, 0), (5, 0)])

def mismatch(rdd1, rdd2):
    def mismatch_(xy):
        # Each element of the zipped RDD is a pair of tuples;
        # compare only the first field of each
        (x1, _), (y1, _) = xy
        return x1 != y1
    # take(1) returns a non-empty list iff at least one mismatch exists
    return bool(rdd1.zip(rdd2).filter(mismatch_).take(1))

mismatch(prediction1, prediction2)
## False
mismatch(prediction1, prediction3)
## True

Since take is lazy, it should work more or less as you expect: once one mismatching pair has been found, no further partitions need to be processed. See Lazy foreach on a Spark RDD.

If the initial assumptions are not met, you can zip manually by combining zipWithIndex, a swap (lambda kv: (kv[1], kv[0])), and join, as sketched below.
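
For reference, a minimal sketch of that fallback, assuming both RDDs still contain the same number of elements overall (the helper name mismatch_by_index is mine, not part of the original answer):

def mismatch_by_index(rdd1, rdd2):
    # zipWithIndex gives (element, index); the swap turns it into (index, element)
    indexed1 = rdd1.zipWithIndex().map(lambda kv: (kv[1], kv[0]))
    indexed2 = rdd2.zipWithIndex().map(lambda kv: (kv[1], kv[0]))
    def mismatch_(ixy):
        # join yields (index, (element1, element2))
        _, ((x1, _), (y1, _)) = ixy
        return x1 != y1
    return bool(indexed1.join(indexed2).filter(mismatch_).take(1))

mismatch_by_index(prediction1, prediction3)
## True

Note that join shuffles both RDDs, so this is considerably more expensive than zip.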

  • So this will end as soon as the 1st mismatch occurs? – user2966197 Jan 22 '16 at 11:51
  • Not necessarily but should be close enough. – zero323 Jan 22 '16 at 11:52
  • The point is that I have huge data - each RDD has about 20 million LabeledPoints - and I don't want the process to keep running over all of them if the first mismatch occurs at, say, the 10th LabeledPoint. – user2966197 Jan 22 '16 at 11:57
  • Long story short, this is the best you can get :) It may process more data than strictly required, but there is not much you can do about it. If there is an early mismatch, it is smart enough to stop without processing all the data. – zero323 Jan 22 '16 at 12:02