
I need a function on RDD, say 'isAllMatched', which takes a predicate as an argument and checks whether every element satisfies it. However, I don't want to scan all elements: if the predicate fails for any element, it should return false immediately. I also want this function to execute in parallel across the worker nodes. Here is the pseudocode:

 def isAllMatched[T : ClassTag](rdd: RDD[T])(pred: T => Boolean): Boolean = {
      for (ele <- rdd.elements) {    // pseudocode: RDD has no such element iterator
           if (!pred(ele)) return false
      }
      true
 }

Is this possible in Spark? Is there any built-in function to do that?

aks

2 Answers


I don't know of an existing RDD operation that does this directly, but you can implement your function like this:

def isAllMatched[T](rdd: RDD[T])(pred: T => Boolean): Boolean =
    rdd.filter(e => !pred(e)).isEmpty
Piotr Kalański
  • Yes, but this will scan all the elements. I want to break execution as soon as predicate fails. – aks Jun 19 '17 at 07:51
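The scan concern can be narrowed: `isEmpty` is built on `take(1)`, and inside each partition `filter` is evaluated lazily over an iterator, so a partition stops being consumed at its first failing element. An explicit way to express that per-partition short-circuit is `mapPartitions` with `Iterator.forall` (a sketch assuming Spark's RDD API, not tested against a cluster); the same logic is simulated below on plain Scala iterators, with `IsAllMatchedSim` and its `Seq[Iterator[T]]` stand-in for partitions being illustrative names, not Spark API:

```scala
// Spark-side sketch (assumption: standard RDD API):
//
//   def isAllMatched[T](rdd: RDD[T])(pred: T => Boolean): Boolean =
//     rdd.mapPartitions { it =>
//       if (it.forall(pred)) Iterator.empty else Iterator(false)
//     }.isEmpty
//
// Plain-Scala simulation of the per-partition logic:
object IsAllMatchedSim {
  // Both forall calls stop at the first failing element.
  def isAllMatched[T](partitions: Seq[Iterator[T]])(pred: T => Boolean): Boolean =
    partitions.forall(_.forall(pred))

  def main(args: Array[String]): Unit = {
    var evaluated = 0
    val parts = Seq(Iterator(1, 2, 3), Iterator(4, -5, 6, 7))
    val ok = isAllMatched(parts) { x => evaluated += 1; x > 0 }
    // Evaluation stops at -5: 5 of the 7 elements are touched.
    println(s"result=$ok evaluated=$evaluated")
  }
}
```

Within a partition the short-circuit is exact; across partitions, concurrently scheduled tasks may still run until the job is cancelled.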

Piotr's answer does what you asked for. `isEmpty` is implemented with `take(1)`, so Spark only needs to find a single element that fails the predicate: within each partition the `filter` is evaluated lazily over an iterator, so a task stops consuming its partition as soon as one failing element is produced, and `false` is returned to the driver. Tasks still running on other nodes for that job are then cancelled, though tasks already scheduled in parallel may have done some work before cancellation.
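The laziness claim can be demonstrated without Spark: checking whether a filtered iterator is non-empty consumes elements only up to the first match. The counter-instrumented helper below is an illustrative sketch (`LazyFilterDemo` and `firstFailureCount` are made-up names, not Spark API), showing the same mechanism Spark uses inside a partition:

```scala
// Demonstrates that Iterator.filter is lazy: asking whether any element
// fails the predicate evaluates elements only up to the first failure.
object LazyFilterDemo {
  def firstFailureCount[T](elems: Iterator[T])(pred: T => Boolean): Int = {
    var evaluated = 0
    val failures = elems.filter { e => evaluated += 1; !pred(e) }
    failures.hasNext // forces evaluation only until the first failing element
    evaluated
  }

  def main(args: Array[String]): Unit = {
    val n = firstFailureCount(Iterator(1, 2, -3, 4, 5))(_ > 0)
    // Stops at -3: only 3 of the 5 elements are evaluated.
    println(s"evaluated $n of 5 elements")
  }
}
```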