
There is no isEmpty method on RDDs, so what is the most efficient way of testing whether an RDD is empty?

marios
Tobber

2 Answers


RDD.isEmpty() will be part of Spark 1.3.0.

Based on suggestions in this Apache mail thread, and later some comments on this answer, I have done some small local experiments. The best method is using take(1).length == 0.

def isEmpty[T](rdd : RDD[T]) = {
  rdd.take(1).length == 0 
}

It should run in O(1) except when the RDD is empty, in which case it is linear in the number of partitions.

Thanks to Josh Rosen and Nick Chammas for pointing me to this.

Note: this fails if the RDD is of type RDD[Nothing], e.g. isEmpty(sc.parallelize(Seq())), but this is unlikely to be a problem in real life. isEmpty(sc.parallelize(Seq[Any]())) works fine.
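As an aside, the RDD[Nothing] edge case comes from plain Scala type inference rather than from Spark itself. This Spark-free sketch shows the same inference at work:

```scala
// No Spark needed: an empty collection literal with no expected type is
// inferred with element type Nothing, which is exactly what
// sc.parallelize(Seq()) inherits. Fixing the element type explicitly
// avoids the problem.
val inferred = Seq()         // static type: Seq[Nothing]
val explicit = Seq[Any]()    // static type: Seq[Any]

// The take(1)-based emptiness check works fine on the explicitly typed one.
assert(explicit.take(1).isEmpty)
```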


Edits:

  • Edit 1: Added the take(1).length == 0 method, thanks to the comments.

My original suggestion: Use mapPartitions.

def isEmpty[T](rdd : RDD[T]) = {
  rdd.mapPartitions(it => Iterator(!it.hasNext)).reduce(_&&_) 
}

It scales with the number of partitions but is not nearly as clean as take(1). It is, however, robust to RDDs of type RDD[Nothing].
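The logic can be sketched without a Spark cluster by modeling an RDD as a sequence of partition iterators (a hypothetical model for illustration only, not Spark code):

```scala
// Model: one iterator per partition. Each partition reports whether it is
// empty (!it.hasNext), and the per-partition results are AND-ed together,
// mirroring mapPartitions(it => Iterator(!it.hasNext)).reduce(_ && _).
// Note that reduce throws on an empty input, in Spark (zero partitions)
// and in this model alike.
def isEmptyViaPartitions[T](partitions: Seq[Iterator[T]]): Boolean =
  partitions.map(it => !it.hasNext).reduce(_ && _)

assert(isEmptyViaPartitions(Seq(Iterator[Int](), Iterator[Int]())))  // empty
assert(!isEmptyViaPartitions(Seq(Iterator[Int](), Iterator(1, 2))))  // non-empty
```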


Experiments:

I used this code for the timings.

def time(n : Long, f : (RDD[Long]) => Boolean): Unit = {
  val start = System.currentTimeMillis()
  val rdd = sc.parallelize(1L to n, numSlices = 100)
  val result = f(rdd)
  println("Time: " + (System.currentTimeMillis() - start) + "   Result: " + result)
}

time(1000000000L, rdd => rdd.take(1).length == 0L)
time(1000000000L, rdd => rdd.mapPartitions(it => Iterator(!it.hasNext)).reduce(_&&_))
time(1000000000L, rdd => rdd.count() == 0L)
time(1000000000L, rdd => rdd.takeSample(true, 1).isEmpty)
time(1000000000L, rdd => rdd.fold(0)(_ + _) == 0L)

time(1L, rdd => rdd.take(1).length == 0L)
time(1L, rdd => rdd.mapPartitions(it => Iterator(!it.hasNext)).reduce(_&&_))
time(1L, rdd => rdd.count() == 0L)
time(1L, rdd => rdd.takeSample(true, 1).isEmpty)
time(1L, rdd => rdd.fold(0)(_ + _) == 0L)

time(0L, rdd => rdd.take(1).length == 0L)
time(0L, rdd => rdd.mapPartitions(it => Iterator(!it.hasNext)).reduce(_&&_))
time(0L, rdd => rdd.count() == 0L)
time(0L, rdd => rdd.takeSample(true, 1).isEmpty)
time(0L, rdd => rdd.fold(0)(_ + _) == 0L)

On my local machine with 3 worker cores I got these results:

Time:    21   Result: false
Time:    75   Result: false
Time:  8664   Result: false
Time: 18266   Result: false
Time: 23836   Result: false

Time:   113   Result: false
Time:   101   Result: false
Time:    68   Result: false
Time:   221   Result: false
Time:    46   Result: false

Time:    79   Result: true
Time:    93   Result: true
Time:    79   Result: true
Time:   100   Result: true
Time:    64   Result: true
Pandawan
Tobber
    Spark recently merged a pull request to add an `isEmpty` method to RDDs: https://github.com/apache/spark/pull/4074 – Josh Rosen Feb 11 '15 at 17:07
  • Good news. The pull request actually contained a bug. I've sent a fix in https://github.com/apache/spark/pull/4534 – Tobber Feb 11 '15 at 18:31
  • Tobber, isn't it easier and just as fast to do `.take(1)` on the RDD and see if the result is empty? – Nick Chammas Feb 11 '15 at 19:42
  • @NickChammas. In short yes. There is though a bug when your RDD is of type `RDD[Nothing]`. This is very much an edge-case however, since an `RDD[Nothing]` is practically useless. We actually have a discussion going on, on the second pull request. – Tobber Feb 11 '15 at 20:30
  • @Tobber what should be the equivalent in Java? – ben Sep 12 '19 at 07:24

As of Spark 1.3, isEmpty() is part of the RDD API. A bug that caused isEmpty to fail was later fixed in Spark 1.4.

For DataFrames you can do:

val df: DataFrame = ...
df.rdd.isEmpty()

Here is the code pasted straight from the RDD implementation (as of 1.4.1):

  /**
   * @note due to complications in the internal implementation, this method will raise an
   * exception if called on an RDD of `Nothing` or `Null`. This may be come up in practice
   * because, for example, the type of `parallelize(Seq())` is `RDD[Nothing]`.
   * (`parallelize(Seq())` should be avoided anyway in favor of `parallelize(Seq[T]())`.)
   * @return true if and only if the RDD contains no elements at all. Note that an RDD
   *         may be empty even when it has at least 1 partition.
   */
  def isEmpty(): Boolean = withScope {
    partitions.length == 0 || take(1).length == 0
  }
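For intuition, the short-circuit in that implementation can be modeled in plain Scala, again treating an RDD as a sequence of partitions (an illustrative model, not Spark code):

```scala
// partitions.length == 0 short-circuits before any data is touched;
// otherwise the lazy iterator only pulls as many partitions as needed
// to find a single element, mirroring take(1).
def isEmptyModel[T](partitions: Seq[Seq[T]]): Boolean =
  partitions.length == 0 || partitions.iterator.flatten.take(1).length == 0

assert(isEmptyModel(Seq.empty[Seq[Int]]))      // zero partitions
assert(isEmptyModel(Seq(Seq(), Seq())))        // only empty partitions
assert(!isEmptyModel(Seq(Seq(), Seq(42))))     // one element somewhere
```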
marios