
I need to randomly sample k items from an RDD in Spark. I noticed that there is the takeSample method. Its signature is as follows.

takeSample(withReplacement: Boolean, num: Int, seed: Long = Utils.random.nextLong): Array[T] 

However, this does not return an RDD. There is another sampling method that does return an RDD: sample.

sample(withReplacement: Boolean, fraction: Double, seed: Long = Utils.random.nextLong): RDD[T]

I don't want to use takeSample because it does not return an RDD and pulls a significant amount of data back to the driver program (memory issues). So I went ahead and used the sample method, but I had to compute the fraction (percentage) myself, as follows.

val rdd = sc.textFile("some/path") //creates the rdd
val N = rdd.count() //total items in the rdd
val fraction = k / N.toDouble
val sampledRdd = rdd.sample(false, fraction, 67L)

The problem with this approach is that I may not get an RDD with exactly k items. For example, if we assume N = 10, then

  • k = 2, fraction = 20%, sampled items = 2
  • k = 3, fraction = 30%, sampled items = 3
  • and so on

But with N = 11, then

  • k = 2, fraction = 18.1818%, sampled items = ?
  • k = 3, fraction = 27.2727%, sampled items = ?

In the last example, for fraction = 18.1818%, how many items will be in the resulting RDD?
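A quick check I tried in a hypothetical spark-shell session (the toy data and seeds are made up) suggests the answer is simply "it varies from run to run":

// Hypothetical spark-shell check with N = 11 and k = 2: run sample a few
// times with different seeds and count the results.
val toyRdd = sc.parallelize(1 to 11)
val toyFraction = 2 / 11.0
val counts = (1L to 5L).map(seed => toyRdd.sample(false, toyFraction, seed).count())
// the counts hover around 2, but individual runs can return 0, 1, 2, 3, ...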

Also, this is what the documentation says about the fraction argument.

expected size of the sample as a fraction of this RDD's size 
 - without replacement: probability that each element is chosen; fraction must be [0, 1] 
 - with replacement: expected number of times each element is chosen; fraction must be greater than or equal to 0

Since I chose sampling without replacement, it seems that my fraction should be computed as follows. Note that each item has an equal probability of being selected (which is what I'm trying to express).

val N = rdd.count()
val fraction = 1 / N.toDouble
val sampleRdd = rdd.sample(false, fraction, 67L)

So, should fraction be k / N or 1 / N? The documentation seems to point in different directions, mixing up sample size and sampling probability.
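If I take the docs' "probability that each element is chosen" at face value, then the expected sample size is N times the fraction, which would favor k / N over 1 / N. This is only my back-of-the-envelope check, not something the docs state explicitly:

// My reading of the docs (an assumption): without replacement, each of the
// N elements is kept independently with probability `fraction`.
val N = 11
val k = 3
val fractionKN = k / N.toDouble // 0.2727..., expected sample size N * fractionKN is about 3
val fraction1N = 1 / N.toDouble // 0.0909..., expected sample size N * fraction1N is 1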

And lastly, the documentation notes:

This is NOT guaranteed to provide exactly the fraction of the count of the given RDD.

Which then brings me back to my original question/concern: if the RDD API doesn't guarantee sampling exactly k items from an RDD, how do we do so efficiently?

As I was writing this post, I discovered there is already another SO post asking nearly the same question. I found the accepted answer unacceptable. Here, I also wanted to clarify the fraction argument.

I wonder if there is a way to do so using Datasets and DataFrames?

Jane Wayne

1 Answer


This solution is not very elegant, but I hope it helps with thinking about the problem. The trick is to attach a random score to each element and use the score ranked k (0-based, in ascending order) as a threshold: keeping everything strictly below that threshold leaves exactly k elements.

val k = 100
val rdd = sc.parallelize(0 until 1000)
// Attach a uniform random score to every element and cache it.
val rddWithScore = rdd.map(x => (x, Math.random))
rddWithScore.cache()
// The score ranked k (0-based, ascending) is the cut-off.
val threshold = rddWithScore.map(_._2)
  .sortBy(identity)
  .zipWithIndex()
  .filter(_._2 == k)
  .collect()
  .head._1
// Keep the k elements whose scores fall strictly below the cut-off.
val rddSample = rddWithScore.filter(_._2 < threshold).map(_._1)
rddSample.count()

The output would be

k: Int = 100
rdd: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[58] at parallelize at <console>:31
rddWithScore: org.apache.spark.rdd.RDD[(Int, Double)] = MapPartitionsRDD[59] at map at <console>:32
threshold: Double = 0.1180443408900893
rddSample: org.apache.spark.rdd.RDD[Int] = MapPartitionsRDD[69] at map at <console>:40
res10: Long = 100
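
For reference, here is one way the trick above could be packaged into a reusable helper. This is a rough, untested sketch of the same idea (not part of the original code); it assumes k is smaller than the RDD's size and ignores the vanishingly unlikely case of tied scores.

import org.apache.spark.rdd.RDD
import scala.reflect.ClassTag

def sampleExactlyK[T: ClassTag](rdd: RDD[T], k: Long): RDD[T] = {
  // Attach a uniform random score to every element and cache it so the scores
  // stay consistent across the two passes (as long as cached partitions survive).
  val scored = rdd.map(x => (x, Math.random)).cache()
  // The score ranked k (0-based, ascending) is the cut-off.
  val threshold = scored.map(_._2)
    .sortBy(identity)
    .zipWithIndex()
    .filter(_._2 == k)
    .collect()
    .head._1
  // Exactly the k elements with the smallest scores survive.
  scored.filter(_._2 < threshold).map(_._1)
}

// usage: sampleExactlyK(sc.parallelize(0 until 1000), 100L).count()  // 100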
Mo Tao