
In PySpark I have an RDD composed of (key, value) pairs, where the keys are sequential integers and the values are floating point numbers.

I'd like to sample exactly one element from this RDD with probability proportional to value.

In a naive manner, this task can be accomplished as follows:

import numpy

pairs = myRDD.collect()       # now pairs is a list of (key, value) tuples
K, V = zip(*pairs)            # separate keys and values
V = numpy.array(V) / sum(V)   # normalise probabilities
extractedK = numpy.random.choice(K, size=1, replace=True, p=V)

What concerns me is the collect() operation which, as you might know, loads the entire list of tuples into memory and can be quite expensive. I'm aware of takeSample(), which is great when elements should be extracted uniformly, but what happens if elements should be extracted according to weighted probabilities?

Thanks!

AlessioX

2 Answers


Here is an algorithm I worked out to do this:

EXAMPLE PROBLEM

Assume we want to sample 10 items from an RDD on 3 partitions like this:

  • P1: ("A", 0.10), ("B", 0.10), ("C", 0.20)
  • P2: ("D", 0.25), ("E", 0.25)
  • P3: ("F", 0.10)

Here is the high-level algorithm:

INPUT: number of samples and a RDD of items (with weights)

OUTPUT: dataset sample on driver

  1. For each partition, calculate the total probability of sampling from the partition, and aggregate those values to the driver.
    • This would give the probability distribution: Prob(P1) = 0.40, Prob(P2) = 0.50, Prob(P3) = 0.10
  2. Generate a sample of the partition indexes to determine the number of elements to select from each partition (a quick sketch of this step follows the list).
    • A sample may look like this: [P1, P1, P1, P1, P2, P2, P2, P2, P2, P3]
    • This would give us 4 items from P1, 5 items from P2, and 1 item from P3.
  3. On each separate partition, we locally generate a sample of the needed size using only the elements on that partition:
    • On P1, we would sample 4 items with the (re-normalized) probability distribution: Prob(A) = 0.25, Prob(B) = 0.25, Prob(C) = 0.50. This could yield a sample such as [A, B, C, C].
    • On P2, we would sample 5 items with probability distribution: Prob(D) = 0.5, Prob(E) = 0.5. This could yield a sample such as [D,D,E,E,E]
    • On P3: sample 1 item with probability distribution: Prob(F) = 1.0; this would generate the sample [F]
  4. Collect the samples to the driver to yield your dataset sample [A,B,C,C,D,D,E,E,E,F].
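
For step 2, drawing the partition indexes one at a time is equivalent to a single multinomial draw over the partition probabilities. A minimal illustrative sketch in plain numpy, using the example probabilities above:

import numpy as np

# Prob(P1) = 0.40, Prob(P2) = 0.50, Prob(P3) = 0.10 from step 1
counts = np.random.multinomial(10, [0.40, 0.50, 0.10])
# counts could be array([4, 5, 1]): 4 items from P1, 5 from P2, 1 from P3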

Here is an implementation in Scala:

  case class Sample[T](weight: Double, obj: T)

  /*
   *  Obtain a sample of size `numSamples` from an RDD `ar` using a two-phase distributed sampling approach.
   */
  def sampleWeightedRDD[T: ClassTag](ar: RDD[Sample[T]], numSamples: Int)(implicit sc: SparkContext): Array[T] = {
    // 1. Get the total weight on each partition, then normalize the totals to a probability distribution
    val rawWeights = ar.mapPartitionsWithIndex { case (partitionIndex, iter) =>
      Iterator((partitionIndex, iter.map(_.weight).sum))
    }.collect()
    val Z = rawWeights.map(_._2).sum
    val partitionWeights = rawWeights.map { case (partitionIndex, weight) => (partitionIndex, weight / Z) }

    // 2. Sample partition indexes to decide how many samples to draw from each partition,
    //    and broadcast the resulting (partitionIndex -> count) map to the executors
    val samplesPerIndex = sc.broadcast(
      sample[Int](partitionWeights, numSamples).groupBy(identity).mapValues(_.size).toMap)

    // 3. On each partition, re-normalize the local weights, draw the number of samples
    //    assigned to that partition, and collect the (small) result to the driver
    ar.mapPartitionsWithIndex { case (partitionIndex, iter) =>
      val numSamplesForPartition = samplesPerIndex.value.getOrElse(partitionIndex, 0)
      val local = iter.map(x => (x.obj, x.weight)).toArray
      val localZ = local.map(_._2).sum
      val normalized = local.map { case (obj, weight) => (obj, weight / localZ) }
      sample(normalized, numSamplesForPartition).toIterator
    }.collect()
  }

This code uses a simple weighted sampling function, sample:

  // A very simple weighted sampling function: `dist` pairs elements with (already normalized) probabilities.
  def sample[T: ClassTag](dist: Array[(T, Double)], numSamples: Int): Array[T] = {

    // Sort by descending probability so the most likely elements are found first,
    // and build the cumulative distribution over the sorted probabilities
    val probs = dist.sortBy(-_._2)
    val cumulativeDist = probs.map(_._2).scanLeft(0.0)(_ + _).drop(1)

    // Index of the first cumulative value that reaches p (simple linear scan)
    def findElem(p: Double, cumulativeDist: Array[Double]): Int = {
      for (i <- 0 until cumulativeDist.size - 1)
        if (p <= cumulativeDist(i)) return i
      cumulativeDist.size - 1
    }

    Array.fill(numSamples)(scala.util.Random.nextDouble).map(p => probs(findElem(p, cumulativeDist))._1)
  }
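
Since the question is tagged PySpark, here is a rough sketch of the same two-phase idea in PySpark. It is untested, the names are purely illustrative, and it assumes the RDD holds (key, weight) pairs as in the question, with numpy available on the executors:

import numpy as np

def sample_weighted_rdd(pairs_rdd, num_samples):
    # 1. total weight per partition (only one number per partition reaches the driver)
    part_weights = pairs_rdd.mapPartitionsWithIndex(
        lambda idx, it: [(idx, sum(w for _, w in it))]).collect()
    indexes, weights = zip(*part_weights)
    probs = np.array(weights) / sum(weights)

    # 2. decide how many samples to draw from each partition
    drawn = np.random.choice(indexes, size=num_samples, p=probs)
    counts = {int(i): int(c) for i, c in zip(*np.unique(drawn, return_counts=True))}

    # 3. sample locally on each partition, then collect only the sampled keys
    def sample_partition(idx, it):
        n = counts.get(idx, 0)
        if n == 0:
            return iter([])
        keys, w = zip(*it)
        w = np.array(w) / sum(w)
        return iter(np.random.choice(keys, size=n, replace=True, p=w).tolist())

    return pairs_rdd.mapPartitionsWithIndex(sample_partition).collect()

Only the per-partition weight totals and the final sampled keys ever reach the driver, which is the point of the two-phase approach.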
anthonybell
  • This just blows up for me saying it can't serialize SparkContext. Is there some working sample somewhere on the internet? – Chris Olivier Oct 26 '18 at 00:00

This is basically doable, but you should really consider whether it makes sense to use Spark for this. If you need to draw random values, then you presumably need to do so many times over in a loop. Each iteration will require scanning through all the data (maybe more than once).

So fitting the data you need into memory and then drawing values from it randomly is almost certainly the right way to go. If your data is really too big to fit into memory, consider (a) only collecting the columns you need for this purpose and (b) whether your data can be binned in a way that makes sense.

Having said that, it is doable within Spark. Below is PySpark code to demonstrate the idea.

import random
import pyspark.sql.functions as F
from pyspark.sql.window import Window
# read some sample data (shown below)
df = spark.read.csv("prb.csv",sep='\t',inferSchema=True,header=True)
# find the sum of the value column
ss = df.groupBy().agg( F.sum("vl").alias("sum") ).collect()
# add a column to store the normalized values
q = df.withColumn("nrm_vl", (df["vl"] / ss[0].sum) )
w = Window.partitionBy().orderBy("nrm_vl")\
          .rowsBetween(Window.unboundedPreceding, Window.currentRow)
q = q.select("*", F.sum("nrm_vl").over(w).alias("cum_vl"))
q.show()
+---+---+-------------------+-------------------+
| ky| vl|             nrm_vl|             cum_vl|
+---+---+-------------------+-------------------+
|  2|0.8|0.07079646017699115|0.07079646017699115|
|  3|1.1|0.09734513274336283|0.16814159292035397|
|  4|1.7|0.15044247787610618| 0.3185840707964601|
|  0|3.2| 0.2831858407079646| 0.6017699115044247|
|  1|4.5| 0.3982300884955752| 0.9999999999999999|
+---+---+-------------------+-------------------+

def getRandVl(q):
    # choose a random number and find the first row whose
    # cumulative value exceeds it (an upper-bound search on
    # the cumulative distribution)
    chvl = q.where( q["cum_vl"] > random.random() ).groupBy().agg(
        F.min(q["cum_vl"]).alias("cum_vl") )
    return q.join(chvl, on="cum_vl", how="inner")
# initialize the result with one sample, then draw 30 more..
# this is already slow on a single machine.
cdf = getRandVl(q)
for i in range(0,30):
    x = getRandVl(q)
    # append this row. there's no reason to accumulate rows like
    # this (it's slow) except that it's convenient to count how
    # often each key was chosen, to check if this method works
    cdf = cdf.select(cdf.columns).union(x.select(cdf.columns))

# count how often we picked each key
cdf.groupBy("ky","vl").agg( F.count("*").alias("count") ).show()
+---+---+-----+                                                                 
| ky| vl|count|
+---+---+-----+
|  4|1.7|    4|
|  2|0.8|    1|
|  3|1.1|    3|
|  0|3.2|   11|
|  1|4.5|   12|
+---+---+-----+

I think these counts are reasonable given the values. I'd rather test it with many more samples, but it's too slow.
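
Since the original question only needs a single element, one draw can also be made directly against the cum_vl column built above. A minimal (untested) variation of the same idea:

import random
import pyspark.sql.functions as F

# pick the first row whose cumulative value exceeds a uniform random number
r = random.random()
picked = q.where(F.col("cum_vl") > r).orderBy("cum_vl").limit(1).select("ky", "vl")
picked.show()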

Corey