6

I have an unbalanced DataFrame in Spark, using PySpark. I want to resample it to make it balanced, but I can only find the sample function in PySpark:

sample(withReplacement, fraction, seed=None)

but I want to sample the DataFrame weighted by unitvolume. In Python (pandas) I can do it like

df.sample(n, False, weights=log(unitvolume))

Is there any method to do the same using PySpark?

Alper t. Turker
Xin Chang

3 Answers

3

Spark provides tools for stratified sampling, but these work only on categorical data. You could try to bucketize it:

from pyspark.ml.feature import Bucketizer
from pyspark.sql.functions import col, log

df_log = df.withColumn("log_unitvolume", log(col("unitvolume")))
splits = ... # A list of splits

bucketizer = Bucketizer(splits=splits, inputCol="log_unitvolume", outputCol="bucketed_log_unitvolume")

df_log_bucketed = bucketizer.transform(df_log)

Compute statistics:

counts = df_log_bucketed.groupBy("bucketed_log_unitvolume").count()
fractions = ...  # Define a sampling fraction for each bucket

and use these for sampling:

df_log_bucketed.sampleBy("bucketed_log_unitvolume", fractions)
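
For concreteness, a minimal end-to-end sketch with illustrative values filled in for the splits and fractions placeholders (the actual split points and per-bucket fractions are assumptions here and should be chosen from your data):

from pyspark.ml.feature import Bucketizer
from pyspark.sql.functions import col, log

df_log = df.withColumn("log_unitvolume", log(col("unitvolume")))

# Illustrative split points -- pick them to cover the real range of log_unitvolume
splits = [-float("inf"), 2.0, 4.0, 6.0, 8.0, float("inf")]

bucketizer = Bucketizer(splits=splits, inputCol="log_unitvolume",
                        outputCol="bucketed_log_unitvolume")
df_log_bucketed = bucketizer.transform(df_log)

# Inspect how many rows land in each bucket (Bucketizer emits bucket ids as doubles)
df_log_bucketed.groupBy("bucketed_log_unitvolume").count().show()

# Illustrative per-bucket fractions -- downsample the over-represented buckets
fractions = {0.0: 1.0, 1.0: 0.5, 2.0: 0.2, 3.0: 0.1, 4.0: 0.05}
balanced = df_log_bucketed.sampleBy("bucketed_log_unitvolume", fractions, seed=42)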

You can also try to rescale log_unitvolume to the [0, 1] range and then:

from pyspark.sql.functions import rand 

df_log_rescaled.where(col("log_unitvolume_rescaled") < rand())
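
A minimal sketch of that rescaling step, assuming a plain min-max normalisation (the column names mirror the snippet above). One caveat: keeping rows where the rescaled weight is less than rand() keeps a row with probability one minus its weight; if you want the keep probability to be proportional to the weight, as the pandas weights= argument in the question does, compare the other way around:

from pyspark.sql.functions import col, min as min_, max as max_, rand

# Min-max rescale log_unitvolume into [0, 1]
lo, hi = df_log.agg(min_("log_unitvolume"), max_("log_unitvolume")).first()
df_log_rescaled = df_log.withColumn(
    "log_unitvolume_rescaled", (col("log_unitvolume") - lo) / (hi - lo))

# Keep each row with probability equal to its rescaled weight
sampled = df_log_rescaled.where(rand() < col("log_unitvolume_rescaled"))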
Alper t. Turker
  • Is there any similar way in spark 2.4 ?? I saw that Bucketizer is released from spark 3. – Solat Oct 30 '21 at 09:34
0

I think it will be better to simply ignore the .sample() function altogether. Sampling without replacement can be implemented with a uniform random number generator:

import pyspark.sql.functions as F

n_samples_appx = 100
total_weight = df.agg(F.sum('weight')).collect()[0][0]
df.filter(F.rand(seed=843) < F.col('weight') / total_weight * n_samples_appx)

This will randomly include/exclude rows from your dataset, which is typically comparable to sampling with replacement. You should be careful about interpretation if the right-hand side of the comparison (weight / total_weight * n_samples_appx) exceeds 1 -- weighted sampling is a nuanced process that, rigorously speaking, should only be performed with replacement.

So if you want to sample with replacement instead, you can draw a Poisson-distributed count for each row (with mean proportional to its weight), which will tell you how many copies of the row to include; you can either treat that count as a weight, or do some annoying joins & unions to duplicate your rows. But I find that this is typically not required.
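
A minimal sketch of that with-replacement variant, assuming numpy is available on the workers and using array_repeat/explode (Spark 2.4+) to duplicate rows instead of joins and unions; the expected number of copies of each row is n_samples_appx * weight / total_weight:

import numpy as np
import pyspark.sql.functions as F
from pyspark.sql.types import IntegerType

n_samples_appx = 100
total_weight = df.agg(F.sum('weight')).collect()[0][0]

# Draw a Poisson count per row with mean proportional to its weight
poisson_copies = F.udf(
    lambda w: int(np.random.poisson(n_samples_appx * w / total_weight)),
    IntegerType())

df_resampled = (
    df.withColumn('n_copies', poisson_copies('weight'))
      .filter(F.col('n_copies') > 0)
      # duplicate each row n_copies times
      .withColumn('_dup', F.explode(F.array_repeat(F.lit(1), F.col('n_copies'))))
      .drop('n_copies', '_dup'))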

You can also do this in a portable, repeatable way with a hash:

import pyspark.sql.functions as F

n_samples_appx = 100
total_weight = df.agg(F.sum('weight')).collect()[0][0]
df.filter(F.hash(F.col('id')) % (total_weight / (n_samples_appx * F.col('weight'))).astype('int') == 0)

This will sample at a rate of 1-in-modulo, which incorporates your weight. hash() is consistent and deterministic, so the same rows are selected on every run, but the selection still behaves like a random sample.

John Haberstroh
-1

One way to do it is to use a udf to make a sampling column. This column will have a random number multiplied by your desired weight. Then we sort by the sampling column, and take the top N.

Consider the following illustrative example:

Create Dummy Data

import numpy as np
import string
import pyspark.sql.functions as f
from pyspark.sql.types import FloatType

index = range(100)
weights = [i%26 for i in index]
labels = [string.ascii_uppercase[w] for w in weights]

df = sqlCtx.createDataFrame(
    zip(index, labels, weights),
    ('index', 'label', 'weight')
)

df.show(n=5)
#+-----+-----+------+
#|index|label|weight|
#+-----+-----+------+
#|    0|    A|     0|
#|    1|    B|     1|
#|    2|    C|     2|
#|    3|    D|     3|
#|    4|    E|     4|
#+-----+-----+------+
#only showing top 5 rows

Add Sampling Column

In this example, we want to sample the DataFrame using the column weight as the weight. We define a udf using numpy.random.random() to generate uniform random numbers and multiply by the weight. Then we use sort() on this column and use limit() to get the desired number of samples.

N = 10  # the number of samples

def get_sample_value(x):
    return np.random.random() * x

get_sample_value_udf = f.udf(get_sample_value, FloatType())

df_sample = df.withColumn('sampleVal', get_sample_value_udf(f.col('weight')))\
    .sort('sampleVal', ascending=False)\
    .select('index', 'label', 'weight')\
    .limit(N)

Result

As expected, the DataFrame df_sample has 10 rows, and its contents tend to have letters near the end of the alphabet (higher weights).

df_sample.count()
#10

df_sample.show()
#+-----+-----+------+
#|index|label|weight|
#+-----+-----+------+
#|   23|    X|    23|
#|   73|    V|    21|
#|   46|    U|    20|
#|   25|    Z|    25|
#|   19|    T|    19|
#|   96|    S|    18|
#|   75|    X|    23|
#|   48|    W|    22|
#|   51|    Z|    25|
#|   69|    R|    17|
#+-----+-----+------+
pault
  • This is a clever idea; however, it will not produce a sample that follows the distribution specified by the weights. Consider a DF with two tuples "A" and "B" with weights 1 and 2 respectively. If you do a little math (or simulation) you will notice that tuple A will be sampled with probability 0.25, which is smaller than expected given the distribution of weights. – Joel Aug 15 '22 at 00:45
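
A quick simulation of the two-row example from the comment above (weights 1 for A and 2 for B, keep the single row with the larger random-times-weight value) reproduces the 0.25 figure, versus the 1/3 a weight-proportional sample would give:

import numpy as np

rng = np.random.default_rng(0)
u = rng.random((1_000_000, 2))

# Row A wins when u_A * 1 > u_B * 2
print((u[:, 0] * 1 > u[:, 1] * 2).mean())   # ~0.25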