How can I efficiently compare matched cohorts in Spark?
In Python, sampling k observations from the majority class for each observation of the minority class in a highly imbalanced dataset can be implemented in a fairly straightforward way (e.g. matching a healthy control to each sick person by age and gender):
- Improve performance calculating a random sample matching specific conditions in pandas or python
- 1:1 stratified sampling per each group
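For reference, here is a minimal pandas sketch of what I mean; the column names (`sick`, `age`, `gender`) and the helper function are illustrative, not from a real schema:

```python
import numpy as np
import pandas as pd

def sample_matched_controls(df, k=1, seed=42):
    """For each sick person, sample k healthy people with the same age and gender."""
    rng = np.random.default_rng(seed)
    sick = df[df["sick"]]
    healthy = df[~df["sick"]]
    matched = []
    for _, case in sick.iterrows():
        # Candidate controls: same age and gender as this case.
        candidates = healthy[
            (healthy["age"] == case["age"]) & (healthy["gender"] == case["gender"])
        ]
        if len(candidates) >= k:
            # Draw k controls without replacement for this case.
            matched.append(candidates.sample(n=k, random_state=int(rng.integers(2**32))))
    return pd.concat(matched) if matched else healthy.iloc[:0]
```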
But how can this be scaled out in Spark? Naively, a self-join with a filter should work, but it fails because too many tuples are generated: the join materialises every sick/healthy pair within each matching group before any sampling can happen.
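A sketch of the naive self-join I have in mind (the DataFrame `df` and its columns `id`, `sick`, `age`, `gender` are assumptions on my part):

```python
from pyspark.sql import functions as F
from pyspark.sql.window import Window

k = 5  # controls per case

sick = df.filter(F.col("sick")).alias("s")
healthy = df.filter(~F.col("sick")).alias("h")

# Every sick person is paired with every healthy person sharing age and
# gender -- |sick| x |healthy| tuples per (age, gender) group.
pairs = sick.join(
    healthy,
    on=[F.col("s.age") == F.col("h.age"),
        F.col("s.gender") == F.col("h.gender")],
)

# Keep a random k controls per case; the explosion already happened above.
w = Window.partitionBy(F.col("s.id")).orderBy(F.rand(seed=42))
matched = pairs.withColumn("rn", F.row_number().over(w)).filter(F.col("rn") <= k)
```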
Are there smarter strategies? Maybe some clever hashing, like locality-sensitive hashing (LSH)?
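For instance, I was thinking of something along the lines of Spark ML's `BucketedRandomProjectionLSH`, which only generates candidate pairs whose hashes collide instead of the full per-group cross product. This is untested, and the feature columns and parameter values below are placeholders:

```python
from pyspark.ml.feature import VectorAssembler, BucketedRandomProjectionLSH

# Assumes gender is already numerically encoded, e.g. via StringIndexer.
assembler = VectorAssembler(inputCols=["age", "gender_idx"], outputCol="features")
sick_vec = assembler.transform(sick)
healthy_vec = assembler.transform(healthy)

lsh = BucketedRandomProjectionLSH(
    inputCol="features",
    outputCol="hashes",
    bucketLength=2.0,  # would need tuning
    numHashTables=3,
)
model = lsh.fit(healthy_vec)

# Approximate join on Euclidean distance: a small threshold keeps only
# close (age, gender) matches rather than all pairs.
candidates = model.approxSimilarityJoin(sick_vec, healthy_vec, threshold=0.5, distCol="dist")
```

But I'm not sure whether an approximate join like this can still guarantee exact matches on categorical variables such as gender, hence the question.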