
How can I efficiently compare matched cohorts in Spark?

In Python, sampling k observations from the majority class for each observation of the minority class in a highly imbalanced dataset can be implemented fairly straightforwardly (e.g. matching a healthy person to each sick person by age and gender):

- Improve performance calculating a random sample matching specific conditions in pandas or python
- 1:1 stratified sampling per each group
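For reference, the pandas baseline I have in mind looks roughly like the following (a minimal sketch; the column names `id`, `sick`, `gender`, `age` and the helper `match_controls` are hypothetical):

```python
import pandas as pd

def match_controls(df, k=1, keys=("gender", "age"), seed=0):
    """For every case (sick == 1), draw up to k controls with the same keys."""
    cases = df[df["sick"] == 1]
    controls = df[df["sick"] == 0]
    # Merge cases with controls on the matching keys only.
    pairs = cases.merge(controls, on=list(keys), suffixes=("_case", "_control"))
    # Sample k controls per case.
    return (
        pairs.groupby("id_case", group_keys=False)
        .apply(lambda g: g.sample(min(k, len(g)), random_state=seed))
    )

df = pd.DataFrame({
    "id": [1, 2, 3, 4, 5],
    "sick": [1, 0, 0, 1, 0],
    "gender": ["m", "m", "m", "f", "f"],
    "age": [34, 34, 34, 50, 50],
})
matched = match_controls(df, k=1)
```

This works well in memory, but the merge-then-sample pattern is exactly what blows up when translated naively to a distributed join.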

But how can this be scaled out in Spark? Naively, a self-join with a filter should work, but it fails because too many tuples are generated.

Are there smarter strategies, perhaps something like locality-sensitive hashing (LSH)?

Georg Heiler
