How can I efficiently compare matched cohorts in Spark?
In Python, sampling k observations from the majority class for each observation of the minority class in a highly imbalanced dataset can be implemented in a fairly straightforward way (e.g. matching a healthy control to each sick person by age and gender):
- Improve performance calculating a random sample matching specific conditions in pandas or python
- 1:1 stratified sampling per each group
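For reference, here is a minimal pandas sketch of what I mean; the column names (`sick`, `age`, `gender`) and the helper function are illustrative, not from a real schema:

```python
import numpy as np
import pandas as pd

def sample_matched_controls(df, k=1, seed=42):
    """For each sick person, sample k healthy people with the same age and gender."""
    rng = np.random.default_rng(seed)
    sick = df[df["sick"]]
    healthy = df[~df["sick"]]
    matched = []
    for _, case in sick.iterrows():
        # Candidate controls: same age and gender as this case.
        candidates = healthy[
            (healthy["age"] == case["age"]) & (healthy["gender"] == case["gender"])
        ]
        if len(candidates) >= k:
            # Draw k controls without replacement for this case.
            matched.append(candidates.sample(n=k, random_state=int(rng.integers(2**32))))
    return pd.concat(matched) if matched else healthy.iloc[:0]
```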
But how can this be scaled out in Spark? Naively, a self-join with a filter should work, but it fails because too many tuples are generated: the join materialises every sick/healthy pair within each matching group before any sampling can happen.
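A sketch of the naive self-join I have in mind (the DataFrame `df` and its columns `id`, `sick`, `age`, `gender` are assumptions on my part):

```python
from pyspark.sql import functions as F
from pyspark.sql.window import Window

k = 5  # controls per case

sick = df.filter(F.col("sick")).alias("s")
healthy = df.filter(~F.col("sick")).alias("h")

# Every sick person is paired with every healthy person sharing age and
# gender -- |sick| x |healthy| tuples per (age, gender) group.
pairs = sick.join(
    healthy,
    on=[F.col("s.age") == F.col("h.age"),
        F.col("s.gender") == F.col("h.gender")],
)

# Keep a random k controls per case; the explosion already happened above.
w = Window.partitionBy(F.col("s.id")).orderBy(F.rand(seed=42))
matched = pairs.withColumn("rn", F.row_number().over(w)).filter(F.col("rn") <= k)
```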
Are there smarter strategies? Maybe some clever hashing, like locality-sensitive hashing (LSH)?
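For instance, I was thinking of something along the lines of Spark ML's `BucketedRandomProjectionLSH`, which only generates candidate pairs whose hashes collide instead of the full per-group cross product. This is untested, and the feature columns and parameter values below are placeholders:

```python
from pyspark.ml.feature import VectorAssembler, BucketedRandomProjectionLSH

# Assumes gender is already numerically encoded, e.g. via StringIndexer.
assembler = VectorAssembler(inputCols=["age", "gender_idx"], outputCol="features")
sick_vec = assembler.transform(sick)
healthy_vec = assembler.transform(healthy)

lsh = BucketedRandomProjectionLSH(
    inputCol="features",
    outputCol="hashes",
    bucketLength=2.0,  # would need tuning
    numHashTables=3,
)
model = lsh.fit(healthy_vec)

# Approximate join on Euclidean distance: a small threshold keeps only
# close (age, gender) matches rather than all pairs.
candidates = model.approxSimilarityJoin(sick_vec, healthy_vec, threshold=0.5, distCol="dist")
```

But I'm not sure whether an approximate join like this can still guarantee exact matches on categorical variables such as gender, hence the question.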