Equivalent of sklearn's StratifiedGroupKFold for PySpark?

Asked Oct 23 '21 at 07:15

Active Oct 23 '21 at 07:15

Viewed 149 times

I have a dataframe for single-label binary classification with some class imbalance, and I want to make a train-test split. Some observations belong to groups, and each group should appear in either the train split or the test split, but not both.

Outside of PySpark, I could use StratifiedGroupKFold from sklearn. What is the easiest way to achieve the same effect with PySpark?

I looked at PySpark's sampleBy method, but I'm not sure how to use it while keeping each group confined to a single split.

Documentation links:

StratifiedGroupKFold
sampleBy

asked Oct 23 '21 at 07:15

michen00

0 Answers