I have a dataframe for single-label binary classification with some class imbalance, and I want to make a train-test split. Some observations belong to groups, and each group should appear in either the train split or the test split, but not both.

Outside of PySpark, I could use StratifiedGroupKFold from sklearn. What is the easiest way to achieve the same effect with PySpark?

I looked at the sampleBy method from PySpark, but I'm not sure how to use it while keeping the groups separate.

michen00
