I have a dataframe for single-label binary classification with some class imbalance and I want to make a train-test split. Some observations are members of groups in the data that should only appear in either the test split or train split but not both.
Outside of PySpark, I could use StratifiedGroupKFold from sklearn. What is the easiest way to achieve the same effect with PySpark?
I looked at the sampleBy method from PySpark, but I'm not sure how to use it while keeping the groups separate.
Documentation links: