
I have a training dataset with a multi-class target variable, `category`:

train.groupby('category').size()

0     2220
1     4060
2      760
3     1480
4      220
5      440
6    23120
7     1960
8    64840

I would like to carve a validation set out of the train set by taking the same percentage of each class (say 20%), so that no class is missing from the validation set, which would spoil the model. The desired output is a DataFrame with the same structure as the train set, but with class counts like these:

0     444
1     812
2     152
3     296
4      44
5      88
6    4624
7     392
8   12968

Is there a straightforward way to do this in pandas?

Keithx
  • You can pass the `stratify` parameter in the sklearn `train_test_split` with the targets and it will stratify the data for you – G. Anderson Nov 26 '18 at 23:16

1 Answer


`groupby` and `sample` should do that for you:

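import numpy as np
import pandas as pd

df = pd.DataFrame({'category': np.random.choice(['a', 'b', 'c', 'd', 'e'], 100), 'val': np.random.randn(100)})

# Sample 20% of each category; level 1 of the resulting MultiIndex holds the original row labels
idx = df.groupby('category').apply(lambda x: x.sample(frac=0.2, random_state=0)).index.get_level_values(1)

# idx contains index labels, so select with .loc rather than .iloc
test = df.loc[idx].reset_index(drop=True)
train = df.drop(idx).reset_index(drop=True)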
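On pandas 1.1 or newer, the same idea can be written without `apply`, because grouped objects have their own `sample` method. The sketch below is only an illustration: it rebuilds a frame with the class sizes from the question, and the `val` column and the names `valid` and `train_rest` are made up for the example.

```python
import numpy as np
import pandas as pd

# Class sizes copied from the question; `val` is a made-up feature column.
class_sizes = {0: 2220, 1: 4060, 2: 760, 3: 1480, 4: 220,
               5: 440, 6: 23120, 7: 1960, 8: 64840}
category = np.repeat(list(class_sizes), list(class_sizes.values()))
rng = np.random.default_rng(0)
train = pd.DataFrame({"category": category,
                      "val": rng.standard_normal(category.size)})

# Draw 20% of the rows within each class; original index labels are kept.
valid = train.groupby("category").sample(frac=0.2, random_state=0)

# Everything that was not drawn stays in the reduced training set.
train_rest = train.drop(valid.index)

# Per-class row counts of the validation set.
print(valid.groupby("category").size())
```

Because `valid` keeps the original index labels, `train.drop(valid.index)` removes exactly the sampled rows, so the two frames never overlap.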

Edit: You can also use scikit-learn:

from sklearn.model_selection import train_test_split

df = pd.DataFrame({'category': np.random.choice(['a', 'b', 'c', 'd', 'e'], 100), 'val': np.random.randn(100)})

X = df[['val']].values      # features
y = df['category'].values   # target
# Stratify on the target so each category keeps its share in both splits
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, stratify=y)

X_train.shape, X_test.shape, y_train.shape, y_test.shape

((80, 1), (20, 1), (80,), (20,))
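If the goal is a validation DataFrame with the same columns as the train set, as the question asks, `train_test_split` can also be given the DataFrame itself. This is a rough sketch rather than a definitive recipe: the frame is rebuilt from the class sizes in the question, and the `val` column and the names `valid` and `train_rest` are assumptions made for the example.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Class sizes copied from the question; `val` is a made-up feature column.
class_sizes = {0: 2220, 1: 4060, 2: 760, 3: 1480, 4: 220,
               5: 440, 6: 23120, 7: 1960, 8: 64840}
category = np.repeat(list(class_sizes), list(class_sizes.values()))
rng = np.random.default_rng(0)
train = pd.DataFrame({"category": category,
                      "val": rng.standard_normal(category.size)})

# Splitting the DataFrame directly returns two DataFrames that keep the
# original columns and index labels; stratify on the target column.
train_rest, valid = train_test_split(
    train, test_size=0.2, stratify=train["category"], random_state=0
)

# Per-class row counts of the validation set.
print(valid["category"].value_counts().sort_index())
```

Setting `random_state` makes the split reproducible, and every class ends up in both frames in roughly the requested 80/20 proportion.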
Vaishali