Getting Validation set from Train set by using percentage from groupby() in pandas

Question

I have a train dataset with a multi-class target variable, category:

train.groupby('category').size()

0     2220
1     4060
2      760
3     1480
4      220
5      440
6    23120
7     1960
8    64840

I would like to build a new validation dataset from the train set by taking the same percentage of each class (let's say 20%), so that no class is missing from the validation set and the model is not spoiled. So basically the desired output would be a DataFrame with the same structure and information as the train set, but containing only that percentage of each class.

Is there a straightforward way to do this in pandas?

You can pass the `stratify` parameter in the sklearn `train_test_split` with the targets and it will stratify the data for you — G. Anderson, Nov 26 '18 at 23:16
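As a rough illustration of that comment, here is a minimal sketch that rebuilds a stand-in for the train set from the class sizes in the question and splits it with `stratify`. The `feature` column and the `random_state` value are assumptions added for the example; only the `category` counts come from the question.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Class sizes copied from the groupby output in the question.
class_sizes = {0: 2220, 1: 4060, 2: 760, 3: 1480, 4: 220,
               5: 440, 6: 23120, 7: 1960, 8: 64840}

# Stand-in for the real train set: the target column is reproduced from the
# counts above; "feature" is a hypothetical placeholder column.
category = np.repeat(list(class_sizes.keys()), list(class_sizes.values()))
train = pd.DataFrame({"category": category,
                      "feature": np.arange(len(category))})

# Passing a DataFrame returns DataFrames; stratify keeps the class
# proportions of train["category"] in both parts.
new_train, validation = train_test_split(
    train, test_size=0.2, stratify=train["category"], random_state=0
)

# Per-class row counts of the validation part.
print(validation.groupby("category").size())
```

Every class should end up in the validation part with roughly 20% of its rows, including the small classes such as 4 and 5.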

Vaishali · Accepted Answer · 2018-11-27T00:17:49.280

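Groupby and sample should do that for you:

import numpy as np
import pandas as pd

df = pd.DataFrame({'category': np.random.choice(['a', 'b', 'c', 'd', 'e'], 100), 'val': np.random.randn(100)})

# Sample 20% of each category; level 1 of the resulting MultiIndex holds the original row labels
idx = df.groupby('category').apply(lambda x: x.sample(frac=0.2, random_state=0)).index.get_level_values(1)

# Select and drop by label so both steps refer to the same rows
test = df.loc[idx].reset_index(drop=True)
train = df.drop(idx).reset_index(drop=True)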
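If your pandas version is 1.1 or newer, the same idea can probably be written without `apply`, because the groupby object has its own `sample` method. A minimal sketch, assuming the same kind of toy DataFrame as above (the seeds and the names `validation` and `train` are just for illustration):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
df = pd.DataFrame({
    "category": rng.choice(list("abcde"), 100),
    "val": rng.standard_normal(100),
})

# Sample 20% of the rows within each category (needs pandas >= 1.1).
validation = df.groupby("category").sample(frac=0.2, random_state=0)

# The sampled rows keep their original index labels, so the remaining
# rows can be obtained by dropping those labels.
train = df.drop(validation.index)
```

Since the sampled rows keep their original labels, there is no MultiIndex to unpack before calling `drop`.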

Edit: You can also use scikit-learn:

df = pd.DataFrame({'category': np.random.choice(['a', 'b', 'c', 'd', 'e'], 100), 'val': np.random.randn(100)})

from sklearn.model_selection import train_test_split

X = df[['val']].values      # features, shape (100, 1)
y = df['category'].values   # target classes, shape (100,)
# stratify=y keeps the class proportions of y in both splits
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, stratify=y)

X_train.shape, X_test.shape, y_train.shape, y_test.shape

((80, 1), (20, 1), (80,), (20,))

So how can I drop these values from the first set? By using drop()? — Keithx, Nov 26 '18 at 23:22
Another possible variant: pd.concat([train, validation]).drop_duplicates(keep=False) — Keithx, Nov 26 '18 at 23:50
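On the two ways of removing the validation rows mentioned in these comments, here is a small sketch comparing them on a hypothetical six-row frame (the data is made up for the example). It assumes the validation rows still carry their original index labels. One thing to watch: `drop_duplicates(keep=False)` also removes rows that were already duplicated within the train set, so the two approaches can give different results.

```python
import pandas as pd

# Hypothetical toy train set; rows 0 and 1 are identical on purpose.
train = pd.DataFrame({
    "category": [0, 0, 0, 1, 1, 1],
    "val": [1.0, 1.0, 2.0, 3.0, 4.0, 5.0],
})

# Pretend rows 2 and 5 were picked for validation.
validation = train.loc[[2, 5]]

# Variant 1: drop the validation rows by index label.
remaining_by_index = train.drop(validation.index)

# Variant 2: concatenate and discard every row that occurs more than once.
remaining_by_dedup = pd.concat([train, validation]).drop_duplicates(keep=False)

print(len(remaining_by_index))  # 4: rows 0, 1, 3, 4
print(len(remaining_by_dedup))  # 2: rows 3, 4 (the identical rows 0 and 1 are lost too)
```

Dropping by index looks like the safer choice whenever the train set may contain repeated rows.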
