4

I have a Pandas DataFrame. I am trying to draw a sample from it that is both taken with replacement and stratified.

This lets me sample with replacement:

df_test = df.sample(n=100, replace=True, random_state=42, axis=0)

However, I am not sure how to also stratify. Can I use the weights parameter, and if so, how? The columns I want to stratify on contain strings.

This allows me to stratify:

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    X, Y, test_size=.50, stratify=Y, random_state=42)

However, there is no option to sample with replacement.

How can I do both: stratify and sample with replacement?

Alfe

2 Answers

4

This is a fairly old question, but since Google returned it as the top result when I was looking for the same thing, I thought it would be useful to leave this here for everybody, including my future self.

Apparently sklearn offers this functionality in sklearn.utils.resample:

from sklearn import datasets
from sklearn.utils import resample

X, y = datasets.load_iris(return_X_y=True)
X_new, y_new = resample(X, y, stratify=y)

You can control the number of samples with the n_samples parameter. By default it is set to None, so you get back X.shape[0] random samples drawn with replacement (the function was designed for bootstrapping). Hope this helps someone.
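Since the question is about a DataFrame with string columns, here is a minimal sketch of the same call applied to one. The toy data and the column names `group` and `value` are made up for illustration, and it assumes sklearn 0.21 or later, where `stratify` is available in `resample`.

```python
import pandas as pd
from sklearn.utils import resample

# Hypothetical toy data: "group" is the string column to stratify on.
df = pd.DataFrame({
    "group": ["a"] * 60 + ["b"] * 30 + ["c"] * 10,
    "value": range(100),
})

# Draw 100 rows with replacement while keeping the proportions of "group".
df_boot = resample(
    df,
    n_samples=100,
    replace=True,
    stratify=df["group"].to_numpy(),
    random_state=42,
)

# Class proportions are preserved, and some rows are drawn more than once.
print(df_boot["group"].value_counts())
print("repeated rows:", df_boot.index.duplicated().any())
```

The result is still a DataFrame, so it can be used as a drop-in replacement for the output of `df.sample(...)`.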

agaldran
  • Note that the option `stratify` was introduced only with [sklearn v0.21](https://scikit-learn.org/stable/whats_new.html#sklearn-utils) – normanius Sep 25 '19 at 15:53
0

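As far as I know, StratifiedShuffleSplit from sklearn produces splits that are not mutually exclusive, i.e. the same sample can land in the test set of more than one split. Note that within a single split each sample still appears at most once, so this is not a bootstrap-style sample with replacement. I hope I understood you correctly.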

import numpy as np
from sklearn.model_selection import StratifiedShuffleSplit

# Six samples, two classes with three samples each
X = np.array([[1, 2], [3, 4], [1, 2], [3, 4], [1, 2], [3, 4]])
y = np.array([0, 0, 0, 1, 1, 1])

# Five independent stratified shuffles, each holding out half of the data
sss = StratifiedShuffleSplit(n_splits=5, test_size=0.5, random_state=0)

for train_index, test_index in sss.split(X, y):
    print("TRAIN:", train_index, "TEST:", test_index)
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]

yields:

TRAIN: [5 2 3] TEST: [4 1 0]
TRAIN: [5 1 4] TEST: [0 2 3]
TRAIN: [5 0 2] TEST: [4 3 1]
TRAIN: [4 1 0] TEST: [2 3 5]
TRAIN: [0 5 1] TEST: [3 4 2]
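To see what kind of overlap this gives, the following sketch reuses the same toy data and checks two things: whether any index is repeated inside a single test set, and whether indices recur across the five test sets. The variable names `no_repeats_within` and `overlap_across` are only illustrative.

```python
import numpy as np
from sklearn.model_selection import StratifiedShuffleSplit

X = np.array([[1, 2], [3, 4], [1, 2], [3, 4], [1, 2], [3, 4]])
y = np.array([0, 0, 0, 1, 1, 1])
sss = StratifiedShuffleSplit(n_splits=5, test_size=0.5, random_state=0)

# Collect the test indices of every split.
test_sets = [test_index for _, test_index in sss.split(X, y)]

# Within one split, no index is drawn twice ...
no_repeats_within = all(len(set(t)) == len(t) for t in test_sets)

# ... but the same index shows up in the test sets of different splits.
all_test = np.concatenate(test_sets)
overlap_across = len(set(all_test)) < len(all_test)

print(no_repeats_within, overlap_across)  # True True
```

So the splits overlap with each other, but each individual split is drawn without replacement.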
Max