
I have a dataframe, df, of the form:

    cat_var_1    cat_var_2     num_var_1
0    Orange       Monkey         34
1    Banana        Cat           56
2    Orange        Dog           22
3    Banana       Monkey          6
..

Suppose the possible values of cat_var_1 in the dataset occur in the ratios {'Orange': 0.6, 'Banana': 0.4} and the possible values of cat_var_2 occur in the ratios {'Monkey': 0.2, 'Cat': 0.7, 'Dog': 0.1}.

How do I split the data into train, test and validation sets (60:20:20 split) such that the ratios of the categorical variables are preserved? In practice, there can be any number of these variables, not just two. Also, the exact ratios may never be achieved in practice, but we would like to get as close as possible.

I have looked into the StratifiedKFold method from sklearn, described here: How to split a dataset into training and validation set keeping ratio between classes? However, it stratifies on only one categorical variable.

Additionally, I would be grateful if you could provide the complexity of the solution you achieve.

  • From the given three columns, which is the dependent variable? – YOLO Feb 26 '18 at 12:36
  • Why does it matter? – Melsauce Feb 26 '18 at 12:43
  • Hmm, maybe think about it in terms of supervised machine learning: what are your actual classes that you want to train a classifier to learn? Stratified sampling there usually refers to artificially causing the frequencies of these classes to be equal. Especially in heavily imbalanced scenarios and little training data, classifiers could otherwise simply learn to ignore a class as it's encountered too infrequently during training. If you want to keep priors as they actually are: do random sampling with enough data! If you don't have enough: sample randomly, then move samples until happy? – Jörn Hees Feb 26 '18 at 12:44
  • Thank you for the suggestion, but I have considered all of those things. I want to try this particular method for a problem. – Melsauce Feb 26 '18 at 12:47
  • pass `df.cat_var_1 + "_" + df.cat_var_2` to argument `Y` of `split()` – HYRY Mar 01 '18 at 07:36

1 Answer


You can pass df.cat_var_1 + "_" + df.cat_var_2 as the y argument of StratifiedShuffleSplit.split():
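For example, a minimal sketch of that approach (the 40% remainder is split in half again to get the 20:20 test/validation parts; the variable names here are illustrative):

    from sklearn.model_selection import StratifiedShuffleSplit

    # Treat each (cat_var_1, cat_var_2) combination as a single stratification label.
    labels = df.cat_var_1 + "_" + df.cat_var_2

    # First split: 60% train, 40% remainder.
    sss = StratifiedShuffleSplit(n_splits=1, test_size=0.4, random_state=0)
    train_idx, rest_idx = next(sss.split(df, labels))

    # Second split: divide the remainder 50/50 into test and validation.
    sss2 = StratifiedShuffleSplit(n_splits=1, test_size=0.5, random_state=0)
    test_pos, val_pos = next(sss2.split(df.iloc[rest_idx], labels.iloc[rest_idx]))
    idx_test, idx_validate = rest_idx[test_pos], rest_idx[val_pos]

    df_train, df_test, df_validate = df.iloc[train_idx], df.iloc[idx_test], df.iloc[idx_validate]

Note that every category combination has to appear at least twice in the data for sklearn to be able to stratify on it.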

But here is an alternative method that uses DataFrame.groupby:

    import random
    import pandas as pd
    import numpy as np

    # Build a synthetic dataframe whose categorical columns follow the desired ratios.
    nrows = 10000
    p1 = {'Orange': 0.6, 'Banana': 0.4}
    p2 = {'Monkey': 0.2, 'Cat': 0.7, 'Dog': 0.1}

    c1 = [key for key, val in p1.items() for i in range(int(nrows * val))]
    c2 = [key for key, val in p2.items() for i in range(int(nrows * val))]
    random.shuffle(c1)
    random.shuffle(c2)

    df = pd.DataFrame({"c1": c1, "c2": c2, "val": np.random.randint(0, 100, nrows)})

    # Split every (c1, c2) group 60:20:20 and collect the index pieces.
    index = []
    for key, idx in df.groupby(["c1", "c2"]).groups.items():
        arr = idx.values.copy()
        np.random.shuffle(arr)
        cut1 = int(0.6 * len(arr))
        cut2 = int(0.8 * len(arr))
        index.append(np.split(arr, [cut1, cut2]))

    # Concatenate the per-group pieces into the three index sets.
    idx_train, idx_test, idx_validate = list(map(np.concatenate, zip(*index)))
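To check that the split preserves the ratios and the 60:20:20 proportions, you can, for example, select the rows for each index set and compare the normalized value counts (the names df_train, df_test, df_validate below are just illustrative):

    # Select the rows for each split and compare category ratios to the targets.
    df_train = df.loc[idx_train]
    df_test = df.loc[idx_test]
    df_validate = df.loc[idx_validate]

    for name, part in [("train", df_train), ("test", df_test), ("validate", df_validate)]:
        print(name, round(len(part) / len(df), 3))
        print(part["c1"].value_counts(normalize=True))
        print(part["c2"].value_counts(normalize=True))

As for complexity: the groupby hashing, the per-group shuffles, and the final concatenations are each linear in the number of rows, so the whole split should run in roughly O(n) time, plus a small extra cost proportional to the number of distinct category combinations.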