
I am writing a simple piece of code to down-sample a dataframe when the target variable has more than 2 classes.

Let df be our arbitrary dataset and 'TARGET_VAR' a categorical variable with more than 2 classes.

import pandas as pd

label = 'TARGET_VAR'  # define the target variable

num_class = df[label].value_counts()  # count of each class value
temp = pd.DataFrame()  # create empty dataframe to be filled up

for cl in num_class.index:  # loop through classes
    # iteratively downsample every class to the size of the smallest
    # class 'min(num_class)' and append it to the dataframe
    temp = pd.concat([temp, df[df[label] == cl].sample(min(num_class))])

df = temp  # redefine initial dataframe as the subsampled one

del temp, num_class  # delete temporary objects

Now I was wondering: is there a way to do this more cleanly, e.g. without having to create the temporary dataframe? I tried to figure out a way to "vectorize" the operation for multiple classes but didn't get anywhere. Below is my idea, which can easily be implemented for 2 classes, but I have no idea how to extend it to the multi-class case.

This works perfectly if you have 2 classes:

df = pd.concat([df[df[label] == num_class.idxmin()],
                df[df[label] != num_class.idxmin()].sample(min(num_class))])

This picks the right total number of observations from the other classes, but those classes will not necessarily be equally represented:

df1 = pd.concat([df[df[label] == num_class.idxmin()],
                 df[df[label] != num_class.idxmin()].sample(min(num_class) * (len(num_class) - 1))])
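To see that imbalance concretely, here is a small self-contained sketch; the 3-class dataframe and its class counts are made up for illustration:

```python
import pandas as pd

# Made-up dataframe with 3 classes of sizes 2, 5 and 9
df = pd.DataFrame({'TARGET_VAR': ['a'] * 2 + ['b'] * 5 + ['c'] * 9,
                   'x': range(16)})
label = 'TARGET_VAR'
num_class = df[label].value_counts()

# Second approach: keep the minority class ('a', 2 rows) whole and
# draw 2 * (3 - 1) = 4 rows jointly from the remaining classes
df1 = pd.concat([
    df[df[label] == num_class.idxmin()],
    df[df[label] != num_class.idxmin()].sample(min(num_class) * (len(num_class) - 1)),
])

# The total is right (6 rows), but 'b' and 'c' are sampled from one
# pool, so they are not guaranteed to get 2 rows each
print(len(df1))  # 6
```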
CAPSLOCK

3 Answers


You could try something similar to this:

label='TARGET_VAR'

g = df.groupby(label, group_keys=False)
balanced_df = pd.DataFrame(g.apply(lambda x: x.sample(g.size().min()))).reset_index(drop=True)

I believe this will produce the result you want; feel free to ask any further questions.

Edit

Fixed the code according to OP's suggestion.
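For instance, on a small made-up dataframe (the class names and counts here are illustrative), the call yields equal class counts; note that on pandas >= 1.1 the built-in `DataFrameGroupBy.sample` can replace the `apply`:

```python
import pandas as pd

# Made-up imbalanced dataframe: class counts are 5, 3 and 8
df = pd.DataFrame({'TARGET_VAR': ['a'] * 5 + ['b'] * 3 + ['c'] * 8,
                   'x': range(16)})
label = 'TARGET_VAR'

g = df.groupby(label, group_keys=False)
balanced_df = pd.DataFrame(g.apply(lambda x: x.sample(g.size().min()))).reset_index(drop=True)

# Every class is now represented by the minority-class count (3)
print(balanced_df[label].value_counts())
```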

Gustavo Fonseca
  • Hi Gustavo, thanks for your answer! I didn't know the groupby method; it's really powerful! However, this way I am still creating a temporary object. Any idea of its properties? Do you know if it could become problematic (mainly in terms of memory) if the original dataframe `df` gets very big? – CAPSLOCK Mar 12 '19 at 14:32
  • I believe it will perform similarly to your previous approach, just cleaner. But feel free to mock a big dataframe and benchmark both methods, then pick the one that suits you better. – Gustavo Fonseca Mar 12 '19 at 17:30
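The benchmark suggested in the comments can be sketched roughly as follows; the dataframe size and class probabilities are arbitrary:

```python
import timeit

import numpy as np
import pandas as pd

# Mock dataframe: 100k rows, 3 classes with skewed frequencies
rng = np.random.default_rng(0)
df = pd.DataFrame({'TARGET_VAR': rng.choice(list('abc'), size=100_000, p=[0.6, 0.3, 0.1]),
                   'x': rng.random(100_000)})
label = 'TARGET_VAR'

def loop_version():
    # original loop-and-concat approach from the question
    num_class = df[label].value_counts()
    parts = [df[df[label] == cl].sample(min(num_class)) for cl in num_class.index]
    return pd.concat(parts)

def groupby_version():
    # groupby/apply approach from this answer
    g = df.groupby(label, group_keys=False)
    return g.apply(lambda x: x.sample(g.size().min())).reset_index(drop=True)

print('loop:   ', timeit.timeit(loop_version, number=3))
print('groupby:', timeit.timeit(groupby_version, number=3))
```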

This code can be used either to oversample instances of the minority class or to undersample instances of the majority class. It should be applied to the training set only. Note: activity is the label.

balanced_df = Pdf_train.groupby('activity', as_index=False, group_keys=False).apply(lambda s: s.sample(100, replace=True))
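A minimal sketch of this behaviour on made-up data (the `activity` values and the sample size of 4 are illustrative); `replace=True` lets the small class be oversampled and the large class undersampled to the same size:

```python
import pandas as pd

# Made-up training frame: 2 rows of 'walk', 6 rows of 'sit'
Pdf_train = pd.DataFrame({'activity': ['walk'] * 2 + ['sit'] * 6,
                          'x': range(8)})

# Sampling n=4 with replacement duplicates 'walk' rows (oversampling)
# and drops some 'sit' rows (undersampling)
balanced_df = (Pdf_train
               .groupby('activity', as_index=False, group_keys=False)
               .apply(lambda s: s.sample(4, replace=True)))

print(balanced_df['activity'].value_counts())  # 4 of each
```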
javac

Gustavo's answer is correct but has a small problem (and for some reason I can't edit his answer).

label='TARGET_VAR'

g = df.groupby(label, group_keys=False)
balanced_df = pd.DataFrame(
    g.apply(lambda x: x.sample(g.size().min()).reset_index(drop=True)))

Here the index is reset within each group, so the final dataframe will have repeating row indices. If we define the number of elements in the minority class as n, the index will look like this:

idx, data 
0,   ...
1,   ...
.,   ...
.,   ...
.,   ...
n-1, ...
0,   ...
1,   ...
.,   ...
.,   ...
.,   ...
n-1, ...

The following tweak solves the issue:

g = df.groupby(label, group_keys=False)
balanced_df = pd.DataFrame(
    g.apply(lambda x: x.sample(g.size().min()))).reset_index(drop=True)

If we now define the total number of elements of balanced_df as N = n*k, with k being the number of different classes, the index will look like this:

idx, data 
0,   ...
1,   ...
.,   ...
.,   ...
.,   ...
N-1, ...
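The difference between the two `reset_index` placements can be checked directly; the toy dataframe below (class counts 3 and 5) is made up for illustration:

```python
import pandas as pd

# Toy frame: minority class 'a' has 3 rows, so n = 3 and N = 6
df = pd.DataFrame({'TARGET_VAR': ['a'] * 3 + ['b'] * 5,
                   'x': range(8)})
label = 'TARGET_VAR'
g = df.groupby(label, group_keys=False)

# reset inside the lambda: each group restarts at 0, labels repeat
inner = g.apply(lambda x: x.sample(g.size().min()).reset_index(drop=True))
# reset after concatenation: one clean run 0..N-1
outer = g.apply(lambda x: x.sample(g.size().min())).reset_index(drop=True)

print(inner.index.is_unique)  # False: 0..n-1 repeats per class
print(outer.index.is_unique)  # True: single run 0..N-1
```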
CAPSLOCK