random subsampling of the majority class

Question

I have an imbalanced dataset and I want to randomly subsample the majority class so that each subsample is the same size as the minority class. I think this is already implemented in Weka and MATLAB; is there an equivalent in sklearn?

Ami Tavory · Accepted Answer · 2016-01-16T19:33:52.333

Say your data looks like something generated from this code:

import numpy as np

x = np.random.randn(100, 3)
y = np.array([int(i % 5 == 0) for i in range(100)])

(Only one fifth of the entries in y are 1, so 1 is the minority class.)

To find the size of the minority class, do:

>>> np.sum(y == 1)
20

To find the subset that consists of the majority class, do:

majority_x, majority_y = x[y == 0, :], y[y == 0]

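To draw a random subset of 20 majority rows, sample the row indices without replacement (np.random.choice samples with replacement by default, which could pick the same row twice):

inds = np.random.choice(majority_x.shape[0], 20, replace=False)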

followed by

majority_x[inds, :]

and

majority_y[inds]
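
The steps above can be combined into one helper that returns a balanced data set ready to pass to a classifier. This is only a minimal sketch: the function name undersample_majority and its seed parameter are hypothetical, not part of sklearn or NumPy, and it assumes x and y are NumPy arrays with one label per row.

```python
import numpy as np


def undersample_majority(x, y, seed=None):
    """Keep an equally sized random sample of rows from every class.

    Hypothetical helper: each class is cut down to the size of the
    smallest class, sampling row indices without replacement.
    """
    rng = np.random.default_rng(seed)
    labels, counts = np.unique(y, return_counts=True)
    n_keep = counts.min()  # size of the minority class

    # For each class, pick n_keep distinct row indices.
    keep = np.concatenate([
        rng.choice(np.flatnonzero(y == label), size=n_keep, replace=False)
        for label in labels
    ])
    rng.shuffle(keep)  # avoid returning the rows grouped by class
    return x[keep], y[keep]


# Same toy data as above: 20 rows of class 1, 80 rows of class 0.
x = np.random.randn(100, 3)
y = np.array([int(i % 5 == 0) for i in range(100)])

x_bal, y_bal = undersample_majority(x, y, seed=0)
print(x_bal.shape, np.bincount(y_bal))  # (40, 3) [20 20]
```

Calling the helper again with a different seed gives a different subsample of the majority class, while the minority rows are always kept in full.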

edited Jan 16 '16 at 19:33

answered Jan 16 '16 at 16:50

Thanks, but it splits the data into training and testing sets. I want to randomly subsample the training set only; is it still possible to use it? I am not sure how the function is implemented in Weka in detail, but I am looking for the same thing. – Ophilia Jan 16 '16 at 19:17
Got it, misunderstood your question. Will write a different answer. – Ami Tavory Jan 16 '16 at 19:24
Sorry, I forgot to ask: how can I replace the old majority-class rows in x and y with the new sample majority_x and majority_y, so that I can use it together with the other class as input for a classifier? – Ophilia Jan 30 '16 at 15:26
@user2739381 I'll be very happy to have a look, but could you please open it as a separate question? It's very difficult to hold a dialog in the comments section. – Ami Tavory Jan 30 '16 at 16:01
Many thanks, I submitted it as a separate question here: http://stackoverflow.com/questions/35106112/subsampling-classifying-using-scikit-learn – Ophilia Jan 30 '16 at 19:31
