I have an unbalanced dataset and I want to perform random subsampling on the majority class, where each subsample is the same size as the minority class. I think this is already implemented in Weka and Matlab; is there an equivalent in sklearn?

Ophilia

1 Answer

Say your data looks like something generated from this code:

import numpy as np

x = np.random.randn(100, 3)
y = np.array([int(i % 5 == 0) for i in range(100)])

(Only one fifth of y is 1, so 1 is the minority class.)

To find the size of the minority class, do:

>>> np.sum(y == 1)
20
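
If the labels are not known in advance, the minority class and its size can be found without hard-coding the label 1. A minimal sketch, assuming integer class labels in y as in the example above (the names labels, counts, minority_label and minority_size are only illustrative):

```python
import numpy as np

# Same toy labels as above: every fifth entry is 1.
y = np.array([int(i % 5 == 0) for i in range(100)])

# Count how often each distinct label occurs.
labels, counts = np.unique(y, return_counts=True)

# The minority class is the label with the smallest count.
minority_label = labels[np.argmin(counts)]
minority_size = counts.min()

print(minority_label, minority_size)  # prints: 1 20
```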

To find the subset that consists of the majority class, do:

majority_x, majority_y = x[y == 0, :], y[y == 0]

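To pick a random subset of size 20 with no row chosen twice, do (np.random.choice samples with replacement by default, so pass replace=False):

inds = np.random.choice(majority_x.shape[0], 20, replace=False)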

followed by

majority_x[inds, :]

and

majority_y[inds]
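
Putting the steps together: assuming the goal is a balanced set to feed to a classifier, the subsampled majority rows can be stacked with all of the minority rows. This is only a sketch under that assumption, not a built-in sklearn routine; the helper name undersample_majority and its seed argument are made up for illustration.

```python
import numpy as np


def undersample_majority(x, y, seed=None):
    """Hypothetical helper: keep an equal number of rows from every class.

    Each class is randomly cut down to the size of the smallest class.
    """
    rng = np.random.default_rng(seed)
    labels, counts = np.unique(y, return_counts=True)
    n_min = counts.min()  # size of the minority class

    keep = []
    for label in labels:
        idx = np.flatnonzero(y == label)
        # Sample without replacement so no row is picked twice.
        keep.append(rng.choice(idx, size=n_min, replace=False))

    keep = np.concatenate(keep)
    rng.shuffle(keep)  # mix the classes instead of leaving them in blocks
    return x[keep], y[keep]


x = np.random.randn(100, 3)
y = np.array([int(i % 5 == 0) for i in range(100)])

x_bal, y_bal = undersample_majority(x, y, seed=0)
print(x_bal.shape, np.bincount(y_bal))  # prints: (40, 3) [20 20]
```

To subsample the training data only, one option is to split into training and test sets first and call the helper on the training portion, leaving the test set untouched.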
Ami Tavory
  • Thanks, but it splits the data into training and testing sets. I want to randomly subsample the training set only; is it still possible to use it? I am not sure how the function is implemented in Weka in detail, but I am looking for the same thing. – Ophilia Jan 16 '16 at 19:17
  • Got it, misunderstood your question. Will write a different answer. – Ami Tavory Jan 16 '16 at 19:24
  • Sorry, I forgot to ask: how can I put the new sample majority_x and majority_y in place of the old one in x and y, so that I can use it together with the other class as input for a classifier? – Ophilia Jan 30 '16 at 15:26
  • @user2739381 I'll be very happy to have a look, but could you please open it as a separate question? It's very difficult to hold a dialog in the comments section. – Ami Tavory Jan 30 '16 at 16:01
  • Many thanks, I submitted it as a separate question here: http://stackoverflow.com/questions/35106112/subsampling-classifying-using-scikit-learn – Ophilia Jan 30 '16 at 19:31