I have an unbalanced dataset and I want to perform random subsampling on the majority class, where each subsample is the same size as the minority class. I think this is already implemented in Weka and Matlab; is there an equivalent in sklearn?
1 Answer
Say your data looks like something generated from this code:
import numpy as np
x = np.random.randn(100, 3)
y = np.array([int(i % 5 == 0) for i in range(100)])
(Only 1/5 of the entries in y are 1, which is the minority class.)
To find the size of the minority class, do:
>>> np.sum(y == 1)
20
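If the minority label is not known in advance, it can be looked up from the label counts. The following is a minimal sketch, assuming y is a one-dimensional array of class labels; the names minority_label and minority_size are illustrative and not part of the original answer.

```python
import numpy as np

# Same labels as in the example above: every fifth entry is 1.
y = np.array([int(i % 5 == 0) for i in range(100)])

# np.unique returns the sorted distinct labels and how often each occurs.
labels, counts = np.unique(y, return_counts=True)

# The minority class is the label with the smallest count.
minority_label = int(labels[np.argmin(counts)])
minority_size = int(counts.min())
```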
To find the subset that consists of the majority class, do:
majority_x, majority_y = x[y == 0, :], y[y == 0]
To draw a random subset of 20 rows (the size of the minority class) without replacement, so that no row is picked twice, do:
inds = np.random.choice(majority_x.shape[0], 20, replace=False)
followed by
majority_x[inds, :]
and
majority_y[inds]
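The steps above can be combined into one helper that returns a balanced dataset. This is only a sketch under a few assumptions: the problem is binary, the minority label is passed in explicitly, and the function name undersample_majority and its seed parameter are hypothetical, not an sklearn API.

```python
import numpy as np

def undersample_majority(x, y, minority_label=1, seed=None):
    """Hypothetical helper: keep all minority rows plus an equally sized
    random subset of the majority rows."""
    rng = np.random.default_rng(seed)
    minority_idx = np.flatnonzero(y == minority_label)
    majority_idx = np.flatnonzero(y != minority_label)
    # Sample without replacement so no majority row is picked twice.
    kept_majority = rng.choice(majority_idx, size=minority_idx.size, replace=False)
    keep = np.concatenate([minority_idx, kept_majority])
    # Shuffle so the two classes are interleaved in the output.
    rng.shuffle(keep)
    return x[keep], y[keep]

x = np.random.randn(100, 3)
y = np.array([int(i % 5 == 0) for i in range(100)])
x_bal, y_bal = undersample_majority(x, y, seed=0)
```

If the data has already been split into training and test sets, the helper would be applied to the training portion only, leaving the test set untouched.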

Ami Tavory
- Thanks, but it splits the data into training and testing sets. I want to randomly subsample the training set only; is it still possible to use it? I am not sure how the function is implemented in Weka in detail, but I am looking for the same thing. – Ophilia Jan 16 '16 at 19:17
- Got it, I misunderstood your question. Will write a different answer. – Ami Tavory Jan 16 '16 at 19:24
- Sorry, I forgot to ask: how can I replace the old majority-class rows in x and y with the new samples majority_x and majority_y, so that I can use them together with the other class as input for a classifier? – Ophilia Jan 30 '16 at 15:26
- @user2739381 I'll be very happy to have a look, but could you please open it as a separate question? It's very difficult to hold a dialog in the comments section. – Ami Tavory Jan 30 '16 at 16:01
- Many thanks, I submitted it as another question here: http://stackoverflow.com/questions/35106112/subsampling-classifying-using-scikit-learn – Ophilia Jan 30 '16 at 19:31