Multiclass classification to balance in python (over sampling)

Question

I have a classification problem. The training set has 50,000 rows, and Y has 60 labels. But the data is unbalanced: one class has 35,000 samples, and the other 59 classes share the remaining 15,000, some of them with only 30. For example, given X (column_1, column_2, column_3) and Y:

column_1  column_2  column_3  Y
  0.5        1         2      1
  0.5        1.1       2      1
  0.55       0.95      3      1
  0.1        1         2      2
  2          0.9       3      3

I need to add "noisy" data so that there is no imbalance, that is, so that every class ends up with the same number of rows:

column_1  column_2  column_3  Y
  0.5        1         2      1
  0.5        1.1       2      1
  0.55       0.95      3      1
  0.1        1         2      2
  0.15       0.99      2      2
  0.05       1.01      2      2 
  2          0.9       3      3
  1.95       0.95      3      3
  2.05       0.85      3      3

This is only a toy example; my real data has many more values.

Are you trying to add another column that contains noise, or are you trying to alter the existing values with noise? It is not clear what you are trying to do. — gammazero, Jun 10 '18 at 19:29
So you need to oversample the minority classes? What's the problem with that? There are some libraries available in Python which do this. What's your question? — Vivek Kumar, Jun 11 '18 at 09:08

score 0 · Answer 1 · answered Jun 11 '18 at 09:40

Although the question is not exactly clear, I think you're looking for help with oversampling the minority classes. A common approach would be the SMOTE algorithm, which you can find in the imblearn package.

from imblearn.over_sampling import SMOTE

# Older imblearn releases used fit_sample() and a ratio argument instead
sm = SMOTE(random_state=42)
X_res, Y_res = sm.fit_resample(X_train, Y_train)

Just make sure you divide your data up into train and test groups first, and then over-sample each group separately so you don't end up with the same data in both. A fuller description here.
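
If the goal is literally the "noisy copies" shown in the question rather than SMOTE's interpolation between neighbours, a rough NumPy-only sketch could look like the following. This is not SMOTE and not part of imblearn: the function name oversample_with_noise and the noise_scale parameter are made up for illustration, and scaling the noise by each column's standard deviation is just one assumed choice.

```python
import numpy as np

def oversample_with_noise(X, y, noise_scale=0.05, seed=0):
    """Balance classes by appending jittered copies of existing rows.

    Hypothetical helper, not SMOTE: each minority class is topped up to the
    size of the largest class with randomly picked rows plus Gaussian noise.
    """
    rng = np.random.default_rng(seed)
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    classes, counts = np.unique(y, return_counts=True)
    target = counts.max()
    # Per-column spread, used so the noise is proportional to each feature
    feature_std = X.std(axis=0)
    X_parts, y_parts = [X], [y]
    for label, count in zip(classes, counts):
        n_extra = target - count
        if n_extra == 0:
            continue  # already as large as the biggest class
        rows = X[y == label]
        # Sample existing rows of this class with replacement
        picks = rows[rng.integers(0, len(rows), size=n_extra)]
        noise = rng.normal(0.0, 1.0, size=picks.shape) * noise_scale * feature_std
        X_parts.append(picks + noise)
        y_parts.append(np.full(n_extra, label))
    return np.vstack(X_parts), np.concatenate(y_parts)

# Toy data from the question
X = np.array([[0.5, 1.0, 2], [0.5, 1.1, 2], [0.55, 0.95, 3],
              [0.1, 1.0, 2], [2.0, 0.9, 3]])
y = np.array([1, 1, 1, 2, 3])
X_res, y_res = oversample_with_noise(X, y)
```

The original rows are kept unchanged at the top of the result and the noisy copies are appended after them. On real data this would normally be called on X_train and Y_train only, after the split, and noise_scale would need tuning per dataset.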
