I have the following classification problem. The training set has 50,000 rows, and Y has 60 labels. The data is unbalanced: one class has 35,000 values, and the other 59 classes share the remaining 15,000, with some classes having only 30 values. For example, with X (column_1, column_2, column_3) and Y:

column_1  column_2  column_3  Y
  0.5        1         2      1
  0.5        1.1       2      1
  0.55       0.95      3      1
  0.1        1         2      2
  2          0.9       3      3

I need to add "noisy" data so that the imbalance goes away, ideally so that every class ends up with the same number of values:

column_1  column_2  column_3  Y
  0.5        1         2      1
  0.5        1.1       2      1
  0.55       0.95      3      1
  0.1        1         2      2
  0.15       0.99      2      2
  0.05       1.01      2      2 
  2          0.9       3      3
  1.95       0.95      3      3
  2.05       0.85      3      3

This is only a toy example; my real data has many more values.

Vivek Kumar
Katrin
  • I can't locate a question here – Ofer Sadan Jun 10 '18 at 18:49
  • Are you trying to add another column that contains noise, or are you trying to alter the existing values with noise? It is not clear what you are trying to do. – gammazero Jun 10 '18 at 19:29
  • So you need to oversample the minority classes? What's the problem with that? There are some libraries available in Python which do this. What's your question? – Vivek Kumar Jun 11 '18 at 09:08

1 Answer

Although the question is not exactly clear, I think you're looking for help with oversampling the minority classes. A common approach is the SMOTE algorithm, which you can find in the imblearn package (distributed as imbalanced-learn).

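from imblearn.over_sampling import SMOTE

# "auto" resamples every class except the majority up to the majority's size.
# Older imblearn releases used `ratio=` and `fit_sample` for the same purpose.
sm = SMOTE(random_state=42, sampling_strategy="auto")
X_res, Y_res = sm.fit_resample(X_train, Y_train)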

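Just make sure you split your data into train and test groups first, and only then over-sample, so that synthetic points derived from one group don't end up in the other. In practice it is usually enough to over-sample the training group and leave the test group at its real class distribution. A fuller description here.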
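If what you literally want is the table from the question, i.e. noisy copies of existing rows until every class is the same size, that can be sketched without any library. Note that this is plain jittered random oversampling, not SMOTE, and the function name `jitter_oversample` and its `noise` parameter are made up for this illustration. It also adds noise to every column, so a discrete column such as column_3 would have to be excluded or rounded in real use.

```python
import random
from collections import Counter, defaultdict


def jitter_oversample(rows, labels, noise=0.05, seed=42):
    """Balance classes by appending noisy copies of existing rows.

    Hypothetical helper: each synthetic row is a randomly chosen real row of
    the same class with Gaussian noise added to every feature. The standard
    deviation is noise * |value|, or `noise` itself when the value is 0.
    """
    rng = random.Random(seed)

    # Group the original rows by class label.
    by_class = defaultdict(list)
    for row, label in zip(rows, labels):
        by_class[label].append(row)

    # Every class is topped up to the size of the largest class.
    target = max(len(members) for members in by_class.values())

    out_rows, out_labels = list(rows), list(labels)
    for label, members in by_class.items():
        for _ in range(target - len(members)):
            base = rng.choice(members)
            noisy = [v + rng.gauss(0.0, noise * abs(v) or noise) for v in base]
            out_rows.append(noisy)
            out_labels.append(label)
    return out_rows, out_labels


# Toy data from the question: class 1 has three rows, classes 2 and 3 one each.
X_train = [
    [0.5, 1.0, 2.0],
    [0.5, 1.1, 2.0],
    [0.55, 0.95, 3.0],
    [0.1, 1.0, 2.0],
    [2.0, 0.9, 3.0],
]
Y_train = [1, 1, 1, 2, 3]

X_res, Y_res = jitter_oversample(X_train, Y_train)
print(Counter(Y_res))  # class sizes after balancing
```

The original rows are kept unchanged at the front of the result and the synthetic rows are appended after them. As with SMOTE, this should be applied to the training split only.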

4Oh4