
I'm using an MLPClassifier to classify heart disease. I used imblearn.SMOTE to balance the examples of each class and was getting very good results (85% balanced accuracy), but I was advised that I should not apply SMOTE to the test data, only to the training data. After I made this change, the performance of my classifier dropped sharply (~35% balanced accuracy) and I don't know what could be wrong.

Here is a simple benchmark with training data balanced but test data unbalanced:

[Screenshot: results from prediction]

And this is the code:

    from sklearn.model_selection import train_test_split
    from sklearn.neural_network import MLPClassifier

    def makeOverSamplesSMOTE(X, y):
        from imblearn.over_sampling import SMOTE
        sm = SMOTE(sampling_strategy='all')
        X, y = sm.fit_sample(X, y)
        return X, y

    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=20)

    ## Normalize data
    from sklearn.preprocessing import StandardScaler
    sc_X = StandardScaler()
    X_train = sc_X.fit_transform(X_train)
    X_test = sc_X.fit_transform(X_test)

    ## SMOTE only on training data
    X_train, y_train = makeOverSamplesSMOTE(X_train, y_train)

    clf = MLPClassifier(hidden_layer_sizes=(20),verbose=10,
                        learning_rate_init=0.5, max_iter=2000, 
                        activation='logistic', solver='sgd', shuffle=True, random_state=30)

    clf.fit(X_train, y_train)
    y_pred = clf.predict(X_test)

I'd like to know what I'm doing wrong, since this seems to be the proper way of preparing the data.
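For reference, the balanced accuracy is measured on the untouched test set; here is a minimal sketch of that step, using sklearn's balanced_accuracy_score with y_test and y_pred from the code above:

    from sklearn.metrics import balanced_accuracy_score, classification_report

    # Balanced accuracy = mean of the per-class recalls, so the majority class cannot inflate it
    print("Balanced accuracy:", balanced_accuracy_score(y_test, y_pred))

    # Per-class precision/recall/F1 on the (imbalanced) test set
    print(classification_report(y_test, y_pred))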

heresthebuzz

2 Answers


The first mistake in your code is in how you standardize the data. You only need to fit the StandardScaler once, on X_train; you shouldn't refit it on X_test. The corrected code is:

    from sklearn.model_selection import train_test_split
    from sklearn.neural_network import MLPClassifier

    def makeOverSamplesSMOTE(X, y):
        from imblearn.over_sampling import SMOTE
        sm = SMOTE(sampling_strategy='all')
        X, y = sm.fit_sample(X, y)  # fit_resample in newer imbalanced-learn versions
        return X, y

    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=20)

    ## Normalize data
    from sklearn.preprocessing import StandardScaler
    sc_X = StandardScaler()
    X_train = sc_X.fit_transform(X_train)
    X_test = sc_X.transform(X_test)  # transform only; the scaler stays fitted on X_train

    ## SMOTE only on training data
    X_train, y_train = makeOverSamplesSMOTE(X_train, y_train)

    clf = MLPClassifier(hidden_layer_sizes=(20,), verbose=10,
                        learning_rate_init=0.5, max_iter=2000,
                        activation='logistic', solver='sgd', shuffle=True, random_state=30)

    clf.fit(X_train, y_train)
    y_pred = clf.predict(X_test)

For the machine learning model, try reducing the learning rate; 0.5 is far too high (the default learning_rate_init in sklearn is 0.001). Also try changing the activation function and the number of hidden layers. Not every model works well on every dataset, so you may need to look at your data and choose the model accordingly.
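Something along these lines could be a starting point (a sketch only; the hidden_layer_sizes, activation, and solver values here are illustrative, not tuned for your data):

    clf = MLPClassifier(hidden_layer_sizes=(50, 25),  # illustrative: two smaller hidden layers
                        learning_rate_init=0.001,     # sklearn's default, far below 0.5
                        max_iter=2000,
                        activation='relu',            # alternative to 'logistic' worth trying
                        solver='adam', shuffle=True, random_state=30)
    clf.fit(X_train, y_train)
    y_pred = clf.predict(X_test)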

secretive
  • Can you please update your code considering the SMOTE application? I don't see where it goes. Another thing: your code is actually applying the scaler to the test data, but in your answer you said not to use it. I got confused. – heresthebuzz Jul 26 '19 at 16:26
  • I am transforming the test data, but I am not fitting `sc_X` on that data. – secretive Jul 26 '19 at 16:33
  • I made the changes as you said, but there was no better result. About the classifier, there's nothing wrong with it, since an MLP is pretty good for this kind of problem. These low results appeared only after oversampling just the training set instead of the whole dataset. – heresthebuzz Jul 26 '19 at 20:43
  • No, an MLP is not the best performer on an imbalanced dataset. One reason for this loss in prediction is that the model is simply overfitting to the training data, which is an oversampled version of a smaller dataset. Check the training accuracy, and also check the per-class accuracy. – secretive Jul 27 '19 at 14:23
  • Try cross-validation along with the NN algorithm; it may provide a better result (see the sketch below). – SUN Aug 02 '19 at 00:43
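As suggested in the comments, here is a minimal sketch of cross-validation with SMOTE applied only inside each training fold, via imblearn's Pipeline (fold count, scorer, and hyperparameters are illustrative):

    from imblearn.pipeline import Pipeline
    from imblearn.over_sampling import SMOTE
    from sklearn.preprocessing import StandardScaler
    from sklearn.neural_network import MLPClassifier
    from sklearn.model_selection import cross_val_score

    # Scaling and SMOTE are fitted on each training fold only; the validation folds stay untouched
    pipe = Pipeline([
        ('scale', StandardScaler()),
        ('smote', SMOTE(sampling_strategy='all', random_state=0)),
        ('mlp', MLPClassifier(hidden_layer_sizes=(20,), max_iter=2000, random_state=30)),
    ])

    scores = cross_val_score(pipe, X, y, cv=5, scoring='balanced_accuracy')
    print(scores.mean(), scores.std())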

Hope you have already got a better result for your model. I tried changing a few parameters and got an accuracy of 65%; when I changed the split to a 90:10 sample I got an accuracy of 70%. But accuracy can mislead, so I also calculated the F1 score, which gives a better picture of the prediction.

    from sklearn.neural_network import MLPClassifier

    clf = MLPClassifier(hidden_layer_sizes=(1,), verbose=False,
                        learning_rate_init=0.001,
                        max_iter=2000,
                        activation='logistic', solver='sgd', shuffle=True, random_state=50)

    # X_train_res, y_train_res: the SMOTE-resampled training data
    clf.fit(X_train_res, y_train_res)
    y_pred = clf.predict(X_test)

    from sklearn.metrics import accuracy_score, confusion_matrix, classification_report
    score = accuracy_score(y_test, y_pred)
    print(score)
    cr = classification_report(y_test, y_pred)
    print(cr)

Accuracy = 0.65

classification report:

                  precision    recall  f1-score   support

               0       0.82      0.97      0.89        33
               1       0.67      0.31      0.42        13
               2       0.00      0.00      0.00         6
               3       0.00      0.00      0.00         4
               4       0.29      0.80      0.42         5

       micro avg       0.66      0.66      0.66        61
       macro avg       0.35      0.42      0.35        61
    weighted avg       0.61      0.66      0.61        61

confusion_matrix:

    array([[32,  0,  0,  0,  1],
           [ 4,  4,  2,  0,  3],
           [ 1,  1,  0,  0,  4],
           [ 1,  1,  0,  0,  2],
           [ 1,  0,  0,  0,  4]], dtype=int64)
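For reference, the per-class recall (and with it the balanced accuracy) can be read straight off the confusion matrix above; a small sketch:

    import numpy as np

    cm = np.array([[32, 0, 0, 0, 1],
                   [ 4, 4, 2, 0, 3],
                   [ 1, 1, 0, 0, 4],
                   [ 1, 1, 0, 0, 2],
                   [ 1, 0, 0, 0, 4]])

    per_class_recall = cm.diagonal() / cm.sum(axis=1)  # e.g. 32/33 ≈ 0.97 for class 0
    print(per_class_recall)
    print(per_class_recall.mean())  # ≈ 0.42, the macro-average recall (i.e. balanced accuracy)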
SUN