
I am working with an extremely imbalanced dataset of 44 samples in total for my research project. It is a binary classification problem with only 3 of the 44 samples in the minority class, so I am using Leave-One-Out Cross-Validation (LOOCV). If I perform SMOTE oversampling on the entire dataset before the LOOCV loop, prediction accuracy and ROC AUC are close to 90% and 0.9 respectively. However, if I oversample only the training set inside the LOOCV loop, which seems the more logical approach, the ROC AUC falls as low as 0.3.

I also tried precision-recall curves and stratified k-fold cross-validation, but saw a similar gap between the results from oversampling outside and inside the loop. Please suggest the right place to oversample and, if possible, explain the difference.

Oversampling inside the loop:

import pandas as pd
from imblearn.over_sampling import SMOTE
from sklearn import metrics
from sklearn.model_selection import LeaveOneOut
from sklearn.neighbors import KNeighborsClassifier

loo = LeaveOneOut()
i = 0
acc_dec = 0
y_test_dec = []  # Store y_test for every split
y_pred_dec = []  # Store predicted probability of the positive label for every split

for train, test in loo.split(X):    #Leave One Out Cross Validation
    #Create training and test sets for split indices
    X_train = X.loc[train]  
    y_train = Y.loc[train]
    X_test = X.loc[test]
    y_test = Y.loc[test]

    #oversampling minority class using SMOTE technique
    sm = SMOTE(sampling_strategy='minority',k_neighbors=1)
    X_res, y_res = sm.fit_resample(X_train, y_train)

    #KNN
    clf = KNeighborsClassifier(n_neighbors=5)
    clf = clf.fit(X_res, y_res)  # fit on the resampled training split only
    y_pred = clf.predict(X_test)
    acc_dec += metrics.accuracy_score(y_test, y_pred)
    y_test_dec.append(y_test.to_numpy()[0])
    y_pred_dec.append(clf.predict_proba(X_test)[:, 1][0])
    i += 1

# Compute ROC curve and ROC area
fpr, tpr, threshold = metrics.roc_curve(y_test_dec, y_pred_dec, pos_label=1)
roc_auc = metrics.auc(fpr, tpr)
print(str(acc_dec / i * 100) + "%")

AUC: 0.25

Accuracy: 68.1%

Oversampling outside the loop:

acc_dec = 0  # accuracy for the KNN classifier
y_test_dec = []  # Store y_test for every split
y_pred_dec = []  # Store predicted probability of the positive label for every split
i = 0

# Oversampling the entire dataset before the loop
sm = SMOTE(k_neighbors=1)
X, Y = sm.fit_resample(X, Y)
X = pd.DataFrame(X)
Y = pd.DataFrame(Y)
for train, test in loo.split(X):    #Leave One Out Cross Validation

    #Create training and test sets for split indices
    X_train = X.loc[train]  
    y_train = Y.loc[train]
    X_test = X.loc[test]
    y_test = Y.loc[test]

    #KNN
    clf = KNeighborsClassifier(n_neighbors=5)
    clf = clf.fit(X_train, y_train)
    y_pred = clf.predict(X_test)
    acc_dec += metrics.accuracy_score(y_test, y_pred)
    y_test_dec.append(y_test.to_numpy()[0])
    y_pred_dec.append(clf.predict_proba(X_test)[:, 1][0])
    i += 1

# Compute ROC curve and ROC area
fpr, tpr, threshold = metrics.roc_curve(y_test_dec, y_pred_dec, pos_label=1)
roc_auc = metrics.auc(fpr, tpr)
print(str(acc_dec / i * 100) + "%")

AUC: 0.99

Accuracy: 90.24%

How can these two approaches lead to such different results? Which one should I follow?


1 Answer


Doing upsampling (like SMOTE) before you split your data means information is shared between the training and test sets: each synthetic sample is an interpolation of original samples, so points in your test set can be near-duplicates of points the model was trained on. This is sometimes called "leakage", and it inflates your metrics. Your first setup (oversampling inside the loop) is, unfortunately, the correct one.

Here's a post walking through this problem.
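As a minimal sketch of the correct (inside-the-loop) setup, assuming X and Y still hold your original, un-resampled data: imblearn's Pipeline applies SMOTE only during fit, so each LOOCV training fold is resampled while the held-out sample is left untouched.

from imblearn.over_sampling import SMOTE
from imblearn.pipeline import Pipeline
from sklearn import metrics
from sklearn.model_selection import LeaveOneOut, cross_val_predict
from sklearn.neighbors import KNeighborsClassifier

y = Y.to_numpy().ravel()  # flatten in case Y is a single-column DataFrame

# The pipeline calls fit_resample only inside fit(), so SMOTE never sees the held-out sample
pipe = Pipeline([
    ('smote', SMOTE(sampling_strategy='minority', k_neighbors=1)),
    ('knn', KNeighborsClassifier(n_neighbors=5)),
])

# One positive-class probability per sample, each from a model that never saw that sample
proba = cross_val_predict(pipe, X, y, cv=LeaveOneOut(), method='predict_proba')[:, 1]
fpr, tpr, _ = metrics.roc_curve(y, proba, pos_label=1)
print(metrics.auc(fpr, tpr))

Expect numbers close to your first setup. With only 3 positives out of 44 samples they will be noisy, but they are the honest estimate; the 0.99 AUC comes from evaluating on near-duplicates of training points.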
