
I have an imbalanced dataset with 43,323 rows, of which only 9 belong to the 'failure' class; the rest belong to the 'normal' class. I trained a classifier that achieves 100% recall and 94.89% AUC on the test data (0.75/0.25 split with stratify=y). However, the classifier has only 0.18% precision and a 0.37% F1 score. I assumed I could get a better F1 score by changing the decision threshold, but I failed (I checked thresholds between 0 and 1 with step = 0.01). It also seems weird to me, since with imbalanced datasets it is usually hard to get high recall. The goal is a better F1 score. What should I try next? Thanks!

(To be clear, I used SMOTE to upsample the failure samples in the training set.)
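
For reference, a minimal sketch of this kind of setup (illustrative only: imblearn's SMOTE is assumed, and RandomForestClassifier is a placeholder since the actual model isn't stated in the question):

from imblearn.over_sampling import SMOTE
from sklearn.ensemble import RandomForestClassifier  # placeholder; actual model not stated
from sklearn.model_selection import train_test_split

# 0.75/0.25 stratified split, as described in the question
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=42
)
# Oversample only the training split; k_neighbors must be smaller than the
# number of minority rows in train (~7 here, given 9 failures in total)
X_res, y_res = SMOTE(k_neighbors=3, random_state=42).fit_resample(X_train, y_train)
model = RandomForestClassifier(random_state=42).fit(X_res, y_res)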

ERIC_STAR

1 Answer


Getting 100% recall is in fact trivial: just classify everything as 1. With only a couple of true failures in the test split (9 × 0.25 ≈ 2), such a classifier also drives precision toward zero, which is consistent with the near-zero precision you report.

Is the precision-recall curve any good? Perhaps a more thorough threshold scan could yield a better result:

import numpy as np
from sklearn.metrics import precision_recall_curve

# predict_proba returns one column per class; keep the positive-class column
probabilities = model.predict_proba(X_test)[:, 1]
precision, recall, thresholds = precision_recall_curve(y_test, probabilities)
# thresholds has one fewer entry than precision/recall, so drop the last point
f1_scores = 2 * precision[:-1] * recall[:-1] / (precision[:-1] + recall[:-1] + 1e-12)
best_f1 = np.max(f1_scores)
best_thresh = thresholds[np.argmax(f1_scores)]
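
To use the tuned threshold, apply it to the probabilities directly instead of calling model.predict (which uses the 0.5 default):

from sklearn.metrics import f1_score

y_pred = (probabilities >= best_thresh).astype(int)
print(f1_score(y_test, y_pred))  # should roughly match best_f1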
dx2-66
  • I also plotted the precision-recall curve, and the result is very bad: the area under it is almost 0. One question in my mind: I used 3-fold cross-validation and the CV performance is good (the training set is up-sampled using SMOTE), but the performance on the test set is very bad. (I understand the cross-validation performance might be a bit optimistic, but in this case the performance dropped dramatically.) I have no idea how to proceed. – ERIC_STAR Oct 21 '22 at 10:12
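
One common cause of such a CV-vs-test gap (an assumption here, not confirmed in the thread): if SMOTE runs on the full training set before the folds are split, synthetic copies of the minority samples leak into the validation folds and inflate the CV scores. An imblearn Pipeline re-fits SMOTE inside each fold instead:

from imblearn.over_sampling import SMOTE
from imblearn.pipeline import Pipeline
from sklearn.ensemble import RandomForestClassifier  # placeholder model
from sklearn.model_selection import cross_val_score

# SMOTE is fit on each fold's training portion only, so the validation
# fold contains no synthetic samples; with ~4-5 minority rows per fold,
# k_neighbors has to stay below that count
pipe = Pipeline([
    ("smote", SMOTE(k_neighbors=2, random_state=42)),
    ("clf", RandomForestClassifier(random_state=42)),
])
scores = cross_val_score(pipe, X_train, y_train, cv=3, scoring="f1")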