
I just worked on the "Heart Failure Prediction" dataset from Kaggle (https://www.kaggle.com/andrewmvd/heart-failure-clinical-data).

I noticed that the number of "not dead" records was higher than the number of "dead" ones, so I used SMOTETomek to resample my data, then computed the accuracy and printed the confusion matrix, which showed better results than before.

df.DEATH_EVENT.value_counts()

0    202
1     95
Name: DEATH_EVENT, dtype: int64
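
For reference, the dataframe and the X/y used below were set up roughly like this (a sketch; the CSV filename and the choice to use every remaining column as a feature are assumptions):

# Setup sketch -- filename and feature selection are assumptions.
import pandas as pd

df = pd.read_csv("heart_failure_clinical_records_dataset.csv")

X = df.drop(columns=["DEATH_EVENT"])   # all clinical columns as features
y = df["DEATH_EVENT"]                  # binary target

print(y.value_counts())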

accuracy and confusion matrix: before

0.7888888888888889
[[130  30]
[  8  12]]
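
For reference, a rough sketch of the kind of evaluation behind these numbers (the classifier and the split size here are placeholders, not necessarily what I actually used):

# Evaluation sketch -- classifier choice and 30% test split are assumptions.
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

clf = RandomForestClassifier(random_state=42)
clf.fit(X_train, y_train)

y_pred = clf.predict(X_test)
print(accuracy_score(y_test, y_pred))
print(confusion_matrix(y_test, y_pred))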

The resampling code:

from imblearn.combine import SMOTETomek

smt = SMOTETomek(random_state=42)
X_res, y_res = smt.fit_resample(X, y)   # SMOTE oversampling + Tomek-link cleaning
pd.DataFrame(y_res)['DEATH_EVENT'].value_counts()

1    155
0    155
Name: DEATH_EVENT, dtype: int64

accuracy and confusion matrix: after

0.912
[[53  5]
[ 6 61]]

But this was only a small sample.

In your experience, does using oversampling or undersampling approaches lead to better results in general? Or do we get somewhat misleading results because of it, so that the model won't perform as well in the real world?
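
To make my concern concrete: if the resampling happens before the train/test split, synthetic points generated near test samples can end up in the training set and inflate the test metrics. A sketch of a leakage-free setup would resample only the training fold, e.g. with imblearn's Pipeline (the classifier choice is again just a placeholder):

# Leakage-free sketch: SMOTETomek runs only on the training fold inside an
# imblearn Pipeline, so the test set keeps its original class distribution.
from imblearn.combine import SMOTETomek
from imblearn.pipeline import Pipeline
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=42
)

pipe = Pipeline([
    ("resample", SMOTETomek(random_state=42)),
    ("clf", RandomForestClassifier(random_state=42)),
])
pipe.fit(X_train, y_train)       # resampling is applied to the training data only

y_pred = pipe.predict(X_test)    # the test set is never resampled
print(accuracy_score(y_test, y_pred))
print(confusion_matrix(y_test, y_pred))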

  • Since you're not asking about anything related to coding, I think you're better off asking this question on another site. Besides, there are already great discussions out there on the internet. – sander May 06 '21 at 07:57
  • I guess you're right, and when I thought about it I got the answer: when we oversample, we basically create more points (near already available points), so it is clear that it can lead to data leakage... – Jack Froster May 12 '21 at 00:43

0 Answers