I am analyzing the effect of SMOTE on the performance of Random Forest and Logistic Regression. I have the following data from Kaggle. The data consists of around 50,000 observations and 58 variables. I trained four models on it:
- Random Forest
- Random Forest with SMOTE
- Logistic Regression
- Logistic Regression with SMOTE
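
A minimal sketch of such a four-way comparison, not the exact code used here. It assumes scikit-learn and NumPy, uses a synthetic imbalanced dataset from `make_classification` in place of the Kaggle data, and uses a hypothetical `simple_smote` helper that is a simplified stand-in for SMOTE (in practice the `SMOTE` class from imbalanced-learn would be the usual choice). All hyperparameters are assumptions. The oversampling is applied to the training split only, so the test set keeps its original class balance.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.neighbors import NearestNeighbors


def simple_smote(X, y, minority_label=1, k=5, seed=0):
    """Simplified SMOTE (illustrative helper, not a library function):
    create synthetic minority rows by interpolating between a minority
    point and one of its k nearest minority neighbours, until both
    classes have the same number of rows."""
    rng = np.random.default_rng(seed)
    X_min = X[y == minority_label]
    n_new = int((y != minority_label).sum() - len(X_min))
    if n_new <= 0:
        return X, y
    # Ask for k + 1 neighbours because each point is returned as its own neighbour.
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X_min)
    neighbours = nn.kneighbors(X_min, return_distance=False)[:, 1:]
    base = rng.integers(0, len(X_min), size=n_new)             # starting points
    picked = neighbours[base, rng.integers(0, k, size=n_new)]  # one neighbour each
    gap = rng.random((n_new, 1))                               # position on the segment
    X_new = X_min[base] + gap * (X_min[picked] - X_min[base])
    y_new = np.full(n_new, minority_label, dtype=y.dtype)
    return np.vstack([X, X_new]), np.concatenate([y, y_new])


def compare_models(seed=0):
    # Synthetic stand-in for the real data: roughly 90% / 10% class split.
    X, y = make_classification(n_samples=2000, n_features=20, n_informative=5,
                               weights=[0.9, 0.1], random_state=seed)
    X_tr, X_te, y_tr, y_te = train_test_split(
        X, y, test_size=0.3, stratify=y, random_state=seed)

    # Oversample the training split only; the test split stays untouched.
    X_sm, y_sm = simple_smote(X_tr, y_tr, seed=seed)

    models = {
        "Random Forest": RandomForestClassifier(n_estimators=100, random_state=seed),
        "Logistic Regression": LogisticRegression(max_iter=1000),
    }
    predictions = {}
    for name, model in models.items():
        predictions[name] = model.fit(X_tr, y_tr).predict(X_te)
        predictions[name + " with SMOTE"] = model.fit(X_sm, y_sm).predict(X_te)
    return y_sm, y_te, predictions
```

Calling `compare_models()` returns the resampled training labels, the test labels, and one prediction array per model, which can then be scored with whatever metric is being compared.
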
I got the following results:
$$\text{G-mean} = \sqrt{\text{Sensitivity} \times \text{Specificity}}$$
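
Assuming the score reported above is the G-mean, that is, the geometric mean of sensitivity and specificity, it can be computed from true and predicted labels as in the sketch below. The function name `g_mean` and the convention that label `1` is the positive (minority) class are assumptions made for illustration.

```python
import math


def g_mean(y_true, y_pred, positive=1):
    """Geometric mean of sensitivity and specificity for binary labels."""
    tp = fn = tn = fp = 0
    for truth, pred in zip(y_true, y_pred):
        if truth == positive:
            if pred == positive:
                tp += 1
            else:
                fn += 1
        else:
            if pred == positive:
                fp += 1
            else:
                tn += 1
    # Guard against a class that is absent from y_true.
    sensitivity = tp / (tp + fn) if tp + fn else 0.0
    specificity = tn / (tn + fp) if tn + fp else 0.0
    return math.sqrt(sensitivity * specificity)


# A model that always predicts the majority class on a 90/10 split
# is 90% accurate but has zero sensitivity, so its G-mean is 0.
always_negative = g_mean([1] * 10 + [0] * 90, [0] * 100)
```

This is why the metric is sensitive to class imbalance: a model that ignores the minority class scores 0 regardless of how well it does on the majority class.
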
Question: Why does Logistic Regression improve so much with SMOTE, while Random Forest improves much less?
My thought was that it may be due to the high dimensionality, but I would expect the Random Forest to do better than the Logistic Regression.