I have a dataset with a lot of missing data. There are around 20,000 records for training and 5,000 records for testing, on which the model's performance is validated. The dataset has around 120 features. I identified clusters in the data based on certain features and replaced missing values with the median within each cluster, so most of the missing values are handled. Where I could not find a cluster, I replaced missing values with zero.

I tested model performance: random forest and XGBoost perform almost the same on this data, with XGBoost about 0.5% higher in accuracy. I tried selecting the best features with RFE and found that the maximum accuracy I could obtain is 80%. I also observed that training accuracy is 80% and validation accuracy is 100%.

How can I reduce the overfitting of the model? Is my missing-data imputation being done wrongly? I know the model accuracy can go up to 90%, so I am not sure what I am doing wrong. What should be done to boost my accuracy?
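For reference, the imputation described above could look roughly like the sketch below. It assumes pandas, and the column names `cluster` and `x` and the helper names `fit_group_medians` and `impute` are hypothetical, not taken from the actual project. The medians are learned from the training rows only and then reused on the test rows, which is one way to keep test data from influencing the imputation.

```python
import numpy as np
import pandas as pd


def fit_group_medians(train: pd.DataFrame, group_col: str, value_cols: list[str]) -> pd.DataFrame:
    """Learn per-cluster medians from the training rows only."""
    return train.groupby(group_col)[value_cols].median()


def impute(df: pd.DataFrame, medians: pd.DataFrame, group_col: str, value_cols: list[str]) -> pd.DataFrame:
    """Fill missing values with the cluster median, falling back to zero."""
    out = df.copy()
    for col in value_cols:
        # Look up each row's cluster median; rows with an unknown or
        # missing cluster get NaN here and fall through to the zero fill.
        fill = out[group_col].map(medians[col])
        out[col] = out[col].fillna(fill).fillna(0.0)
    return out


# Toy data (hypothetical): the last training row has no cluster,
# and cluster "c" appears only in the test set.
train = pd.DataFrame({
    "cluster": ["a", "a", "a", "b", "b", None],
    "x": [1.0, 3.0, np.nan, 10.0, np.nan, np.nan],
})
test = pd.DataFrame({"cluster": ["a", "c"], "x": [np.nan, np.nan]})

medians = fit_group_medians(train, "cluster", ["x"])
train_filled = impute(train, medians, "cluster", ["x"])
test_filled = impute(test, medians, "cluster", ["x"])
print(train_filled)
print(test_filled)
```

The zero fallback mirrors what is described in the question; whether zero is a sensible value for a given feature depends on that feature's scale.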
1 Answer
More data, feature selection, feature engineering... Look at your data, fill in the missing fields, and maybe you will find new correlations in the data. There's no simple answer. Be creative.
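Before trying any of these, it can help to measure how far apart training and validation accuracy really are. Below is a minimal sketch using scikit-learn's `cross_validate`. The synthetic data from `make_classification` is only a stand-in for the real dataset, and the `max_depth` and `min_samples_leaf` values are illustrative assumptions, not recommended settings.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_validate

# Synthetic stand-in data (an assumption; substitute the real features and labels).
X, y = make_classification(n_samples=500, n_features=20, n_informative=5, random_state=0)

# max_depth and min_samples_leaf limit tree complexity; the values are illustrative only.
model = RandomForestClassifier(
    n_estimators=50, max_depth=5, min_samples_leaf=5, random_state=0
)

# 5-fold cross-validation, also recording accuracy on the training folds.
scores = cross_validate(model, X, y, cv=5, scoring="accuracy", return_train_score=True)
train_acc = scores["train_score"].mean()
valid_acc = scores["test_score"].mean()

print(f"train accuracy: {train_acc:.3f}")
print(f"validation accuracy: {valid_acc:.3f}")
print(f"gap: {train_acc - valid_acc:.3f}")
```

A large gap between the two numbers usually points to overfitting, and constraining the trees is one thing to try. If both numbers are low, better features or more data are more likely to help.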

newblack