-2

I have a dataset with a lot of missing data. There are around 20,000 records for training and 5,000 records for testing, on which the model's performance is validated. The model has around 120 features. I identified clusters in the data based on certain features and replaced missing values with the median within each cluster, so most of the missing values are treated. Where I could not find a cluster, I replaced missing values with zero.

I tested model performance on this data: random forest and XGBoost perform almost the same, with XGBoost about 0.5% higher in accuracy. I tried selecting the best features with RFE and found that the maximum accuracy I could obtain is 80%. I also observed that training accuracy is 80% and validation accuracy is 100%. How can I reduce the overfitting of the model? Is my missing-data imputation being done wrongly? I know the model accuracy can go up to 90%, but I am not sure what I am doing wrong here. What should I do to boost my accuracy?
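For reference, the imputation described above can be sketched roughly as follows. This is only an illustration under assumptions: the column names `segment` and `income` and the helper functions are hypothetical, and a single categorical column stands in for the cluster. The medians are learned from the training set only and then reused on the test set, so no test information leaks into the imputation.

```python
import numpy as np
import pandas as pd

def fit_group_medians(train, group_col, value_cols):
    """Learn per-group medians from the training data only."""
    return train.groupby(group_col)[value_cols].median()

def apply_group_medians(df, medians, group_col, value_cols, fallback=0.0):
    """Fill missing values with the training median of each row's group.

    Rows whose group is unknown, or whose group has no median,
    fall back to a constant (zero here, as in the question).
    """
    out = df.copy()
    for col in value_cols:
        # Map each row's group label to the median learned on the training set.
        group_fill = out[group_col].map(medians[col])
        out[col] = out[col].fillna(group_fill).fillna(fallback)
    return out

# Hypothetical toy data: "segment" plays the role of the cluster.
train = pd.DataFrame({
    "segment": ["a", "a", "a", "b", "b", "c"],
    "income": [10.0, np.nan, 30.0, 100.0, np.nan, np.nan],
})
test = pd.DataFrame({
    "segment": ["a", "b", "d"],
    "income": [np.nan, np.nan, np.nan],
})

medians = fit_group_medians(train, "segment", ["income"])
train_filled = apply_group_medians(train, medians, "segment", ["income"])
test_filled = apply_group_medians(test, medians, "segment", ["income"])
print(train_filled)
print(test_filled)
```

Segment "c" has no observed values in training and segment "d" never appears there, so both end up with the zero fallback.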

behappy

1 Answer

0

More data, feature selection, feature engineering... Look at your data and fill in the missing fields; you may find new correlations in the data. There's no simple answer. Be creative.
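One way to check whether feature selection actually helps is to run it inside cross-validation and compare training and validation accuracy on each fold. The sketch below is only an illustration on synthetic data from `make_classification`; the estimator settings and the number of selected features are arbitrary assumptions, not values taken from the question.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFE
from sklearn.model_selection import cross_validate
from sklearn.pipeline import Pipeline

# Synthetic stand-in for the real data (an assumption for illustration).
X, y = make_classification(
    n_samples=300, n_features=20, n_informative=5, random_state=0
)

# Keeping RFE inside the pipeline means features are selected on the
# training folds only, so the validation folds stay unseen.
pipeline = Pipeline([
    ("select", RFE(
        RandomForestClassifier(n_estimators=25, random_state=0),
        n_features_to_select=8,
        step=4,
    )),
    ("model", RandomForestClassifier(
        n_estimators=50, max_depth=5, random_state=0
    )),
])

scores = cross_validate(
    pipeline, X, y, cv=3, scoring="accuracy", return_train_score=True
)
train_acc = scores["train_score"].mean()
val_acc = scores["test_score"].mean()
print(f"mean training accuracy:   {train_acc:.3f}")
print(f"mean validation accuracy: {val_acc:.3f}")
```

A training score far above the validation score points to overfitting; if both are similar and low, more informative features or more data are likely to matter more than tuning.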

newblack