
I have read the question "replace missing values in categorical data", which covers handling missing values in categorical data.

My dataset has about 6 categorical columns with missing values. It will be used for a binary classification problem.

I see different approaches: one is to leave the missing values in the categorical column as they are, another is to impute them using from sklearn.preprocessing import Imputer. I am unsure which is the better option.

If imputing is the better option, which libraries could I use before applying a model such as LR, Decision Tree, or RandomForest?

Thanks!

pc_pyr
    There is no general answer; it depends on the model and the dataset (e.g. XGBoost handles missing values out of the box). – avvinci May 18 '20 at 18:35

2 Answers


There are multiple ways to handle missing data:

  • Some models take care of it themselves (XGBoost and LightGBM, for example).
  • You can try to impute the values with a model. Split your data into a train and a test set, and try different models to measure which one works best. More often than not, though, it doesn't work very well. There is a KNNImputer implemented in sklearn.
  • You can also define rules: set missing values to 0, the mean, the median or whatever works, depending on your dataset. There is a SimpleImputer implemented in sklearn.
  • If none of the above works for you, you can also drop the rows with missing values.

More details on imputing values in sklearn: https://scikit-learn.org/stable/modules/impute.html
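As a rough illustration of the rule-based options above, here is a minimal sketch using SimpleImputer on categorical columns. The toy DataFrame, its column names, and the "missing" placeholder label are assumptions made up for the example; they are not taken from the question's dataset.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical toy data: two categorical columns with gaps.
df = pd.DataFrame(
    {
        "color": ["red", "blue", np.nan, "red", "red"],
        "size": ["S", np.nan, "M", "M", np.nan],
    },
    dtype=object,
)

# Rule 1: fill each column with its most frequent category.
mode_imputer = SimpleImputer(strategy="most_frequent")
filled_mode = pd.DataFrame(mode_imputer.fit_transform(df), columns=df.columns)

# Rule 2: treat "missing" as a category of its own (placeholder label is an assumption).
const_imputer = SimpleImputer(strategy="constant", fill_value="missing")
filled_const = pd.DataFrame(const_imputer.fit_transform(df), columns=df.columns)

# Last resort: drop every row that has at least one missing value.
dropped = df.dropna()

print(filled_mode)
print(filled_const)
print(dropped)
```

In a real workflow the imputer would normally be fitted on the training split only and then applied to the test split with transform, so that no information from the test set leaks into the imputation.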

CoMartel

Adding to @CoMartel,

  1. There is no specific rule that guarantees good results. You need to try the known approaches one by one and observe your model's performance.

  2. If the ratio of missing values in a column is very high (say, more than 50% of the total rows, though the threshold can vary), you are better off dropping that column.

  3. If the missing data is categorical, avoid the mean. Suppose you encoded one category as 1 and the other as 2: the mean could come out as 1.5, which does not represent any category. The mode is a better option than the mean or the median.
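Points 2 and 3 can be sketched with pandas as follows. The column names, the toy values, and the 0.5 threshold are assumptions chosen for illustration only.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: one column that is mostly empty,
# and one categorical column already encoded as 1 / 2.
df = pd.DataFrame(
    {
        "mostly_missing": [np.nan, np.nan, np.nan, "a", np.nan, np.nan],
        "category_code": [1, 2, 2, np.nan, 2, 1],
    }
)

# Point 2: drop columns whose share of missing values exceeds a threshold.
threshold = 0.5  # assumption; tune for your dataset
missing_ratio = df.isna().mean()
kept = df.loc[:, missing_ratio <= threshold]

# Point 3: the mean of encoded categories is not a valid category,
# while the mode always is.
codes = kept["category_code"]
mean_fill = codes.mean()           # falls between the codes 1 and 2
mode_fill = codes.mode().iloc[0]   # the most frequent code
imputed = codes.fillna(mode_fill)

print(missing_ratio)
print(mean_fill, mode_fill)
print(imputed.tolist())
```

Here the mean lands between the two codes, so filling with it would create a value that matches neither category, whereas the mode is always one of the existing codes.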

Mehul Gupta