In my datasets, I have one variable which contains 30% missing value.
I am trying to use tree based model but not getting clear picture how to implement it.
data['X'].value_counts()
OUTPUT-----
? 39454
MC 32223
HM 6197
SP 4892
BC 4569
MD 3473
CP 2493
UN 2366
CM 1932
OG 1020
PO 585
DM 536
CH 145
WC 130
OT 94
MP 79
SI 52
FR 1
The approach which I am trying to implement is:
Suppose this variable has 24 distinct category. And the above is the value count output. ? is the missing value and I should impute the value among rest of the value mentioned with the help of tree based model.
Different categories are MC HM SP BC MD CP UN CM OG PO DM CH WC OT MP ST FR ? and count of ? is 39454. So we have 39454 missing values that we should impute with the help of tree based model
Now, with the help of above values I have to train a model and predict the missing value.