0

In my datasets, I have one variable which contains 30% missing value.

I am trying to use tree based model but not getting clear picture how to implement it.

data['X'].value_counts()
OUTPUT-----
?     39454
MC    32223
HM     6197
SP     4892
BC     4569
MD     3473
CP     2493
UN     2366
CM     1932
OG     1020
PO      585
DM      536
CH      145
WC      130
OT       94
MP       79
SI       52
FR        1 

The approach which I am trying to implement is:

Suppose this variable has 24 distinct category. And the above is the value count output. ? is the missing value and I should impute the value among rest of the value mentioned with the help of tree based model.

Different categories are MC HM SP BC MD CP UN CM OG PO DM CH WC OT MP ST FR ? and count of ? is 39454. So we have 39454 missing values that we should impute with the help of tree based model

Now, with the help of above values I have to train a model and predict the missing value.

halfer
  • 19,824
  • 17
  • 99
  • 186
user2986845
  • 75
  • 1
  • 2
  • 8

1 Answers1

2

I would recommend below :

  1. Take non-missing data and perform clustering
  2. assign labels for missing data using appropriate cluster
Dharman
  • 30,962
  • 25
  • 85
  • 135