
I have a dataset with many features (mostly categorical Yes/No features) and a lot of missing values.

One technique for dimensionality reduction is to generate a large, carefully constructed set of trees against a target attribute and then use each attribute's usage statistics to find the most informative subset of features. That is, we generate a large set of very shallow trees, each trained on a small fraction of the total number of attributes. If an attribute is often selected as the best split, it is most likely an informative feature to retain.
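A minimal sketch of that idea, assuming a numerically encoded feature matrix `X` and target `y` already exist (the `top_k` cutoff and the scikit-learn `RandomForestClassifier` are just one possible way to realize the shallow-trees-on-feature-subsets setup):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Many very shallow trees, each split considering only a small random
# subset of the features.
forest = RandomForestClassifier(
    n_estimators=500,   # large set of trees
    max_depth=2,        # very shallow
    max_features=0.1,   # each split looks at ~10% of the features
    random_state=0,
)
forest.fit(X, y)

# Features that are frequently chosen as split candidates accumulate
# higher importance scores.
importances = forest.feature_importances_
top_k = 20                                   # keep the k most informative features
keep = np.argsort(importances)[::-1][:top_k]
X_reduced = X[:, keep]
```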

I am also using an imputer to fill the missing values.
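For reference, the imputation step might look roughly like this (`df` is a placeholder DataFrame; `most_frequent` is one strategy that works for Yes/No columns):

```python
import pandas as pd
from sklearn.impute import SimpleImputer

# Fill each missing entry with the most frequent value of its column.
imputer = SimpleImputer(strategy="most_frequent")
df_imputed = pd.DataFrame(imputer.fit_transform(df), columns=df.columns)
```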

My question is about the order of the above two steps. Which of the two (dimensionality reduction or imputation) should be done first, and why?

Karup

1 Answer


From a mathematical perspective you should always avoid data imputation (in the sense of using it only if you have to). In other words, if you have a method that can work with missing values, use it; if you do not, you are left with data imputation.
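As one concrete example of a method that works with missing values directly (assuming `X` is numeric with missing entries encoded as `np.nan`, and `y` is the target), scikit-learn's `HistGradientBoostingClassifier` handles NaN natively at each split, so no separate imputation step is required:

```python
from sklearn.ensemble import HistGradientBoostingClassifier

# NaN entries are treated as a first-class value when choosing splits,
# so the model is trained on the data as-is, without imputation.
model = HistGradientBoostingClassifier(random_state=0)
model.fit(X, y)
```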

Data imputation is nearly always heavily biased; this has been shown many times, and I believe I even read a paper on it that is ~20 years old. In general, to do statistically sound data imputation you need to fit a very good generative model. Simply imputing the most common value, the mean, etc. makes assumptions about the data of similar strength to those of Naive Bayes.

lejlot
  • The answer does not say you should **never** do it. It says "if you have a method which can work with missing values - use it"; if you do not have such a method, data imputation might be the only way. "To avoid" does not mean "never use", but I will rephrase the first sentence to make it clear. – lejlot Jun 02 '16 at 19:40