
I have a dataset with many features (mostly categorical Yes/No features) and a lot of missing values.

One technique for dimensionality reduction is to generate a large, carefully constructed set of trees against a target attribute and then use each attribute's usage statistics to find the most informative subset of features. That is, we generate a large set of very shallow trees, each trained on a small fraction of the total number of attributes. If an attribute is often selected as the best split, it is most likely an informative feature to retain.
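A minimal sketch of that idea, assuming a numerically encoded feature matrix `X` and target `y` already exist (the `top_k` cutoff and the scikit-learn `RandomForestClassifier` are just one possible way to realize the shallow-trees-on-feature-subsets setup):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Many very shallow trees, each split considering only a small random
# subset of the features.
forest = RandomForestClassifier(
    n_estimators=500,   # large set of trees
    max_depth=2,        # very shallow
    max_features=0.1,   # each split looks at ~10% of the features
    random_state=0,
)
forest.fit(X, y)

# Features that are frequently chosen as split candidates accumulate
# higher importance scores.
importances = forest.feature_importances_
top_k = 20                                   # keep the k most informative features
keep = np.argsort(importances)[::-1][:top_k]
X_reduced = X[:, keep]
```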

I am also using an imputer to fill the missing values.
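For reference, the imputation step might look roughly like this (`df` is a placeholder DataFrame; `most_frequent` is one strategy that works for Yes/No columns):

```python
import pandas as pd
from sklearn.impute import SimpleImputer

# Fill each missing entry with the most frequent value of its column.
imputer = SimpleImputer(strategy="most_frequent")
df_imputed = pd.DataFrame(imputer.fit_transform(df), columns=df.columns)
```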

My question is about the order of the above two steps. Which of the two (dimensionality reduction or imputation) should be done first, and why?

Karup

1 Answer


From a mathematical perspective you should always avoid data imputation (in the sense of using it only if you have to). In other words, if you have a method that can work with missing values, use it; if you do not, you are left with data imputation.
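As one concrete example of a method that works with missing values directly (assuming `X` is numeric with missing entries encoded as `np.nan`, and `y` is the target), scikit-learn's `HistGradientBoostingClassifier` handles NaN natively at each split, so no separate imputation step is required:

```python
from sklearn.ensemble import HistGradientBoostingClassifier

# NaN entries are treated as a first-class value when choosing splits,
# so the model is trained on the data as-is, without imputation.
model = HistGradientBoostingClassifier(random_state=0)
model.fit(X, y)
```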

Data imputation is nearly always heavily biased; this has been shown many times, and I believe I even read a paper on it that is ~20 years old. In general, to do statistically sound data imputation you need to fit a very good generative model. Simply imputing the most common value, the mean, etc. makes assumptions about the data of similar strength to those of Naive Bayes.

lejlot
  • The answer does not say you should **never** do it. It says "if you have a method which can work with missing values - use it"; if you do not have such a method, data imputation might be the only way. "To avoid" does not mean "never use", but I will rephrase the first sentence to make it clear. – lejlot Jun 02 '16 at 19:40