I have a classification project where some of the columns/features have more than 90% null values. How do I handle them?

Question

In my classification problem, about 5 of the 85 features have mostly null values (>90%). How do I handle these values? Do I:

1) Ignore these columns/features altogether

2) Try to impute these values and, if so, how?

3) Any other method?

I am starting with random forests and am new to the method. Does a random forest handle null values by itself? If so, how does it do this, and how can I implement it? Where can I learn more about this? Any references would be very welcome.

Thanks in advance.

This is not a good question for SO, as it a) is not about programming and b) is way too broad. I'd suggest trying it out yourself (it would be very easy to see if RF handles null values just by running it) and also removing this post and asking a more focused question at [Cross Validated](https://stats.stackexchange.com) — Tchotchke, May 08 '17 at 19:53

score 0 · Answer 1 · answered Jun 04 '17 at 19:36

Have you tried running a neural network on your dataset even though some feature values are missing? A neural network does not need every feature to be present.

You can simply set all missing feature values to 0 for the neural network, since a neural network sees no difference between an input of 0 and a missing feature. Why not, you ask? If you set an input value to 0, every connection from that input node carries a value of 0 as well, so it adds nothing to the hidden neurons connected to that input node.
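To make the argument above concrete, here is a minimal NumPy sketch. The layer sizes, the weights `W`, the biases `b` and the sample `x` are all made up for illustration. It only checks that a zero-filled input produces the same hidden-layer pre-activation as removing that input node altogether.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical tiny layer: 3 input features feeding 4 hidden neurons.
W = rng.normal(size=(3, 4))   # input-to-hidden weights
b = rng.normal(size=4)        # hidden biases

x = np.array([1.5, np.nan, -0.7])      # the second feature is missing
x_filled = np.nan_to_num(x, nan=0.0)   # replace the missing value with 0

# Hidden-layer pre-activation with the zero-filled input.
z_filled = x_filled @ W + b

# Same computation with the second input node removed entirely.
keep = [0, 2]
z_dropped = x[keep] @ W[keep, :] + b

# The zero-filled input contributes nothing to any hidden neuron.
same = np.allclose(z_filled, z_dropped)
print(same)
```

One caveat with this approach: after filling, a genuine 0 and a missing value are indistinguishable, which matters if 0 is a meaningful value for that feature.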

But before you try, ask yourself this question: if a feature is missing that often, is it of any importance to the prediction?
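One way to start answering that question is to measure how sparse each column really is. The pandas sketch below is only an illustration: the toy data, the column names and the 0.9 threshold are all assumptions. It finds the mostly-null columns, then either drops them or keeps only a 0/1 "was missing" flag for each.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: "sparse" is 95% null, "dense" is fully observed.
df = pd.DataFrame({
    "dense": [float(i) for i in range(20)],
    "sparse": [np.nan] * 19 + [1.0],
    "label": [0] * 10 + [1] * 10,
})

features = df.drop(columns="label")

# Fraction of null values per feature column.
null_frac = features.isna().mean()

# Columns whose null fraction exceeds the assumed threshold.
threshold = 0.9
mostly_null = null_frac[null_frac > threshold].index.tolist()

# Option A: drop the mostly-null columns.
dropped = features.drop(columns=mostly_null)

# Option B: replace each mostly-null column with a 0/1 missingness flag.
flags = features[mostly_null].isna().astype(int).add_suffix("_missing")
with_flags = pd.concat([dropped, flags], axis=1)

print(null_frac)
print(with_flags.columns.tolist())
```

Whether a flag column carries any signal still has to be checked against the labels, for example by comparing model performance with and without it.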
