How do you deal with missing data when it's missing like 60%?

Question

My data has a lot of missing values, and I have to predict those values. One option is to fill them in with the average of the observed values, but I would like to hear another perspective. How do experienced data scientists handle this kind of problem?

I'm not an experienced data scientist but I'd try to understand what the data means and what values are to be expected, i.e. design a model. — Hans-Martin Mosner, Jul 23 '19 at 06:32
This question is probably better suited for datascience.stackexchange.com — Psychotechnopath, May 04 '20 at 21:36

score -1 · Answer 1 · answered Jul 24 '19 at 14:42

Are your missing values categorical or continuous?

One way is to remove the affected samples entirely. However, this may lead to sampling bias, since the values could be missing because of some causal effect; that is, they are not missing completely at random. With 60% of a column missing, it also means discarding most of your rows.

If your data has enough other variables, you can treat the column with missing values as the output: fit a predictive model on the rows where it is present, then use it to estimate the missing values from the explanatory variables you already have, and hope that the estimates are faithful.
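A minimal sketch of this idea, assuming a single explanatory variable and a plain least-squares line (which is also what the comment below this answer is aiming for). The column names `area` and `price` and the helper names are made up for illustration; a real dataset would usually need more predictors and a check that the fit is any good.

```python
import math

def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

def regression_impute(xs, ys):
    """Fit on rows where y is present, then predict y where it is NaN."""
    observed = [(x, y) for x, y in zip(xs, ys) if not math.isnan(y)]
    slope, intercept = fit_line([x for x, _ in observed],
                                [y for _, y in observed])
    return [intercept + slope * x if math.isnan(y) else y
            for x, y in zip(xs, ys)]

nan = float("nan")
area = [50.0, 60.0, 70.0, 80.0, 90.0]      # hypothetical predictor, fully observed
price = [100.0, nan, 140.0, nan, 180.0]    # hypothetical target with gaps
filled = regression_impute(area, price)
print(filled)
```

Note that the model is fitted only on rows where the target is present, so no rows need to be dropped from the final dataset.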

Filling with the most frequent value, the median, or the mean, as you point out, could also be an option. However, be careful with outliers when averaging, as these can have a tremendous effect on the mean.
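To make the outlier warning concrete, here is a small sketch using only the standard library. The `impute` helper and the `incomes` values are invented for illustration; the point is just to compare what each statistic fills in when one observed value is extreme.

```python
import math
import statistics

def impute(values, strategy="median"):
    """Return a copy of `values` with NaN replaced by a summary statistic
    computed from the observed (non-NaN) entries."""
    observed = [v for v in values if not math.isnan(v)]
    if strategy == "mean":
        fill = statistics.mean(observed)
    elif strategy == "median":
        fill = statistics.median(observed)
    elif strategy == "most_frequent":
        fill = statistics.mode(observed)
    else:
        raise ValueError(f"unknown strategy: {strategy}")
    return [fill if math.isnan(v) else v for v in values]

nan = float("nan")
incomes = [30.0, 32.0, nan, 31.0, 1000.0, nan, 32.0]  # 1000.0 is the outlier

mean_filled = impute(incomes, "mean")
median_filled = impute(incomes, "median")
mode_filled = impute(incomes, "most_frequent")

# The single outlier drags the mean far from the typical values,
# while the median and the most frequent value stay near the bulk.
print(mean_filled[2], median_filled[2], mode_filled[2])
```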

I am trying to remove NaN values with dataframe = dataframe.dropna() but it is not working. If I can somehow remove NaN values, I can predict missing values based on other variables with simple linear regression model. — hammadshahir, Jul 24 '19 at 20:08

score -1 · Answer 2 · answered Jul 26 '19 at 19:07

It depends on the nature of the variables. You can fill the gaps with a statistic such as the mean or median. Another practice is to assign the missing entries a value that does not occur among the real values, for example 0 or -1, so that they remain distinguishable.
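A rough sketch of the sentinel approach, assuming -1 never occurs as a real value in the column. The extra `was_missing` flag is my own addition rather than part of the answer: it records which entries were filled, so that a downstream model does not have to infer it from the sentinel alone.

```python
import math

def fill_with_sentinel(values, sentinel=-1.0):
    """Replace NaN with `sentinel` and report which positions were missing."""
    was_missing = [math.isnan(v) for v in values]
    filled = [sentinel if missing else v
              for v, missing in zip(values, was_missing)]
    return filled, was_missing

nan = float("nan")
ages = [23.0, nan, 41.0, nan, 35.0]  # hypothetical column, all real values > 0
filled, was_missing = fill_with_sentinel(ages)
print(filled)
print(was_missing)
```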

Soslan Tabuev


score -1 · Answer 3 · answered Aug 08 '19 at 01:02

The hardest part of imputing a dataset is not deviating too far from the truth. One way to test how well you have done is the following: if the other variables carry enough information to impute the missing data with some level of precision, they should be able to do the same for data you already have.

So if 60 percent of the column is missing, take the row observations where this column is PRESENT.

Next, randomly remove 60% of the values of this column within that subset, keeping the originals aside. Now run the imputation methods of your choosing.

Compare the imputed dataset to the real dataset for similarity. Decide whether they are close enough for you to then run the same method against the full dataset. At least this approach will give you a leg to stand on if you need to defend your choice.
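The procedure above could be sketched roughly as follows. Everything here is illustrative: the `complete` list stands in for the rows where the column is present, median filling stands in for "the imputation methods of your choosing", and mean absolute error is just one possible way to measure similarity.

```python
import math
import random
import statistics

def mask_fraction(values, fraction, rng):
    """Hide a random `fraction` of the values; return the masked copy
    and the sorted indices that were hidden."""
    n_hide = round(len(values) * fraction)
    hidden = set(rng.sample(range(len(values)), n_hide))
    masked = [float("nan") if i in hidden else v
              for i, v in enumerate(values)]
    return masked, sorted(hidden)

def median_impute(values):
    """Stand-in imputation method: fill NaN with the observed median."""
    observed = [v for v in values if not math.isnan(v)]
    fill = statistics.median(observed)
    return [fill if math.isnan(v) else v for v in values]

def mean_absolute_error(truth, imputed, hidden):
    """Average absolute gap between true and imputed values,
    measured only on the positions that were hidden."""
    return sum(abs(truth[i] - imputed[i]) for i in hidden) / len(hidden)

rng = random.Random(0)                          # fixed seed for repeatability
complete = [float(v) for v in range(10, 30)]    # rows where the column is present
masked, hidden = mask_fraction(complete, 0.6, rng)
imputed = median_impute(masked)
error = mean_absolute_error(complete, imputed, hidden)
print(f"hidden {len(hidden)} of {len(complete)} values, MAE = {error:.2f}")
```

Repeating this with several seeds and several imputation methods gives a fairer comparison than a single run. Random masking also assumes the real gaps are missing at random, which may not hold for your data.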

Fight the Good Fight.
