0

My data has a lot of missing values, and I have to predict those values. One way is to fill them with the average of the observed values, but I would like to hear another perspective. How do experienced data scientists solve this kind of issue?

hammadshahir
  • 350
  • 4
  • 7

3 Answers

-1

Are your missing values categorical or continuous?

One way is to remove the affected samples entirely. However, this may introduce sampling bias, since the values could be missing because of some causal effect; that is, they are not missing completely at random.

If your data has enough explanatory variables, you can treat the column with missing values as the output, fit a predictive model on the complete rows, and hope that it can faithfully estimate the missing values from the variables you already have.
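As a rough illustration of the model-based idea, here is a minimal sketch that fits an ordinary least-squares model on the rows where the target is observed and predicts the rest. The function name `impute_with_regression`, the column names, and the toy data are made up for this example, and it assumes the predictor columns themselves have no missing values.

```python
import numpy as np
import pandas as pd

def impute_with_regression(df, target, predictors):
    """Fill NaNs in `target` with least-squares predictions from `predictors`.

    Hypothetical helper: assumes the predictor columns are complete and numeric.
    """
    out = df.copy()
    known = out[target].notna().to_numpy()
    # Design matrix with an intercept column.
    X = np.column_stack([np.ones(len(out)), out[predictors].to_numpy(dtype=float)])
    y = out.loc[known, target].to_numpy(dtype=float)
    # Fit only on rows where the target is observed.
    coef, *_ = np.linalg.lstsq(X[known], y, rcond=None)
    # Predict the target for the rows where it is missing.
    out.loc[~known, target] = X[~known] @ coef
    return out

# Toy data (made up): y is exactly 2 * x where observed.
df = pd.DataFrame({
    "x": [1.0, 2.0, 3.0, 4.0, 5.0],
    "y": [2.0, 4.0, np.nan, 8.0, np.nan],
})
filled = impute_with_regression(df, "y", ["x"])
```

Note that the rows with missing targets are kept and predicted, not dropped; only the fitting step is restricted to complete rows.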

Filling with the most frequent value, the median, or the mean, as you point out, could also be an option. However, be careful with outliers when averaging, as these can have a tremendous effect on the mean.
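To make the outlier point concrete, here is a small sketch using pandas `fillna`. The column names and numbers are invented for illustration; a single extreme value drags the mean far from the typical observation, while the median barely moves.

```python
import numpy as np
import pandas as pd

# Toy data (made up): one extreme income and one missing entry per column.
df = pd.DataFrame({
    "income": [30.0, 32.0, 31.0, np.nan, 1000.0],
    "city": ["A", "B", "A", None, "A"],
})

# Continuous column: compare mean and median as fill values.
mean_value = df["income"].mean()      # pulled upward by the outlier
median_value = df["income"].median()  # robust to the outlier
mean_filled = df["income"].fillna(mean_value)
median_filled = df["income"].fillna(median_value)

# Categorical column: fill with the most frequent value.
most_frequent = df["city"].mode().iloc[0]
mode_filled = df["city"].fillna(most_frequent)
```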

Bjarke Kingo
  • 400
  • 7
  • 14
  • I am trying to remove NaN values with dataframe = dataframe.dropna() but it is not working. If I can somehow remove NaN values, I can predict missing values based on other variables with simple linear regression model. – hammadshahir Jul 24 '19 at 20:08
-1

It depends on the nature of the variables. You can fill with a statistic such as the mean or median. Another practice is to assign missing entries a value that does not otherwise occur in the data, for example 0, -1, or something similar.
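A minimal sketch of the sentinel-value idea in pandas follows. The column name `age` and the sentinel `-1` are chosen only for illustration; the sentinel must be a value that cannot occur legitimately in the column. The extra indicator column is an optional addition, not something the answer above prescribes, but it lets a model tell imputed entries apart from real ones.

```python
import numpy as np
import pandas as pd

# Toy data (made up): ages are never negative, so -1 is a safe sentinel here.
df = pd.DataFrame({"age": [25.0, np.nan, 40.0, np.nan]})

# Record which entries were missing before overwriting them.
df["age_missing"] = df["age"].isna().astype(int)

# Replace missing entries with the sentinel value.
df["age"] = df["age"].fillna(-1)
```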

-1

The hardest part of imputing a dataset is not deviating too far from the truth. One way to validate how well you have done is the following test: if the other variables carry enough information to impute the missing data with some level of precision, then the same method should also be able to recover existing values that you hide on purpose.

So if 60 percent of the column is missing, take the rows where this column is PRESENT.

Next, randomly remove 60% of the values in this subset. Now run the imputation methods of your choosing.

Compare the imputed values to the real values for similarity. Decide whether they are close enough for you to then run the method against the full data set. At the very least, this approach will give you a leg to stand on if you need to defend your choices.
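The masking test described above can be sketched roughly as follows. The helper names `masked_imputation_error` and `mean_impute`, the mean-absolute-error metric, and the toy data are all assumptions made for this example; swap in whatever imputation method and similarity measure you actually want to compare.

```python
import numpy as np
import pandas as pd

def masked_imputation_error(df, column, frac, impute, seed=0):
    """Hide `frac` of the observed values in `column`, impute, and score.

    Hypothetical helper. `impute(frame, column)` must return a frame with
    the NaNs in `column` filled. Returns the mean absolute error on the
    hidden entries.
    """
    # Keep only rows where the column is present.
    complete = df[df[column].notna()].copy()
    rng = np.random.default_rng(seed)
    n_hide = int(round(frac * len(complete)))
    hidden_idx = rng.choice(complete.index.to_numpy(), size=n_hide, replace=False)
    # Remember the true values, then hide them.
    truth = complete.loc[hidden_idx, column]
    masked = complete.copy()
    masked.loc[hidden_idx, column] = np.nan
    imputed = impute(masked, column)
    return float((imputed.loc[hidden_idx, column] - truth).abs().mean())

def mean_impute(frame, column):
    """Baseline imputer: fill NaNs with the mean of the remaining values."""
    out = frame.copy()
    out[column] = out[column].fillna(out[column].mean())
    return out

# Toy data (made up): two of ten values in "y" are missing.
df = pd.DataFrame({
    "x": np.arange(10, dtype=float),
    "y": [1.0, 2.0, np.nan, 4.0, 5.0, np.nan, 7.0, 8.0, 9.0, 10.0],
})
err = masked_imputation_error(df, "y", 0.6, mean_impute)
```

Running this for several candidate imputers, and over several seeds, gives a comparable error figure for each one before you commit to a method on the full data set.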

Fight the Good Fight.

rayphaistos1
  • 11
  • 1
  • 3