-1

I have the following dataset:

5   3   3   5   10  10  3   8   2   12  8   6   2   5   6   5   10  4   3   5   4   3   3   5   8   3   5   6   6   1   10  3   6   6   5   8   3   4   3   4   4   3   2.5 1   4   2   2   3   5   10  4   4   6   3   2   3   8   3   4   4   3   3   4   8   4   4   2   4   4   3   2   10  6   3   7   3   5   3   1   4   3   4   3   4   4   2   3   2   4   7   4   6   3.5 3.5 5   3   4   3   5   3   1.5 2.5 3   7   2   5   3   4   2   4   5   3   4   5   4.5 4   6   3   2   1   3   2   2   3   4   6   2   4   2   3   6   1.5 3   3   1   4   3   3   2   3   2   2   6   3   15  1   4   5   2   6   2   4   8   2   8   4   4   4   3   8   4   4   8.5 3   2   7   0.5 3   3   3   2   3   2   4   5   6   2   3.5 3   3   2   2   2.5 2   2   5   2   8   2   4   3   3   2   7   2   4   2   4   4   3   2.5 3   3   3   5 NA NA NA NA NA  NA NA NA NA NA NA NA NA NA NA

I want to replace the NAs using either mean or median imputation.

Which method would be appropriate in this case, and why?

Please help me understand.

Thanks.

In R I am trying the same with Median using:

# replacing with Median (closing parenthesis of with() was missing)
df$val[is.na(df$val)] <- with(df,
                              ave(val, FUN = function(x)
                                  median(x, na.rm = TRUE)))[is.na(df$val)]

I have a feeling that this is not the correct way to impute.

Can someone help clarify my doubts:

  1. Does median imputation behave differently when some values occur with high frequency and others with low frequency?
  2. Because of the outliers, imputing with the mean would not be a good idea. What alternative methods are there?

Thanks.

Julien
Madhu Sareen
  • which source are you following for ML?? – Shubham Agarwal Bhewanewala Mar 28 '17 at 05:25
  • Why are you using `ave`? It's not necessary - `val[is.na(val)] <- median(val,na.rm=TRUE)` will do it – thelatemail Mar 28 '17 at 05:29
  • Sorry but this doesn't answer my question... I need to clarify my doubts, rest I will do.. I didn't ask for help on coding but on methodology... – Madhu Sareen Mar 28 '17 at 05:32
  • 1
    Comments aren't for answers... hence they are called comments, not answers. They are (generally) for feedback that doesn't answer your question but would help you in one way or another. – spicypumpkin Mar 28 '17 at 05:34
  • 4
    "*I didn't ask for help on coding but on methodology*" - Stackoverflow is specifically for coding assistance, not methodological assistance. http://stats.stackexchange.com would be more appropriate. – thelatemail Mar 28 '17 at 05:38

3 Answers

2

It depends on the distribution of the data. If there are many outliers, use the median for missing-value imputation.

It is best to look at the observed values first. With the data in df$val:

val_obs <- na.omit(df$val)   # observed (non-missing) values only

summary(val_obs)

hist(val_obs)

Then impute.

Replacing by mean:

df$val <- ifelse(is.na(df$val), mean(df$val, na.rm = TRUE), df$val)

Replacing by median:

df$val <- ifelse(is.na(df$val), median(df$val, na.rm = TRUE), df$val)
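
Filling every NA with one constant also changes the shape of the data, which is worth checking after either replacement. The sketch below uses a small made-up vector (not the dataset from the question; the names val, val_imp, sd_before and sd_after are illustrative only) to show that the median stays put while the spread shrinks:

```r
# Hypothetical toy vector (NOT the asker's data): 8 observed values, 2 missing.
val <- c(2, 3, 3, 4, 5, 3, 10, 2, NA, NA)

# Median imputation: every NA receives the same value.
val_imp <- ifelse(is.na(val), median(val, na.rm = TRUE), val)

# Spread of the observed values vs. spread after imputation.
sd_before <- sd(val, na.rm = TRUE)
sd_after  <- sd(val_imp)

# The median is unchanged, but the standard deviation drops because
# the imputed points all sit in the middle of the distribution.
print(c(median_before = median(val, na.rm = TRUE),
        median_after  = median(val_imp)))
print(c(sd_before = sd_before, sd_after = sd_after))
```

The more values are missing, the stronger this artificial shrinkage of the variance becomes.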
Ajay Ohri
  • Thanks a lot Ajay. I would appreciate some more insight into interpreting the results. That is, after imputation, have the basic properties of the data changed, or have I lost any information due to imputation, etc.? :-) – Madhu Sareen Mar 28 '17 at 06:16
  • 1
    Use plot(density(df$val)) to see it graphically, and skewness(df$val) and kurtosis(df$val) (from a package such as e1071 or moments) to check for a skewed distribution. For outliers, there is a function outlierTest in the car package, which works on a fitted model: https://www.rdocumentation.org/packages/car/versions/2.1-4/topics/outlierTest – Ajay Ohri Mar 28 '17 at 06:22
1

For your second point, you have already put forth the approach: if you are worried about outliers, median imputation is more appropriate than mean imputation.

As for the first point, it should not be a problem for the data given, because the median ignores the magnitude of most of the data and depends only on the values in the middle of the sorted sample.
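
A small sketch of both points, using a made-up vector with heavily repeated values (the names x and x_out and the numbers are illustrative, not taken from the question's data): adding one extreme value moves the mean a lot and leaves the median where it was.

```r
# Hypothetical toy data with many ties, as in the question's situation.
x     <- c(3, 3, 3, 3, 4, 4, 5, 2, 2, 6)
x_out <- c(x, 150)            # same data plus one extreme outlier

# Frequency of each value: 3 occurs most often.
print(table(x))

# The mean is dragged towards the outlier, the median is not.
print(c(mean_x = mean(x), mean_x_out = mean(x_out)))
print(c(median_x = median(x), median_x_out = median(x_out)))
```

With tied data like this the median simply lands on one of the frequent middle values, so the imputed entries reinforce a value that is already common.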

student
1

For most datasets, mean and median are among the worst imputation methods. (This always depends on the dataset, of course; there are datasets where they are fine.)

In general, the best imputation results come from exploiting correlations between variables, or the correlation of one variable with itself over time.

So it would be interesting to see your whole data frame, to check whether such correlations exist.
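
As a rough illustration of what using a correlation can look like, here is a minimal regression-imputation sketch. Everything in it is an assumption for demonstration: the data frame df_demo, its predictor column size, and the simulated linear relationship are invented, since the question only shows a single column.

```r
set.seed(1)

# Hypothetical data: val depends roughly linearly on a second column, size.
df_demo <- data.frame(size = 1:20)
df_demo$val <- 2 * df_demo$size + rnorm(20)
df_demo$val[c(4, 11, 17)] <- NA          # knock out a few values

# Fit on the complete rows (lm drops rows with NA by default).
fit <- lm(val ~ size, data = df_demo)

# Predict the missing entries from their size and fill them in.
miss <- is.na(df_demo$val)
df_demo$val[miss] <- predict(fit, newdata = df_demo[miss, , drop = FALSE])

print(df_demo[c(4, 11, 17), ])
```

Unlike a single mean or median, each filled-in value here follows the row's own predictor, which is only useful when such a related variable actually exists in the data.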

If you just want to impute with the mean or median, here are some quick methods:

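# mean (imputeTS 3.0 and later name this function na_mean;
# older versions called it na.mean). It returns the imputed vector.
library("imputeTS")
na_mean(df$val, option = "mean")

# median
na_mean(df$val, option = "median")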
Steffen Moritz
  • Thanks very much for your advice, I really appreciate it. Could you also please add an example in which you would go for correlation-based imputation? It would surely help the community. Thanks in advance. – Madhu Sareen Apr 22 '17 at 17:31