Impute missing values using the available observations from the corresponding group in R

Question

Hi, I am trying to impute missing values using the available values in the corresponding group. Please see the following data for an example.

dput(question)
structure(list(Group = structure(c(1L, 1L, 1L, 1L, 2L, 2L, 2L
), .Label = c("A", "B"), class = "factor"), Year = c(2004L, 2005L, 
2006L, 2007L, 2006L, 2007L, 2008L), Score = c(NA, 100L, NA, 95L, 
NA, NA, 88L)), .Names = c("Group", "Year", "Score"), class = "data.frame", row.names = c(NA, 
-7L))

For the first NA score, in group A in year 2004, I would like to use the available observation from the closest year in the same group (that is, 100 from group A in year 2005). For the NA in group A in year 2006, I would like to use the average of the scores from 2005 and 2007 in group A. For the NAs in group B in years 2006 and 2007, I would like to use the value from 2008 in group B.

Is there an R package for imputation that is applicable to my case? Or do you have any suggestions for such an imputation?

I really appreciate any help.

Update: I amended PsyNeuroSci’s function so that the distance is calculated using Year. Sorry, I did not know how to put the amended code after PsyNeuroSci’s.

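# dat: data frame for ONE group; var0: name of the distance column (e.g. "Year");
# var: name of the column to impute (e.g. "Score")
impute_nearest = function(dat, var0, var){
  orig = dat[, var]                 # original values, so imputed values are never reused
  known = which(!is.na(orig))       # rows that have an observed value
  if(length(known) == 0) return(dat)  # nothing to impute from
  for(i in which(is.na(orig))){
    # distance in var0 from row i to every observed row
    d = abs(dat[known, var0] - dat[i, var0])
    # mean of the observed values at the smallest distance (ties are averaged)
    dat[i, var] = mean(orig[known[d == min(d)]])
  }
  return(dat)
}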

possible duplicate of [Imputation in R](http://stackoverflow.com/questions/13114812/imputation-in-r) — nograpes, Oct 07 '14 at 20:59
I don't see the whole logic behind your imputation wishes. Just checking: If you have only one close value (within 1 year difference) in the same group you want NA to be assigned to that value; if you have two values you want the mean of those; if you have no value within one year you take the closest available? — abel, Oct 07 '14 at 21:12
Logic for imputation: use the observations from the closest year(s) in the same group to replace the NA. In the 2nd case, since 2005 and 2007 are both 1 year apart from 2006 in group A (i.e. both are nearest neighbors), the average of the observations in 2005 and 2007 is used. In case 3, the closest year with an available score for 2006 and 2007 in group B is 2008, so the 2008 score in group B will be assigned to both 2006 and 2007. Hope it is clear. Thanks. — user2037892, Oct 07 '14 at 21:25
Hi nograpes, would you please specify a little bit? Thank you very much — user2037892, Oct 07 '14 at 21:27

abel · Accepted Answer · 2014-10-07T22:19:56.680

I highly doubt that using a single value, or the mean of only two, to impute your missing values is a good idea. Imagine you have a huge outlier in year 2010 and the 5 values before it are all NAs.
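To make that concern concrete, here is a small sketch with made-up numbers (the `years` and `scores` vectors are hypothetical, not from the question). It applies the nearest-year rule to five NAs followed by one extreme value:

```r
years  <- 2005:2010
scores <- c(NA, NA, NA, NA, NA, 1000)   # hypothetical: a single extreme value in 2010

known   <- which(!is.na(scores))        # positions of observed values
imputed <- scores
for (i in which(is.na(scores))) {
  d <- abs(years[known] - years[i])     # distance in years to each observed value
  imputed[i] <- mean(scores[known[d == min(d)]])
}
imputed
# 1000 1000 1000 1000 1000 1000
```

Every imputed value is a copy of the outlier, so the whole series ends up dominated by one observation.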

However, here is a working solution for your problem. I know it is probably not the most elegant way, but it works and with that I let others suggest better ways to do it.

Split up your data based on the Groups:

datA=dat[dat$Group=="A",]
datB=dat[dat$Group=="B",]
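With more than two groups, writing one subset per group by hand gets tedious. A minimal sketch using base R's `split()`, assuming the question's data is stored in a data frame called `dat`, gives a named list with one data frame per level of `Group`:

```r
# Rebuild the question's data (assumed here to be stored in `dat`)
dat <- data.frame(
  Group = factor(c("A", "A", "A", "A", "B", "B", "B")),
  Year  = c(2004L, 2005L, 2006L, 2007L, 2006L, 2007L, 2008L),
  Score = c(NA, 100L, NA, 95L, NA, NA, 88L)
)

# One data frame per group, named after the factor levels
by_group <- split(dat, dat$Group)

datA <- by_group[["A"]]   # same rows as dat[dat$Group == "A", ]
datB <- by_group[["B"]]   # same rows as dat[dat$Group == "B", ]
```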

Here is a short function that replaces each missing value with the mean of the closest non-missing values, i.e. those at the same smallest distance on either side of it.

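impute_nearest = function(dat, var){
   orig = dat[,var]                      # original values, so imputed ones are not reused
   non.na.pos = which(!is.na(orig))      # row positions of the observed values
   if(length(non.na.pos) == 0) return(dat)   # nothing to impute from
   for(i in which(is.na(orig))){
      # distance in rows (not in Year) from row i to each observed row
      distance = abs(i - non.na.pos)
      # mean of the observed values at the smallest row distance
      dat[,var][i] = mean(orig[non.na.pos[distance == min(distance)]])
   }
   return(dat)
}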

Use it as follows, with dat being your data frame and var the name of the variable you wish to impute in that data frame. Give the variable name as a character string (e.g. "Score"). Here is an example for one of the subsets of your data:

impute_nearest(datB, "Score")
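To handle every group in one pass and measure distance in `Year` rather than in rows, one possible base R sketch follows. It assumes the question's columns `Group`, `Year` and `Score`; the helper name `impute_by_year` and the `result` data frame are made up for illustration.

```r
dat <- data.frame(
  Group = factor(c("A", "A", "A", "A", "B", "B", "B")),
  Year  = c(2004L, 2005L, 2006L, 2007L, 2006L, 2007L, 2008L),
  Score = c(NA, 100L, NA, 95L, NA, NA, 88L)
)

# Hypothetical helper: fill each NA in `value` with the mean of the observed
# values whose `time` is closest to the NA's `time` (ties are averaged).
impute_by_year <- function(time, value) {
  known <- which(!is.na(value))
  out <- as.numeric(value)
  if (length(known) == 0) return(out)     # nothing to impute from
  for (i in which(is.na(value))) {
    d <- abs(time[known] - time[i])       # distance in years
    out[i] <- mean(value[known[d == min(d)]])
  }
  out
}

# Work on a copy and impute within each group separately
result <- dat
result$Score <- as.numeric(result$Score)
for (idx in split(seq_len(nrow(dat)), dat$Group)) {
  result$Score[idx] <- impute_by_year(dat$Year[idx], dat$Score[idx])
}

result$Score
# 100.0 100.0  97.5  95.0  88.0  88.0  88.0
```

The helper always reads from the original values, so an imputed score is never used to impute another one. If the 2007 row in group A were dated 2008 instead, the NA in 2006 would receive 100, because only 2005 would then be the nearest year.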

Hi PsyNeuroSci, thank you for the suggestion. But it seems to me that the distance is not calculated based on Year. For example, if I change the year 2007 in group A to 2008, the observation that is supposed to be used for the NA in year 2006 should be 100 from year 2005, instead of the mean score of 2005 and 2007. — user2037892, Oct 08 '14 at 14:06
True, there is a mistake in the code (Guess it was late yesterday:). I'll get to that later and will put the corrected version up. Thanks for the feedback! — abel, Oct 08 '14 at 15:41
