
I’m cleaning some data for a survival analysis and want missing values to be imputed from the surrounding values within a given subject. I'd like to use the mean of the closest previous and closest subsequent values for the participant. If there is no subsequent value present, I'd like to carry the previous value forward until a subsequent value is present.

I’ve been trying to break the problem into smaller, more manageable operations and objects. However, every solution I come up with forces me to use conditional logic based on the rows immediately above and below a missing value and, quite frankly, I’m at a loss as to how to do this. I would appreciate some guidance if you know of a good technique I can experiment with, or of any good search terms I can use when looking up a solution.

The details are below:

# Fake dataset creation
id <- c(1,1,1,1,1,1,1,2,2,2,2,2,2,2,3,3,3,3,3,3,3,4,4,4,4,4,4,4)
time <- c(0,1,2,3,4,5,6,0,1,2,3,4,5,6,0,1,2,3,4,5,6,0,1,2,3,4,5,6)
ss <- c(2,2,4,3,NA,0,0,1,4,0,NA,0,0,0,4,2,1,3,3,2,NA,3,4,3,NA,NA,0,0)
mydat <- data.frame(id, time, ss)

*Bold and underlined characters represent changes from the dataset above

The goal here is to find a way to get the NA values for ID #1 (variable ss) to look like this: 2,2,4,3,1.5,0,0

ID# 2 (variable ss) to look like this: 1,4,0,0,0,0,0

ID #3 (variable ss) to look like this: 4,2,1,3,3,2,NA (no change because the row with NA will be deleted eventually)

ID #4 (variable ss) to look like this: 3,4,3,3,1.5,0,0 (this one requires multiple changes and I expect it is the most challenging to tackle).
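The four targets above can be written down as a small check, which makes it easier to test any candidate solution. This is only a sketch: `expected` and `check_imputation` are names made up here for illustration, and the data are rebuilt with `rep()` rather than typed out.

```r
# Same data as in the question, built with rep()
id    <- rep(1:4, each = 7)
time  <- rep(0:6, times = 4)
ss    <- c(2,2,4,3,NA,0,0, 1,4,0,NA,0,0,0, 4,2,1,3,3,2,NA, 3,4,3,NA,NA,0,0)
mydat <- data.frame(id, time, ss)

# Target values for IDs 1 to 4, copied from the question (hypothetical name)
expected <- c(2,2,4,3,1.5,0,0,
              1,4,0,0,0,0,0,
              4,2,1,3,3,2,NA,
              3,4,3,3,1.5,0,0)

# Hypothetical helper: TRUE only if the candidate matches the targets,
# including an NA in the same position for ID #3
check_imputation <- function(candidate) {
  isTRUE(all.equal(as.numeric(candidate), expected))
}

check_imputation(mydat$ss)  # FALSE: the raw data still has unfilled NAs
```

Any imputed column can then be passed to `check_imputation()` to see whether it reproduces the values listed above.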

Jonah M.
  • You probably know that in general it's best practice to show your attempts to implement the algorithm. Otherwise, it may look like misusing the community as a coding service, which is not cool. – lukeA Dec 20 '16 at 22:13
  • Thanks for the heads up. I didn't realize that. I'll be sure to do that next time. – Jonah M. Dec 20 '16 at 22:31
  • Subsequent readers of this Q&A should realize that the proposed "imputation" process will invalidate the statistical inferences from the data because the covariates will have less variability than reality. No noise was introduced, so this is not similar to the usual statistical imputation methods. – IRTFM Dec 28 '16 at 17:18

1 Answer


If processing speed is not the issue (I guess "ID #4" makes it hard to vectorize imputations), then maybe try:

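f <- function(x) {
  # Fill NAs from left to right so that an earlier fill can feed a later one
  for (i in which(is.na(x))) {
    # Immediate neighbours of position i
    sel <- x[i + c(-1, 1)]
    # Except at the last position, ignore a missing neighbour;
    # with only the previous value left, it is carried forward
    if (i < length(x))
      sel <- sel[!is.na(sel)]
    # At the last position x[i + 1] is NA, so the result stays NA
    x[i] <- mean(sel)
  }
  return(x)
}
# Apply f separately within each subject
cbind(mydat, ss_imp = ave(mydat$ss, mydat$id, FUN = f))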
#    id time ss ss_imp
# 11  1    0  2    2.0
# 12  1    1  2    2.0
# 13  1    2  4    4.0
# 14  1    3  3    3.0
# 15  1    4 NA    1.5
# 16  1    5  0    0.0
# 17  1    6  0    0.0
# 21  2    0  1    1.0
# 22  2    1  4    4.0
# 23  2    2  0    0.0
# 24  2    3 NA    0.0
# 25  2    4  0    0.0
# 26  2    5  0    0.0
# 27  2    6  0    0.0
# 31  3    0  4    4.0
# 32  3    1  2    2.0
# 33  3    2  1    1.0
# 34  3    3  3    3.0
# 35  3    4  3    3.0
# 36  3    5  2    2.0
# 37  3    6 NA     NA
# 41  4    0  3    3.0
# 42  4    1  4    4.0
# 43  4    2  3    3.0
# 44  4    3 NA    3.0
# 45  4    4 NA    1.5
# 46  4    5  0    0.0
# 47  4    6  0    0.0
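To see why the loop copes with the two consecutive NAs of ID #4, it may help to trace it by hand. The sketch below repeats the same neighbour-averaging steps on that one subject; `x` and `nb` are names chosen here for illustration.

```r
x <- c(3, 4, 3, NA, NA, 0, 0)   # ss for ID #4

# Position 4: neighbours are x[3] = 3 and x[5] = NA.
# Dropping the NA leaves only 3, so the previous value is carried forward.
nb <- x[c(3, 5)]
x[4] <- mean(nb[!is.na(nb)])

# Position 5: neighbours are now x[4] = 3 (just filled) and x[6] = 0.
nb <- x[c(4, 6)]
x[5] <- mean(nb[!is.na(nb)])

x
# [1] 3.0 4.0 3.0 3.0 1.5 0.0 0.0
```

The order of the fills matters. If both positions were computed in a single pass from the original vector, position 5 would see only the neighbour 0 and would not receive 1.5.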
lukeA