-1

I have a data frame with some NAs in column 'myvalues':

x <- data.frame(mydates = as.Date(c("2018/04/01","2018/04/02","2018/04/03","2018/04/04",
                                                     "2018/04/05","2018/04/06","2018/04/07","2018/04/08",
                                                     "2018/04/09","2018/04/10","2018/04/11")),
                                 myvalues = c(2.3, NA, 2.1, 2.2, NA, 2.4, 2.3, 2.1, NA, NA, 2.6))
x

I'd like to replace each NA with the mean of the previous two values. For the NA in row 2, it should be equal to the value in row 1. I could do it using a 'for' loop through NAs in x$myvalues. However, it's very slow, and I am looking for a fast solution, because I'll have to do the same thing for millions on tiny data frames like x.

Thank you very much!

user3245256
  • 1,842
  • 4
  • 24
  • 51
  • Also, please don't suggest na.ma function from package 'imputeTS'. It's a nice function, but it borrows from both sides - and it's important to me to keep the lack of variance that I get if I borrow only from the earlier time periods. – user3245256 Apr 09 '18 at 19:44
  • As my previous comment implies - each new imputed value should be the mean of two previous (earlier) values - AFTER all of them have been imputed (if needed). – user3245256 Apr 10 '18 at 14:49
  • Thus, if we have 3 NAs in a row, the first should be imputed. The second is the mean of the (imputed) previous value and value before that. The third is the mean of the two (imputed) previous values, etc. – user3245256 Apr 10 '18 at 14:52

3 Answers3

1

A data.table solution that should be fairly speedy:

library(data.table)

x <-
  data.frame(
    mydates = as.Date(
      c(
        "2018/04/01",
        "2018/04/02",
        "2018/04/03",
        "2018/04/04",
        "2018/04/05",
        "2018/04/06",
        "2018/04/07",
        "2018/04/08",
        "2018/04/09",
        "2018/04/10",
        "2018/04/11"
      )
    ),
    myvalues = c(2.3, NA, NA, 2.2, NA, NA, 2.3, NA, NA, NA, 2.6)
  )

# Carry forward mean of last two non-missing values
setDT(x)

x[, segment := cumsum(!is.na(myvalues))]
x[, last1 := myvalues[1], by = segment]
x[!is.na(myvalues), segment2:=segment]
x[is.na(myvalues), segment2:=segment-1]
x[, last2 := myvalues[1], by = segment2]
x[, repl:=rowMeans(.SD, na.rm=T), .SDcols=c("last1", "last2")]
x[, myvalues2:=myvalues]
x[is.na(myvalues2) & !is.nan(repl), myvalues2:=repl]

x[, list(mydates, myvalues, myvalues2)]
# mydates myvalues myvalues2
# 1: 2018-04-01      2.3      2.30
# 2: 2018-04-02       NA      2.30
# 3: 2018-04-03       NA      2.30
# 4: 2018-04-04      2.2      2.20
# 5: 2018-04-05       NA      2.25
# 6: 2018-04-06       NA      2.25
# 7: 2018-04-07      2.3      2.30
# 8: 2018-04-08       NA      2.25
# 9: 2018-04-09       NA      2.25
# 10: 2018-04-10       NA      2.25
# 11: 2018-04-11      2.6      2.60
pbaylis
  • 1,529
  • 11
  • 19
  • Thank you, but it doesn't work for this x: x <- data.frame(mydates = as.Date(c("2018/04/01","2018/04/02","2018/04/03","2018/04/04", "2018/04/05","2018/04/06","2018/04/07","2018/04/08", "2018/04/09","2018/04/10","2018/04/11")), myvalues = c(2.3, NA, NA, 2.2, NA, NA, 2.3, NA, NA, NA, 2.6)) – user3245256 Apr 10 '18 at 00:23
  • I see. The question isn't entirely clear as written: you should clarify to indicate that you mean two previous _non-missing_ values, as that wasn't obvious to me. I've modified my answer. – pbaylis Apr 10 '18 at 05:45
  • Sorry, but it still doesn't calculate correctly. Each (imputed) value should not be equal to the mean of the two earlier non-missing values, but to the mean of the two previous values - AFTER they've been imputed. In other words, the results should match those from the Reduce response (x$values = Reduce(function(x,y)... – user3245256 Apr 10 '18 at 14:08
1

You can reduce the Reduce function. In this case, for example, the last NA will be the average value of the previous two values, but it first fills the previous value then uses that to obtain the current value

x$myvalues=Reduce(function(x,y)if(is.na(y))c(x,mean(tail(x,2))) else c(x,y),x$myvalues)
> x
      mydates myvalues
1  2018-04-01     2.30
2  2018-04-02     2.30
3  2018-04-03     2.10
4  2018-04-04     2.20
5  2018-04-05     2.15
6  2018-04-06     2.40
7  2018-04-07     2.30
8  2018-04-08     2.10
9  2018-04-09     2.20
10 2018-04-10     2.15
11 2018-04-11     2.60
Onyambu
  • 67,392
  • 3
  • 24
  • 53
0

A vectorised solution, if you need speed:

x = c(2.3, NA, 2.1, 2.2, NA, 2.4, 2.3, 2.1, NA, NA, 2.6)
y <- which(!is.na(x))[findInterval(which(is.na(x)), which(!is.na(x)))]
y[y==1] = NA
x[which(is.na(x))] = (x[y-1] + x[y])/2
x
# [1] 2.30   NA 2.10 2.20 2.15 2.40 2.30 2.10 2.20 2.20 2.60

The above version has NA as 2nd value becuase there are not 2 values before the first NA from which to take the mean. If instead you want this NA to be the mean of the only value preceding it, then we can do this instead:

y <- which(!is.na(x))[findInterval(which(is.na(x)), which(!is.na(x)))]
x[which(is.na(x))] = (x[pmax(1,y-1)] + x[y])/2
# [1] 2.30 2.30 2.10 2.20 2.15 2.40 2.30 2.10 2.20 2.20 2.60
dww
  • 30,425
  • 5
  • 68
  • 111
  • Not sure about this one. The correct solution should be: 2.3, 2.3, 2.1, 2.2, 2.15, 2.4, 2.3, 2.1, 2.2, 2.15, 2.6 – user3245256 Apr 10 '18 at 14:14
  • The required behaviour when there are not 2 values before an NA was a bit unclear. Updated to show how to take first value of vector when there are not two values from which to take a mean. NB the correct solution (according to the most accurate interpretation of how the question was written) does not have 2.15 as the second last value. The mean of 2.1 and 2.3 is 2.2 as in my solution, but not the accepted answer. – dww Apr 10 '18 at 14:42
  • I checked the Reduce solution again. It's correct. The second section of the solution you offered is closer, but still has one incorrect value - second from the end. After the imputation, the 3rd value from the end becomes (2.3 + 2.1)/2 = 2.2 (correct!). But the 2nd value from the end should be: (2.1 + 2.2)/2 = 2.15. – user3245256 Apr 10 '18 at 14:44
  • Your question was underspecified. It is the question, not the solution, that is incorrect. You did not say that the mean of imputed values was required. You only said that the mean of the previous values was wanted – dww Apr 10 '18 at 14:46
  • I agree and apologize my question was under-specified. I added a comment to clarify. – user3245256 Apr 10 '18 at 14:50