Consider the following working example:

library(data.table)
library(imputeTS)

DT <- data.table(
  time = c(1:10),
  var1 = c(1:5, NA, NA, 8:10),
  var2 = c(NA, NA, 1:4, NA, 6, 7, 8),
  var3 = c(1:6, rep(NA, 4))
)

    time var1 var2 var3
 1:    1    1   NA    1
 2:    2    2   NA    2
 3:    3    3    1    3
 4:    4    4    2    4
 5:    5    5    3    5
 6:    6   NA    4    6
 7:    7   NA   NA   NA
 8:    8    8    6   NA
 9:    9    9    7   NA
10:   10   10    8   NA

I want to impute the missing values inside each time series using na_interpolation from the imputeTS package. However, I do not want to impute missing values at the beginning or the end of a series. These runs can be of varying length, and in my application replacing them would not make sense.

However, when I run the following code to impute the series, all NAs get replaced, including the leading and trailing ones:

cols_to_impute_example <- c("var1", "var2", "var3")
DT[, (cols_to_impute_example) := lapply(.SD, na_interpolation), .SDcols = cols_to_impute_example]
> DT
    time var1 var2 var3
 1:    1    1    1    1
 2:    2    2    1    2
 3:    3    3    1    3
 4:    4    4    2    4
 5:    5    5    3    5
 6:    6    6    4    6
 7:    7    7    5    6
 8:    8    8    6    6
 9:    9    9    7    6
10:   10   10    8    6

What I want to achieve is:

    time var1 var2 var3
 1:    1    1   NA    1
 2:    2    2   NA    2
 3:    3    3    1    3
 4:    4    4    2    4
 5:    5    5    3    5
 6:    6    6    4    6
 7:    7    7    5   NA
 8:    8    8    6   NA
 9:    9    9    7   NA
10:   10   10    8   NA
Steffen Moritz
Florestan

3 Answers


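A dplyr implementation: select the middle part of the data frame, interpolate the NAs there, and bind the untouched start and end rows back on. Note that a gap touching the edge of the middle slice is filled with the nearest observed value, as rows 6 and 7 of the output below show.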

  library(imputeTS)
  library(dplyr)

  DT <- tibble(
    time = 1:10,
    var1 = c(1:5, NA, NA, 8:10),
    var2 = c(NA, NA, 1:4, NA, 6, 7, 8),
    var3 = c(1:6, rep(NA, 4))
  )

  # Interpolate NAs only in the rows between the first `row_start`
  # and the last `row_end` rows of DT; those outer rows stay untouched.
  na_inter_middle <- function(row_start, row_end) {
    n <- nrow(DT)
    # leading rows, left as they are
    start <- DT[1:row_start, ]
    # middle rows, where NAs are interpolated
    middle <- DT[(row_start + 1):(n - row_end), ]
    # trailing rows, left as they are
    end <- DT[(n - row_end + 1):n, ]

    bind_rows(
      start,
      mutate_all(middle, na_interpolation),
      end
    )
  }

  na_inter_middle(2, 3)


# A tibble: 10 x 4
    time  var1  var2  var3
   <int> <dbl> <dbl> <dbl>
 1     1     1    NA     1
 2     2     2    NA     2
 3     3     3     1     3
 4     4     4     2     4
 5     5     5     3     5
 6     6     5     4     6
 7     7     5     4     6
 8     8     8     6    NA
 9     9     9     7    NA
10    10    10     8    NA
  • Thanks for the answer. In my application, I have several hundred columns (time series), so manually specifying the start and end rows is not feasible. I think na.approx from the zoo package (as Roland commented earlier) is the solution. – Florestan Sep 16 '19 at 12:21

It may not be well known, but you can pass additional parameters of approx through the na_interpolation function of imputeTS.

This one could be solved with:

library(imputeTS)
DT[,(2:4) := lapply(.SD, na_interpolation, yleft = NA , yright = NA), .SDcols = 2:4]

Here yleft and yright specify what to do with the leading and trailing NAs, respectively.

Which leads to the desired output:

    time var1 var2 var3
 1:    1    1   NA    1
 2:    2    2   NA    2
 3:    3    3    1    3
 4:    4    4    2    4
 5:    5    5    3    5
 6:    6    6    4    6
 7:    7    7    5   NA
 8:    8    8    6   NA
 9:    9    9    7   NA
10:   10   10    8   NA

Nearly all parameters listed in the documentation of the approx function can also be passed to na_interpolation as additional arguments for fine-tuning.
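
To see what yleft and yright do in isolation, here is a minimal base R sketch that calls approx directly on the var2 column from the question. The use of rule = 2 to mimic the default "fill everything" behaviour is my assumption about how the imputation works internally, not something confirmed above; the variable names are only for illustration.

```r
# var2 from the question: two leading NAs and one interior gap.
y <- c(NA, NA, 1, 2, 3, 4, NA, 6, 7, 8)
idx <- which(!is.na(y))  # positions of the observed values

# rule = 2 extends the nearest observed value past both ends
# (assumed here to mirror the default behaviour).
filled_all <- approx(idx, y[idx], xout = seq_along(y), rule = 2)$y

# Supplying yleft / yright overrides what is returned outside the
# observed range, so the leading NAs survive.
filled_inner <- approx(idx, y[idx], xout = seq_along(y), rule = 2,
                       yleft = NA, yright = NA)$y

print(filled_all)
print(filled_inner)
```

Only the interior gap at position 7 is interpolated in the second call; positions 1 and 2 lie left of the first observation and therefore receive yleft.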

Steffen Moritz

The zoo package offers an interpolation function, na.approx, that allows more customization:

library(zoo)
DT[,(2:4) := lapply(.SD, na.approx, x = time, na.rm = FALSE), .SDcols = 2:4]
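
The na.rm = FALSE argument matters here. As a small sketch on a plain vector (the var2 column from the question; the names trimmed and kept are only illustrative): with the default na.rm = TRUE, na.approx drops the leading and trailing NAs, so the result is shorter than the input and no longer lines up with the rows of DT.

```r
library(zoo)

# var2 from the question: two leading NAs and one interior gap.
y <- c(NA, NA, 1, 2, 3, 4, NA, 6, 7, 8)

# Default na.rm = TRUE: leading/trailing NAs are trimmed off.
trimmed <- na.approx(y)

# na.rm = FALSE: same length as the input, outer NAs kept in place.
kept <- na.approx(y, na.rm = FALSE)

print(length(trimmed))
print(kept)
```

With na.rm = FALSE each interpolated column keeps its original length, which is what the `:=` assignment in data.table needs.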
Roland
  • It's a little bit harder to find in the documentation, but can be done in a similar way with imputeTS: `DT[,(2:4) := lapply(.SD, na_interpolation, yleft = NA , yright = NA), .SDcols = 2:4]` For both imputeTS and zoo it is possible to also use parameters from the approx function itself - sometimes this is quite useful like in this case. – Steffen Moritz Nov 16 '19 at 00:02