10

I'm trying to use the na.approx() function from the zoo library (in conjunction with xts) to interpolate missing values from repeated measures data for multiple individuals with multiple measurements.

Sample data...

event.date <- c("2010-05-25", "2010-09-10", "2011-05-13", "2012-03-28", "2013-03-07",    
                "2014-02-13", "2010-06-11", "2010-09-10", "2011-05-13", "2012-03-28",
                "2013-03-07", "2014-02-13")
variable   <- c("neck.bmd", "neck.bmd", "neck.bmd", "neck.bmd", "neck.bmd", "neck.bmd",
                "wbody.bmd", "wbody.bmd", "wbody.bmd", "wbody.bmd", "wbody.bmd", "wbody.bmd")
value      <- c(0.7490, 0.7615, 0.7900, 0.7730, NA, 0.7420, 1.0520, 1.0665, 1.0760,
                1.0870, NA, 1.0550)
## Bind into a data frame
df <- data.frame(event.date, variable, value)
rm(event.date, variable, value)
## Convert date
df$event.date <- as.Date(df$event.date)
## Load libraries
library(magrittr)
library(xts)
library(zoo)

I can interpolate one missing data point for a single outcome for a given person using xts() and na.approx()....

## Subset one variable
wbody <- subset(df, variable == "wbody.bmd")
## order/index and then interpolate
xts(wbody$value, wbody$event.date) %>%
  na.approx()
2010-06-11 1.052000
2010-09-10 1.066500
2011-05-13 1.076000
2012-03-28 1.087000
2013-03-07 1.070977
2014-02-13 1.055000

Not ideal having a matrix returned, but I can work around that. The main problem I have though is that I've multiple outcomes for multiple people. I, perhaps naively thought that since this is therefore a split-apply-combine problem that I could utilise dplyr to achieve this in the following manner...

## Load library
library(dplyr)
## group and then arrange the data (to ensure dates are correct)
df %>%
  group_by(variable) %>%
    arrange(variable, event.date) %>%
      xts(.$value, .$event.date) %>%
        na.approx()

Error in xts(., .$value, .$event.date) : order.by requires an appropriate time-based object

It seems that dplyr doesn't play well with xts/zoo and I've spent a couple of hours searching around trying to find tutorials/examples on how to interpolate missing data points in R, but all I've found are single case examples and so far I've been unable to find anything on how to do this for multiple sites for multiple people (I realise I could make it just a multiple people problem by reshaping my data to wide but that still wouldn't solve the problem I'm encountering).

Any thoughts/advice/insights on how to proceed would be greatly appreciated.

Thanks

EDIT : Clarification that some functions come from zoo package.

UseR10085
  • 7,120
  • 3
  • 24
  • 54
slackline
  • 2,295
  • 4
  • 28
  • 43
  • 2
    I'm not familiar with `xts` but maybe you're looking for this: `df %>% group_by(variable) %>% arrange(variable, event.date) %>% mutate(value = na.approx(value))` where na.approx is from zoo package. If you want to modify columns using dplyr, you normally do that within a `mutate` call or, if you hava arbitrary functions, inside a `do` call. – talat Jan 13 '15 at 11:56
  • AFAIK, there's no `na.approx` function in xts (just installed it) – talat Jan 13 '15 at 12:05
  • @slackline, does the code in my first comment do what what you expected? – talat Jan 13 '15 at 12:37
  • @docendodiscimus yes, thats looking promising with this simplified example, I'll investigate it with my larger data frame (stupidly I am already aware of `mutate()` but wasn't thinking in terms of adding columns to my existing data frame, instead aiming for a separate data frame with imputed values to fit in with the rest of my work flow). – slackline Jan 13 '15 at 12:49
  • @docendodiscimus @shadow yes, my apologies, I've conflated commands from `xts` with those from `zoo` (confused myself reading that `xts` extends the time-series functions in `zoo`) – slackline Jan 13 '15 at 12:56
  • `Error in xts(., .$value, .$event.date)` shows that `magrittr`/`dplyr` is passing `.` as the first argument to `xts()`, when you really wanted `xts(.$value, .$event.date)`. – Joshua Ulrich Jan 13 '15 at 13:07
  • @JoshuaUlrich I did use `xts(.$value, .$event.date)` though, its unclear to me why its interpreted as `xts(., .$value, .$event.date)`. Do you have any idea why this might be? – slackline Jan 13 '15 at 13:17
  • I don't use dplyr or magrittr, so this is just a guess, but it's probably because `%>%` passes the entire object as a first argument unless you use named arguments to specify otherwise. So you probably need something like `xts(x=.$value, order.by=.$event.date)`. – Joshua Ulrich Jan 13 '15 at 13:26
  • Looking promising using @docendodiscimus suggestion of using `mutate()` to modify the existing data frame, although I've got some problematic instances where there is only one data point so no interpolation can be achieved. Worked round some problems by increasing the `maxgap = 4` and setting the `rule = 2` to deal with missing data points at first and last observation times. – slackline Jan 13 '15 at 14:00
  • @JoshuaUlrich tried that but `mutate()` doesn't seem to like it, complaining with `no applicable method for 'mutate_' applied to an object of class "c('xts', 'zoo')"`. Fortunately it seems I don't need to use the call to `xts()` though. – slackline Jan 13 '15 at 14:02

2 Answers2

17

Use the approx() function for linear-interpolation:

df %>%
  group_by(variable) %>%
    arrange(variable, event.date) %>%
    mutate(time=seq(1,n())) %>%
      mutate(ip.value=approx(time,value,time)$y) %>%
      select(-time)

or the spline function for non-linear interpolation:

df %>%
  group_by(variable) %>%
    arrange(variable, event.date) %>%
    mutate(time=seq(1,n())) %>%
      mutate(ip.value=spline(time,value ,n=n())$y) %>%
      select(-time)
sumtxt
  • 181
  • 1
  • 4
  • Worked for me. I don't think you need the 'arrange' step for the interpolation to work correctly. – Scrope Feb 23 '23 at 10:55
15

The solution I've gone with is based on the first comment from @docendodiscimus

Rather than attempt to create a new data frame as I'd been doing this approach simply adds columns to the existing data frame by taking advantage of dplyr's mutate() function.

My code is now...

df %>%
  group_by(variable) %>%
    arrange(variable, event.date) %>%
      mutate(ip.value = na.approx(value, maxgap = 4, rule = 2))

The maxgap allows upto four consecutive NA's, whilst the rule option allows extrapolation into the flanking time points.

slackline
  • 2,295
  • 4
  • 28
  • 43