
So, I have the following problem: I have a data set, A (data.table object), of the following structure:

date       days     rate
1996-01-02    9 5.763067
1996-01-02   15 5.745902
1996-01-02   50 5.673317
1996-01-02   78 5.608884
1996-01-02  169 5.473762
1996-01-03    9 5.763067
1996-01-03   14 5.747397
1996-01-03   49 5.672263
1996-01-03   77 5.603705
1996-01-03  168 5.470584
1996-01-04   11 5.729460
1996-01-04   13 5.726104
1996-01-04   48 5.664931
1996-01-04   76 5.601891
1996-01-04  167 5.468961

Note that the days values and their number may differ from date to date. My goal is now to (piecewise linearly) interpolate rate along days. I am doing this for each date via

approx(x=A[,days],y=A[,rate],xout=days_vec,rule=2)

where days_vec <- min_days:max_days, i.e. the days range I am interested in (say 1:100).
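
As a quick illustration of what this call returns for the first two rows of A: rule=2 clamps to the nearest endpoint outside range(x), i.e. constant rather than linear extrapolation.

approx(x = c(9, 15), y = c(5.763067, 5.745902), xout = c(1, 8, 12), rule = 2)$y
# 5.763067 5.763067 5.754485   (days 1 and 8 are clamped to the day-9 value)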

I have two problems here:

  1. approx only interpolates, i.e. it does not extrapolate linearly beyond min(x) and max(x). If I am interested in days 1:100, I first need to handle days 1:8 by hand, using days 9 and 15 (the first two rows of A) via:

    first_days <- 1:(A[1, days] - 1) # 1:8
    rate_vec[first_days] <- A[1, rate] +
      (first_days - A[1, days]) / (A[2, days] - A[1, days]) * (A[2, rate] - A[1, rate])

and then using the approx line above for rate_vec[9:100]. Is there a way of doing this in one step?

  2. Given that I need two steps and that the switch point between them (here day 9) differs across dates, I cannot see a pure data.table implementation, although that would be vastly preferred (using data.table methods to interpolate/extrapolate and then returning the expanded data.table object). Thus, I currently run a for loop over the dates (sketched below), which is of course much, much slower.
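
For illustration, here is a minimal sketch of that loop (res, days_vec and A_expanded are just expository names; it assumes days_vec starts at 1, so day values double as vector indices):

library(data.table)
days_vec <- 1:100
res <- list()
for (d in as.character(A[, unique(date)])) {
  Ad <- A[date == as.Date(d)]
  # step 1: interpolate; rule=2 clamps outside range(days)
  rate_vec <- approx(x = Ad[, days], y = Ad[, rate], xout = days_vec, rule = 2)$y
  # step 2: overwrite the clamped left tail with the extended first segment
  first_days <- days_vec[days_vec < Ad[1, days]]
  rate_vec[first_days] <- Ad[1, rate] +
    (first_days - Ad[1, days]) / (Ad[2, days] - Ad[1, days]) * (Ad[2, rate] - Ad[1, rate])
  res[[d]] <- data.table(date = as.Date(d), days = days_vec, rate = rate_vec)
}
A_expanded <- rbindlist(res)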

Question: Can the above be implemented more elegantly, and is it somehow doable with data.table methods instead of looping through A?

Daedalus
  • Can you be a little more clear about what you are trying to do? It sounds like you are trying to do three things: (1) create additional rows for values of `days` that are not present in the table, (2) fit an `lm` for `rate ~ days` for each group (defined by `date`), (3) use `predict.lm` to create values of `rate` for `days` where `rate` is missing – C8H10N4O2 Sep 05 '17 at 14:54

1 Answer


How about something like this?

# please try to make a fully reproducible example!
library(data.table)
df <- fread(input=
"date       days     rate
1996-01-02    9 5.763067
1996-01-02   15 5.745902
1996-01-02   50 5.673317
1996-01-02   78 5.608884
1996-01-02  169 5.473762
1996-01-03    9 5.763067
1996-01-03   14 5.747397
1996-01-03   49 5.672263
1996-01-03   77 5.603705
1996-01-03  168 5.470584
1996-01-04   11 5.729460
1996-01-04   13 5.726104
1996-01-04   48 5.664931
1996-01-04   76 5.601891
1996-01-04  167 5.468961")
df[,date := as.Date(date)]

1. Create NA values of rate for days in range 1:100 that aren't in the dataset

df <- 
  merge(df,
        expand.grid( days=1L:100L,                   # whatever range you are interested in
                     date=df[,sort(unique(date))] ), # dates with at least one observation
        all=TRUE # "outer join" on all common columns (date, days)
        )
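
(If you prefer, data.table's own cross-join CJ() builds the same grid; this should be equivalent to the expand.grid() call above:)

df <- merge(df,
            CJ( date=df[,sort(unique(date))],
                days=1L:100L ),
            by=c('date','days'),
            all=TRUE)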

2. For each value of date, use a linear model to predict NA values of rate.

df[, rate := ifelse(is.na(rate),
                    predict(lm(rate~days,.SD),.SD), # impute NA w/ lm using available data
                    rate),                          # if not NA, don't impute
   keyby=date]

Gives you:

head(df,10)
#           date days     rate
#  1: 1996-01-02    1 5.766787 <- rates for days 1-8 & 10 are imputed 
#  2: 1996-01-02    2 5.764987
#  3: 1996-01-02    3 5.763186
#  4: 1996-01-02    4 5.761385
#  5: 1996-01-02    5 5.759585
#  6: 1996-01-02    6 5.757784
#  7: 1996-01-02    7 5.755983
#  8: 1996-01-02    8 5.754183
#  9: 1996-01-02    9 5.763067 <- this rate was given
# 10: 1996-01-02   10 5.750581

If there are values of date without at least two observations of rate, you will probably get an error because you won't have enough points to fit a line.
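
If that's a concern, one guard (just a sketch) is to restrict the imputation to dates with at least two observed rates:

ok <- df[!is.na(rate), .N, by=date][N >= 2L, date]
df[date %in% ok,
   rate := ifelse(is.na(rate),
                  predict(lm(rate~days,.SD),.SD),
                  rate),
   by=date]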

Alternative: Rolling joins solution for piecewise linear interpolation

This requires rolling joins to the left and to the right, and then an average of the two results that ignores NA values. (Strictly speaking, the unweighted average of the two neighbours agrees with linear interpolation only midway between observations; a distance-weighted average would reproduce it exactly.)

This doesn't do well for extrapolation, though, since it just gives a constant (the first or last observation) outside the observed range of days.
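
If rolling joins are new to you, here is a tiny standalone illustration of roll= in a keyed join (toy tables obs and qry, with values from the first two rows of the data):

obs <- data.table(days=c(9L,15L), rate=c(5.763067,5.745902), key='days')
qry <- data.table(days=10L)
obs[qry, roll=+Inf] # LOCF: day 10 gets the day-9 rate, 5.763067
obs[qry, roll=-Inf] # NOCB: day 10 gets the day-15 rate, 5.745902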

setkey(df, date, days)

df2 <- data.table( # this is your framework of date/days pairs you want to evaluate
        expand.grid( date=df[,sort(unique(date))],
                     days=1L:100L),
        key = c('date','days')
)

# average of non-NA values between two vectors
meanIfNotNA <- function(x,y){
  (ifelse(is.na(x),0,x) + ifelse(is.na(y),0,y)) /
    ( as.numeric(!is.na(x)) + as.numeric(!is.na(y)))
}
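
# quick sanity check: NaN only if both inputs are NA (which can't happen
# here, since every day has an observation on at least one side)
meanIfNotNA(c(1, NA, 4, NA), c(3, 2, NA, NA))
# 2 2 4 NaN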

df3 <- # this is your evaluations for the date/days pairs in df2.
  setnames(
    df[setnames( df[df2, roll=+Inf], # rolling join Last Obs Carried Fwd (LOCF)
                 old = 'rate',
                 new = 'rate_locf' 
               ),
     roll=-Inf], # rolling join Next Obs Carried Backwd (NOCB)
    old = 'rate',
    new = 'rate_nocb'
  )[, rate := meanIfNotNA(rate_locf,rate_nocb)]
# once you're satisfied that this works, you can include rate_locf := NULL, etc.
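# e.g.: df3[, c('rate_locf','rate_nocb') := NULL]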

head(df3,10)
#          date days rate_nocb rate_locf     rate
#  1: 1996-01-02    1  5.763067        NA 5.763067
#  2: 1996-01-02    2  5.763067        NA 5.763067
#  3: 1996-01-02    3  5.763067        NA 5.763067
#  4: 1996-01-02    4  5.763067        NA 5.763067
#  5: 1996-01-02    5  5.763067        NA 5.763067
#  6: 1996-01-02    6  5.763067        NA 5.763067
#  7: 1996-01-02    7  5.763067        NA 5.763067
#  8: 1996-01-02    8  5.763067        NA 5.763067
#  9: 1996-01-02    9  5.763067  5.763067 5.763067  <- this rate was given
# 10: 1996-01-02   10  5.745902  5.763067 5.754485
C8H10N4O2
    I didn't see the "piecewise linear" requirement above -- if this doesn't work for you and you really just want information from the next observations to the left and right, I can edit to incorporate rolling joins. But please explain how you want to handle endpoints first (i.e. missing values before first observation and after last observation). – C8H10N4O2 Sep 05 '17 at 15:36
  • Great, thanks!! This definitely helps a lot and runs very smoothly. The reason I wanted it piecewise linear is that the rates do not follow a linear pattern but rather a concave one, which I wanted to approximate by several piecewise linear interpolations to get closer to the true function. My idea was to handle missing values by continuing the linear trend between the previous two nodes (see linear interpolation function for case 1:8 above). Unfortunately, I am not yet very familiar with the rolling joins of the data.table package, so if you could edit this above that'd be great! – Daedalus Sep 05 '17 at 17:56
  • An alternative would be to apply your approach to the whole curve but with splines instead of the linear models, predicting the NA values from the smoothed splines. I will look into this as well. – Daedalus Sep 05 '17 at 18:27
  • @Daedalus here's the rolling joins. Note that it uses nearest two points for linear **interpolation** but only one for **extrapolation** which just gives you a constant. Depending on how much you care about extrapolation, you might be better off with the splines or some really custom approach. You could also combine to do interpolation with rolling joins and then something else for extrapolation. The purpose of this answer was to help you do things efficiently in data.table. – C8H10N4O2 Sep 05 '17 at 18:58
  • Thanks caffeine guy! That helped a lot. I think I will try both the spline version with data.table (which should be implementable in the same way as the linear models I think) and the linear rolling joins, and compare what fits better. Cheers – Daedalus Sep 05 '17 at 19:17