2

I'm trying to add missing lines for "day" and extrapolate the data for "value". In my data each subject ("id") has 2 periods (period 1 and period 2) and values for consecutive days.

An example of my data looks like this:

df <- data.frame(
  id  =    c(1,1,1,1,  1,1,1,1,  2,2,2,2,  2,2,2,2,  3,3,3,3,  3,3,3,3),
  period = c(1,1,1,1,  2,2,2,2,  1,1,1,1,  2,2,2,2,  1,1,1,1,  2,2,2,2),
  day=     c(1,2,4,5,  1,3,4,5,  2,3,4,5,  1,2,3,5,  2,3,4,5,  1,2,3,4),
  value   =c(10,12,15,16, 11,14,15,17, 13,14,15,16, 15,16,18,20,  16,17,19,29, 14,16,18,20))

For each id and period I am missing data for days 3,2,1,4,1,5, respectively. I want to expand the data to let's say 10 days and extrapolate the data on value column (e.g. with linear regression).

My final df should be something like that:

df2 <- data.frame(
  id  =    c(1,1,1,1,1,1,1,         1,1,1,1,1,1,1,         2,2,2,2,2,2,2,        2,2,2,2,2,2,2,         3,3,3,3,3,3,3,         3,3,3,3,3,3,3),
  period = c(1,1,1,1,1,1,1,         2,2,2,2,2,2,2,         1,1,1,1,1,1,1,        2,2,2,2,2,2,2,         1,1,1,1,1,1,1,         2,2,2,2,2,2,2),
  day=     c(1,2,3,4,5,6,7,         1,2,3,4,5,6,7,         1,2,3,4,5,6,7,        1,2,3,4,5,6,7,         1,2,3,4,5,6,7,         1,2,3,4,5,6,7),
  value   =c(10,12,13,15,16,17,18,  11,12,14,15,17,18,19,  12,13,14,15,16,18,22, 15,16,18,19,20,22,23,  15,16,17,19,29,39,49,  14,16,18,20,22,24,26))

The most similar example I found doesn't extrapolate by two variables (ID and period in my case), it extrapolates only by year. I tried to adapt the code but no success :(

Another example extrapolates the data by multiple id but doesn't add rows for missing data.

I couldn't combine both codes with my limited experience in R. Any suggestions? Thanks in advance...

Martijn Pieters
  • 1,048,767
  • 296
  • 4,058
  • 3,343
Murat
  • 35
  • 4

2 Answers2

1

We can use complete

library(dplyr)
library(tidyr)
library(forecast)
df %>% 
    group_by(id, period) %>% 
    complete(day =1:7)%>% 
    mutate(value = as.numeric(na.interp(value)))
akrun
  • 874,273
  • 37
  • 540
  • 662
  • This solved the first part, but couldn't extrapolate with: df3 <- df2[, value := ifelse(is.na(value), predict(lm(value ~ period, data=.SD), .SD), value), by=id] – Murat Mar 24 '20 at 19:17
  • @Murat the `na.interp` is doing the linear interpoolation – akrun Mar 24 '20 at 19:19
  • 1
    I did not realize your edit at the first look, sorry. Great solution! Thank you so much! – Murat Mar 24 '20 at 19:26
  • actually I have another problem, I realized that other columns that I do not need extrapolation (which were not included in the sample data) were not extrapolated, and I am having trouble to complete extend them. Same values should continue in the added rows. I'm not sure if I should open another question, or would be able to comment on this issue as well? – Murat Mar 26 '20 at 14:51
1

@akrun's answer is good, as long as you don't mind using linear interpolation. However, if you do want to use a linear model, you could try this data.table approach.

library(data.table)
model <- lm(value ~ day + period + id,data=df)
dt <- as.data.table(df)[,.SD[,.(day = 1:7,value = value[match(1:7,day)])],by=.(id,period)]
dt[is.na(value), value := predict(model,.SD),]
dt
    id period day    value
 1:  1      1   1 10.00000
 2:  1      1   2 12.00000
 3:  1      1   3 12.86714
 4:  1      1   4 15.00000
 5:  1      1   5 16.00000
 6:  1      1   6 18.13725
 7:  1      1   7 19.89396
 8:  1      2   1 11.00000
 9:  1      2   2 12.15545
10:  1      2   3 14.00000
11:  1      2   4 15.00000
12:  1      2   5 17.00000
13:  1      2   6 19.18227
14:  1      2   7 20.93898
15:  2      1   1 11.90102
16:  2      1   2 13.00000
17:  2      1   3 14.00000
18:  2      1   4 15.00000
19:  2      1   5 16.00000
20:  2      1   6 20.68455
21:  2      1   7 22.44125
22:  2      2   1 15.00000
23:  2      2   2 16.00000
24:  2      2   3 18.00000
25:  2      2   4 18.21616
26:  2      2   5 20.00000
27:  2      2   6 21.72957
28:  2      2   7 23.48627
29:  3      1   1 14.44831
30:  3      1   2 16.00000
31:  3      1   3 17.00000
32:  3      1   4 19.00000
33:  3      1   5 29.00000
34:  3      1   6 23.23184
35:  3      1   7 24.98855
36:  3      2   1 14.00000
37:  3      2   2 16.00000
38:  3      2   3 18.00000
39:  3      2   4 20.00000
40:  3      2   5 22.52016
41:  3      2   6 24.27686
42:  3      2   7 26.03357
    id period day    value
Ian Campbell
  • 23,484
  • 14
  • 36
  • 57