3

I have data for the first 7 months of a year, and would like to linearly extrapolate the data for the last 5 months. I have tried using the approxExtrap function from the Hmisc package but am not sure how to employ it, particularly what to specify for xout. I'm open to any solution (dplyr-based would be ideal if possible). Thank you for your time.

Here is a sample of the data:


df <- tibble(pop = c(18968, 18956, 18946, 18934, 18923, 18912, 18901, NA, NA, NA, NA, NA),
                       date = c("2020-01-01", "2020-02-01", "2020-03-01", 
                                "2020-04-01", "2020-05-01", "2020-06-01", 
                                "2020-07-01", "2020-08-01", "2020-09-01", 
                                "2020-10-01", "2020-11-01", "2020-12-01"))
df$date <- lubridate::as_date(df$date)

dd_data
  • 93
  • 5

3 Answers3

2

If you want linear interpolation, lm will create a fitted object and you can predict from it.

na is the new data passed on to predict.lm as argument newdata.

df <- data.frame(pop = c(18968, 18956, 18946, 18934, 18923, 18912, 18901, NA, NA, NA, NA, NA),
                 date = c("2020-01-01", "2020-02-01", "2020-03-01", 
                          "2020-04-01", "2020-05-01", "2020-06-01", 
                          "2020-07-01", "2020-08-01", "2020-09-01", 
                          "2020-10-01", "2020-11-01", "2020-12-01"))
df$date <- lubridate::as_date(df$date)

fit <- lm(pop ~ date, df)
na <- df[is.na(df$pop), "date", drop = FALSE]
newpop <- predict(fit, newdata = na)
na <- cbind(na, pop = newpop)
na
#>          date      pop
#> 8  2020-08-01 18889.45
#> 9  2020-09-01 18878.06
#> 10 2020-10-01 18867.03
#> 11 2020-11-01 18855.64
#> 12 2020-12-01 18844.61

plot(pop ~ date, df, ylim = c(18800, 19000), pch = 19)
points(pop ~ date, na, col = "red", pch = 19)
abline(fit)

Created on 2022-12-20 with reprex v2.0.2

Rui Barradas
  • 70,273
  • 8
  • 34
  • 66
  • Thank you. I should have included this in my question, but I actually have the same data across many age group and sex combinations. I grouped the data (using group_by) before implementing this solution, but the age_group and sex variables were gone in the resulting na data frame. Do you know how I could retain those? – dd_data Dec 20 '22 at 18:37
2

Run lm and predict and then use coalesce to combine the known and predicted values.

library(dplyr)
df %>%
  mutate(pop2 = coalesce(pop, predict(lm(pop ~ date), across(date))))

giving the following where pop2 is pop with the NA's filled in with predicted values.

# A tibble: 12 × 3
     pop date         pop2
   <dbl> <date>      <dbl>
 1 18968 2020-01-01 18968 
 2 18956 2020-02-01 18956 
 3 18946 2020-03-01 18946 
 4 18934 2020-04-01 18934 
 5 18923 2020-05-01 18923 
 6 18912 2020-06-01 18912 
 7 18901 2020-07-01 18901 
 8    NA 2020-08-01 18889.
 9    NA 2020-09-01 18878.
10    NA 2020-10-01 18867.
11    NA 2020-11-01 18856.
12    NA 2020-12-01 18845.
G. Grothendieck
  • 254,981
  • 17
  • 203
  • 341
  • Thank you. I like this solution, however I realize I did not ask the question to reflect what I really need. I've posted a follow up question [here](https://stackoverflow.com/questions/74867473/linear-interpolating-and-extrapolating-across-dates-in-r). When I try to implement this in my current data (which I share in the linked question), something goes wrong and the predicted values are not consistent with the actual data. – dd_data Dec 20 '22 at 18:36
  • When I run the code in this answer with the data in your other post the line goes right through the data points so either you have made an error or else the data does not well represent what you have. With the data you have shown the following is not needed but if you want to try something else try using na.approx from the zoo package to fill in the internal NAs and then use the extrapolation here to fill in the rest. – G. Grothendieck Dec 20 '22 at 18:46
  • You're right. I see now that the problem arises when I have more than 2 actual data points. I've updated the question to reflect that. Thanks again. – dd_data Dec 20 '22 at 20:03
  • Try `library(zoo); df %>% mutate(pop2 = na.spline(pop, date))` – G. Grothendieck Dec 20 '22 at 20:23
  • Thank you - what worked best was using na.approx to fill the internal values and then using na.spline to fill the few external values. When only asked to extrapolate out from one point, na.spline appears to do so linearly, which is what I was looking for. – dd_data Dec 22 '22 at 01:58
0

In case you want to use approxExtrap() function, xout is a numeric vector of x values where you want to put the extrapolated y values. In your case, it is the 8th to 12th values of date column.

extrap_8to12 <- approxExtrap(df$date, df$pop, xout = df$date[8:12]) 
df$pop[8:12] <- extrap_8to12$y

df
# A tibble: 12 × 2
     pop date      
   <dbl> <date>    
 1 18968 2020-01-01
 2 18956 2020-02-01
 3 18946 2020-03-01
 4 18934 2020-04-01
 5 18923 2020-05-01
 6 18912 2020-06-01
 7 18901 2020-07-01
 8 18901 2020-08-01
 9 18901 2020-09-01
10 18901 2020-10-01
11 18901 2020-11-01
12 18901 2020-12-01
Abdur Rohman
  • 2,691
  • 2
  • 7
  • 12
  • I had tried this but all extrapolated values are equal to the last non-NA. – Rui Barradas Dec 20 '22 at 07:49
  • 1
    Yes, that's true. My sole purpose here is to explain how to use this function because OP said "not sure how to employ it, particularly what to specify for xout". – Abdur Rohman Dec 20 '22 at 08:46