Join residual by group in summary Forecast table in R

Question

reproducible example

df=structure(list(group = c(1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 
                            2L, 2L, 2L), year = c(1973L, 1974L, 1975L, 1976L, 1977L, 1978L, 
                                                  1973L, 1974L, 1975L, 1976L, 1977L, 1978L), Jan = c(9007L, 7750L, 
                                                                                                     8162L, 7717L, 7792L, 7836L, 9007L, 7750L, 8162L, 7717L, 7792L, 
                                                                                                     7836L), Feb = c(8106L, 6981L, 7306L, 7461L, 6957L, 6892L, 8106L, 
                                                                                                                     6981L, 7306L, 7461L, 6957L, 6892L), Mar = c(8928L, 8038L, 8124L, 
                                                                                                                                                                 7767L, 7726L, 7791L, 8928L, 8038L, 8124L, 7767L, 7726L, 7791L
                                                                                                                     ), Apr = c(9137L, 8422L, 7870L, 7925L, 8106L, 8192L, 9137L, 8422L, 
                                                                                                                                7870L, 7925L, 8106L, 8192L), May = c(10017L, 8714L, 9387L, 8623L, 
                                                                                                                                                                     8890L, 9115L, 10017L, 8714L, 9387L, 8623L, 8890L, 9115L), Jun = c(10826L, 
                                                                                                                                                                                                                                       9512L, 9556L, 8945L, 9299L, 9434L, 10826L, 9512L, 9556L, 8945L, 
                                                                                                                                                                                                                                       9299L, 9434L), Jul = c(11317L, 10120L, 10093L, 10078L, 10625L, 
                                                                                                                                                                                                                                                              10484L, 11317L, 10120L, 10093L, 10078L, 10625L, 10484L), Aug = c(10744L, 
                                                                                                                                                                                                                                                                                                                               9823L, 9620L, 9179L, 9302L, 9827L, 10744L, 9823L, 9620L, 9179L, 
                                                                                                                                                                                                                                                                                                                               9302L, 9827L), Sep = c(9713L, 8743L, 8285L, 8037L, 8314L, 9110L, 
                                                                                                                                                                                                                                                                                                                                                      9713L, 8743L, 8285L, 8037L, 8314L, 9110L), Oct = c(9938L, 9129L, 
                                                                                                                                                                                                                                                                                                                                                                                                         8466L, 8488L, 8850L, 9070L, 9938L, 9129L, 8466L, 8488L, 8850L, 
                                                                                                                                                                                                                                                                                                                                                                                                         9070L), Nov = c(9161L, 8710L, 8160L, 7874L, 8265L, 8633L, 9161L, 
                                                                                                                                                                                                                                                                                                                                                                                                                         8710L, 8160L, 7874L, 8265L, 8633L), Dec = c(8927L, 8680L, 8034L, 
                                                                                                                                                                                                                                                                                                                                                                                                                                                                     8647L, 8796L, 9240L, 8927L, 8680L, 8034L, 8647L, 8796L, 9240L
                                                                                                                                                                                                                                                                                                                                                                                                                         )), .Names = c("group", "year", "Jan", "Feb", "Mar", "Apr", "May", 
                                                                                                                                                                                                                                                                                                                                                                                                                                        "Jun", "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"), class = "data.frame", row.names = c(NA, 
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              -12L))

Perfrom Forecat by group

library(forecast)
ld <- split(df[, -1], df$group)
ld <- lapply(ld, function(x) {ts(c(t(x[,-1])), start = min(x[,1]), frequency = 12)})

lts <- lapply(ld, ets, model = "ZZZ")

So result

$`1`
         Point Forecast     Lo 80     Hi 80    Lo 95     Hi 95
Jan 1979       8397.497  8022.399  8772.595 7823.834  8971.160
Feb 1979       7599.221  7162.825  8035.616 6931.812  8266.630
Mar 1979       8396.595  7906.510  8886.679 7647.075  9146.115
Apr 1979       8646.510  8108.063  9184.957 7823.026  9469.994

From 1979 year, it is forecasted values, i want get result of residuals for 1973-1978.(initiall values)

res <- lapply(lts, residuals)

and the result

$`1`
            Jan        Feb        Mar        Apr        May        Jun        Jul        Aug        Sep        Oct        Nov
1973  497.69233   99.50607   64.44947  -15.20925   77.85009  390.89045 -277.67369   26.92614   72.42590  -85.69894 -338.10035

and so on

Questions 1. How result of residual to join in summary table. For example something like this

Question: For 1979 and more we see forecasted value, but for 1973-1978 in the column point forecast we see the residuals. Ideally, of course, get not so much residual, but the original values and forecasted values. So i don't know how for initiall data 1973-1978 join in summary table original values something like this df[df$year == 1973,]but how for all year... Then from original values subtract residial and got forecasted value (Maybe I complicate the task a lot, but otherwise I don’t know how to get the desired output)

colnames point forecast,lo80 and hi80 is not need for changing, i will be remember that for initial values they mean residual, original and forecasted.

Is it possible to do it using dplyr or data.table solution?

# Tidy-up the splits
ld <- lapply(ld, function(x) {
  x %>%
    gather(key, value, -year) %>%
    unite(date, year, key, sep = "-") %>%
    mutate(date = paste0(date, "-01")) %>%
    mutate(date = as.Date(date, format = "%Y-%b-%d"))    
})

the result

$`1`
   date value
1  <NA>  9007
2  <NA>  7750
3  <NA>  8162
4  <NA>  7717
5  <NA>  7792
6  <NA>  7836
7  <NA>  8106
8  <NA>  6981
9  <NA>  7306
10 <NA>  7461
11 <NA>  6957
12 <NA>  6892


ld=dput()
ld <- lapply(ld, function(x) {
  yr <- lubridate::year(min(x$date))
  mth <- lubridate::month(min(x$date))
  timetk::tk_ts(data = x, select = value, frequency = 12,
                start = c(yr, mth))
})

error

 Error in x$date : $ operator is invalid for atomic vectors

edit3

> lts_all <- lapply(names(lts), function(x, lts) {
+   output_fit <- lts[[x]][["res_fit_tbl"]] %>%
+     mutate(group = x)
+   output_fcst <- lts[[x]][["res_fcst_tbl"]] %>%
+     mutate(group = x)
+   
+   return(list(output_fit=output_fit, output_fcst=output_fcst))
+ }, lts)
> lts_all
[[1]]
[[1]]$output_fit
# A tibble: 72 x 6
   date       value residuals CI95_upper CI95_lower group
   <date>     <dbl>     <dbl>      <dbl>      <dbl> <chr>
 1 1973-01-01  8509     498         9083       7936 value
 2 1973-02-01  8006      99.5       8580       7433 value
 3 1973-03-01  8864      64.4       9437       8290 value
 4 1973-04-01  9152    - 15.2       9726       8579 value
 5 1973-05-01  9939      77.9      10513       9365 value
 6 1973-06-01 10435     391        11009       9861 value
 7 1973-07-01 11595    -278        12168      11021 value
 8 1973-08-01 10717      26.9      11291      10143 value
 9 1973-09-01  9641      72.4      10214       9067 value
10 1973-10-01 10024    - 85.7      10597       9450 value
# ... with 62 more rows

NRLP · Accepted Answer · 2018-12-19T09:53:24.910

Here it is a complete solution starting from df, the reproducible example:

# load libraries
load_pkgs <- c("forecast", "zoo", "timetk", "tidyverse") 
sapply(load_pkgs, function(x) suppressPackageStartupMessages(library(x, character.only = T)))

Step 1: Pre-processing

# perform split by group
ld <- split(df[, -1], df$group)

# Tidy-up the splits
ld <- lapply(ld, function(x) {
  x %>%
    gather(key, value, -year) %>%
    unite(date, year, key, sep = "-") %>%
    mutate(date = paste0(date, "-01")) %>%
    mutate(date = as.Date(date, format = "%Y-%b-%d"))    
})

dput first ts:

structure(list(date = structure(c(1096, 1461, 1826, 2191, 2557, 
                                  2922, 1127, 1492, 1857, 2222, 2588, 2953, 1155, 1520, 1885, 2251, 
                                  2616, 2981, 1186, 1551, 1916, 2282, 2647, 3012, 1216, 1581, 1946, 
                                  2312, 2677, 3042, 1247, 1612, 1977, 2343, 2708, 3073, 1277, 1642, 
                                  2007, 2373, 2738, 3103, 1308, 1673, 2038, 2404, 2769, 3134, 1339, 
                                  1704, 2069, 2435, 2800, 3165, 1369, 1734, 2099, 2465, 2830, 3195, 
                                  1400, 1765, 2130, 2496, 2861, 3226, 1430, 1795, 2160, 2526, 2891, 
                                  3256), class = "Date"), value = c(9007L, 7750L, 8162L, 7717L, 
                                                                    7792L, 7836L, 8106L, 6981L, 7306L, 7461L, 6957L, 6892L, 8928L, 
                                                                    8038L, 8124L, 7767L, 7726L, 7791L, 9137L, 8422L, 7870L, 7925L, 
                                                                    8106L, 8192L, 10017L, 8714L, 9387L, 8623L, 8890L, 9115L, 10826L, 
                                                                    9512L, 9556L, 8945L, 9299L, 9434L, 11317L, 10120L, 10093L, 10078L, 
                                                                    10625L, 10484L, 10744L, 9823L, 9620L, 9179L, 9302L, 9827L, 9713L, 
                                                                    8743L, 8285L, 8037L, 8314L, 9110L, 9938L, 9129L, 8466L, 8488L, 
                                                                    8850L, 9070L, 9161L, 8710L, 8160L, 7874L, 8265L, 8633L, 8927L, 
                                                                    8680L, 8034L, 8647L, 8796L, 9240L)), class = "data.frame", row.names = c(NA, 
                                                                                                                                             -72L))

Then

# Transform time series to ts objects
ld <- lapply(ld, function(x) {
  yr <- lubridate::year(min(x$date))
  mth <- lubridate::month(min(x$date))
  timetk::tk_ts(data = x, select = value, frequency = 12,
                start = c(yr, mth))
})

dput first ts:

structure(c(9007L, 8106L, 8928L, 9137L, 10017L, 10826L, 11317L, 
            10744L, 9713L, 9938L, 9161L, 8927L, 7750L, 6981L, 8038L, 8422L, 
            8714L, 9512L, 10120L, 9823L, 8743L, 9129L, 8710L, 8680L, 8162L, 
            7306L, 8124L, 7870L, 9387L, 9556L, 10093L, 9620L, 8285L, 8466L, 
            8160L, 8034L, 7717L, 7461L, 7767L, 7925L, 8623L, 8945L, 10078L, 
            9179L, 8037L, 8488L, 7874L, 8647L, 7792L, 6957L, 7726L, 8106L, 
            8890L, 9299L, 10625L, 9302L, 8314L, 8850L, 8265L, 8796L, 7836L, 
            6892L, 7791L, 8192L, 9115L, 9434L, 10484L, 9827L, 9110L, 9070L, 
            8633L, 9240L), .Dim = c(72L, 1L), .Dimnames = list(NULL, "value"), index = structure(c(94694400, 
                                                                                                   97372800, 99792000, 102470400, 105062400, 107740800, 110332800, 
                                                                                                   113011200, 115689600, 118281600, 120960000, 123552000, 126230400, 
                                                                                                   128908800, 131328000, 134006400, 136598400, 139276800, 141868800, 
                                                                                                   144547200, 147225600, 149817600, 152496000, 155088000, 157766400, 
                                                                                                   160444800, 162864000, 165542400, 168134400, 170812800, 173404800, 
                                                                                                   176083200, 178761600, 181353600, 184032000, 186624000, 189302400, 
                                                                                                   191980800, 194486400, 197164800, 199756800, 202435200, 205027200, 
                                                                                                   207705600, 210384000, 212976000, 215654400, 218246400, 220924800, 
                                                                                                   223603200, 226022400, 228700800, 231292800, 233971200, 236563200, 
                                                                                                   239241600, 241920000, 244512000, 247190400, 249782400, 252460800, 
                                                                                                   255139200, 257558400, 260236800, 262828800, 265507200, 268099200, 
                                                                                                   270777600, 273456000, 276048000, 278726400, 281318400), tzone = "UTC", tclass = "Date"), .indexCLASS = "Date", tclass = "Date", .indexTZ = "UTC", tzone = "UTC", class = "ts", .Tsp = c(1973, 
                                                                                                                                                                                                                                                                                           1978.91666666667, 12))

Step 2: Train and forecast with ets

You need a helping function to transform your output to data frame:

# helping function
make_df <- function(ts_obj) {

  ts_df <- timetk::tk_tbl(preserve_index = TRUE, ts_obj) %>%
    mutate(index = zoo::as.Date(x = .$index, frac = 0)) %>% 
    dplyr::rename(date = index)

  return(ts_df)
}

The following function trains ets and forecasts the next 12 months; then, it prepares tables with the fitted and forecasting values:

lts <- lapply(ld, function(ts_obj) {
# train ets model and get fitted results
res_model <- ets(ts_obj, model = "ZZZ")
res_fit <- ts(as.numeric(res_model$fitted), start = start(ts_obj), frequency = 12)

# add extra metrics you may be interested in
model <- res_model[["method"]]
mse <- res_model[["mse"]]

# get forecasts for the next 12 months
res_fct <- forecast(res_model, h = 12)
res_fcst <- ts(res_fct$mean, start = end(ts_obj) + 1/12, frequency = 12)

# transform results to tbl
# for fitted output we keep the residuals and the 95% CI
res_fit_tbl <- make_df(res_fit) %>%
  mutate(residuals = as.numeric(res_model[["residuals"]])) %>%
  mutate(CI95_upper = value + 1.96*sqrt(res_model$sigma2), 
         CI95_lower = value - 1.96*sqrt(res_model$sigma2))
# the forecast output does not have residuals
res_fcst_tbl <- make_df(res_fcst)

return(list(res_fit_tbl = res_fit_tbl, res_fcst_tbl = res_fcst_tbl, model = model, mse = mse)) # don't forget to pass the extra metrics as output
})

Step 3: Bring together the fitted and forecasting outputs across different groups

# add groups back + other metrics of interest
lts_all <- lapply(names(lts), function(x, lts) {
  output_fit <- lts[[x]][["res_fit_tbl"]] %>%
    mutate(group = x,
           model = lts[[x]][["model"]],
           mse = lts[[x]][["mse"]])
  output_fcst <- lts[[x]][["res_fcst_tbl"]] %>%
    mutate(group = x)

  return(list(output_fit=output_fit, output_fcst=output_fcst))
  }, lts)

Sample output:

> lts_all[[1]][["output_fit"]]
# A tibble: 72 x 6
   date        value residuals CI95_upper CI95_lower group
   <date>      <dbl>     <dbl>      <dbl>      <dbl> <chr>
 1 1973-01-01  8509.     498.       9083.      7936. 1    
 2 1973-02-01  8006.      99.5      8580.      7433. 1    
 3 1973-03-01  8864.      64.4      9437.      8290. 1    
 4 1973-04-01  9152.     -15.2      9726.      8579. 1    
 5 1973-05-01  9939.      77.9     10513.      9365. 1    
 6 1973-06-01 10435.     391.      11009.      9861. 1    
 7 1973-07-01 11595.    -278.      12168.     11021. 1    
 8 1973-08-01 10717.      26.9     11291.     10143. 1    
 9 1973-09-01  9641.      72.4     10214.      9067. 1    
10 1973-10-01 10024.     -85.7     10597.      9450. 1    
# ... with 62 more rows

> lts_all[[2]][["output_fit"]]
# A tibble: 72 x 6
   date        value residuals CI95_upper CI95_lower group
   <date>      <dbl>     <dbl>      <dbl>      <dbl> <chr>
 1 1973-01-01  8509.     498.       9083.      7936. 2    
 2 1973-02-01  8006.      99.5      8580.      7433. 2    
 3 1973-03-01  8864.      64.4      9437.      8290. 2    
 4 1973-04-01  9152.     -15.2      9726.      8579. 2    
 5 1973-05-01  9939.      77.9     10513.      9365. 2    
 6 1973-06-01 10435.     391.      11009.      9861. 2    
 7 1973-07-01 11595.    -278.      12168.     11021. 2    
 8 1973-08-01 10717.      26.9     11291.     10143. 2    
 9 1973-09-01  9641.      72.4     10214.      9067. 2    
10 1973-10-01 10024.     -85.7     10597.      9450. 2    
# ... with 62 more rows

Then

# bring together the fitted respectively forecasting results
output_fit_all <- lapply(lts_all, function(x) x[[1]])
output_fit_all <- bind_rows(output_fit_all)

output_fcst_all <- lapply(lts_all, function(x) x[[2]])
output_fcst_all <- bind_rows(output_fcst_all)

after # Tidy-up the splits where date i got NA. Can you provide your output and # Transform time series to ts objects error Error in if (start > end) stop("'start' cannot be after 'end'") : missing value where TRUE/FALSE needed — psysky, Dec 18 '18 at 08:17
The code runs on the reproducible example you have provided. What I suspect is that you are using it on your entire dataset, which might need some further cleaning. Do you have time series with missing years and/or months? This should not be the case (we need complete data for forecasting). One cause for the error you mention might be that your Date is not correctly formatted. For instance, are all years in the format 'yyyy'? — NRLP, Dec 18 '18 at 09:04
i use my reproducible example that i provided here. df. after i run # Tidy-up the splits, Date column is NA.Why date column became NA? — psysky, Dec 18 '18 at 09:11
i edietd post, please check, here result # Tidy-up the splits — psysky, Dec 18 '18 at 09:13
Very strange. It works for me. I'll edit the post with a `dput` for the first ts. Maybe you have done other transformations and you need to restart your session? — NRLP, Dec 18 '18 at 09:17
i run your dput(Date column is good), but when run # Transform time series to ts objects Error in x$date : $ operator is invalid for atomic vectors — psysky, Dec 18 '18 at 09:43
provide, please dput after # Transform time series to ts objects. I think trouble on my site. — psysky, Dec 18 '18 at 09:44
I have added `dput` for the first ts after # transform time series to ts objects — NRLP, Dec 18 '18 at 09:51
your code works with dput , but i don't know, why # Tidy-up the splits and # Transform time series to ts objects don't work on my df, when i self run it. I will think, now, last question why when i retrun group, i just see froup value. Please check# edit3 in my post. — psysky, Dec 18 '18 at 14:05
I added a sample output in the answer. Here the group values are correct. — NRLP, Dec 18 '18 at 14:19
No, it works for multiple groups (please see edited answer with a sample from the second group). — NRLP, Dec 18 '18 at 14:24
,I am sorry for bothering you with my questions, so you have helped me very coolly, but can you tell me , ets function shows the type of model and the metrics, how can i get it for each group? res_model <- ets(ts_obj, model = "ZZZ") Can you edit post — psysky, Dec 19 '18 at 09:00
I updated the answer to add extra metrics (similarly, you may add other metrics that interest you) — NRLP, Dec 19 '18 at 09:54
If this is not difficult for you, can you prompt me, I wanted to change the code in order to get results not of exponential smoothing, but auto.arima. res_model <- auto.arima(ts_obj), but i also get resut of exponential smoothing. Why? How to modify code to get result of auto.arima and metrics(p,d,q)? — psysky, Dec 21 '18 at 10:02
hello. May i ask you to modify this your code in this topic? https://stackoverflow.com/questions/54093075/adaptation-the-forecast-code-for-several-variables-in-r — psysky, Jan 09 '19 at 13:02
I feel this was a question with dedication ;) I answered it, but I think it is important for you to take ownership of your project. — NRLP, Jan 09 '19 at 15:36

Join residual by group in summary Forecast table in R

edit3

1 Answers1

Linked