
I have packaged a massive time series workflow (4273 × 10 models, for 4273 weekly time series) in drake.

Originally I attempted to create the full workflow using the fable package, which is quite handy for training models on grouped tsibbles, but after several trials I ran into many memory-management problems. My RStudio server, with 32 cores and 244 GB of RAM, was crashing constantly, especially when I tried to serialize the models.

Because of that I completely split my workflow in order to identify bottlenecks, going from:

(screenshot of the drake workflow graph)

To:

(screenshot of the drake workflow graph)

Then to:

(screenshot of the drake workflow graph)

And finally to:

(screenshot of the drake workflow graph)

Inside my training code (example: prophet_multiplicative) I am using the future package to train these multiple fable models in parallel, then calculate their accuracy and save them. However, I am not sure how to remove these objects from the drake workflow afterwards:

  • Should I just remove the object using rm?
  • Is there any way in drake to have separate environments for each of the workflow components?
  • Is this the right solution?

My idea is to run each individual technique serially, while the 4273 models for one specific technique are trained in parallel. That way I expect not to crash the server; after all of my models are trained I can join the accuracy metrics, pick the best model for each time series, and then trim each of the individual binary files so I can produce the forecasts.

Any suggestions to my approach are more than welcome. Please notice that I am quite constrained in hardware resources so getting a bigger server is not an option.

BR /E

tfkLSTM
3 Answers


There is always a tradeoff between memory and speed. To conserve memory, we have to unload some targets from the session, which often requires us to take the time to read them in from storage later on. The default behavior of drake is to favor speed. So in your case, I would set memory_strategy = "autoclean" and garbage_collection = TRUE in make() and related functions. The user manual has a chapter devoted to memory management: https://books.ropensci.org/drake/memory.html.

Also, I recommend returning small targets when possible. So instead of returning an entire fitted model, you could instead return a small data frame of model summaries, which will be kinder to both memory and storage. On top of that, you could choose one of the specialized storage formats at https://books.ropensci.org/drake/plans.html#special-data-formats-for-targets to gain even more efficiency.
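As a minimal sketch of that last suggestion, a plan can request the specialized qs format directly in drake_plan(), so drake handles serialization instead of a custom save step. The target name and summarize_models() helper below are hypothetical placeholders, not from your plan:

```r
library(drake)

# Hedged sketch: let drake store a target with the qs package.
# "model_summary" and summarize_models() are placeholder names.
plan <- drake_plan(
  model_summary = target(
    summarize_models(),  # ideally returns a small summary data frame, not a full mable
    format = "qs"        # drake serializes this target with qs
  )
)
```

With this approach the target stays an ordinary R object in downstream commands, while drake writes it to the cache in qs format.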

landau

garbage_collection = TRUE is already set; I will try adding autoclean. Regarding the file formats, I am saving my models as .qs files with the qs library using the function saveModels:

saveModels <- function(models, directory_out, max_forecasting_horizon, max_multisession_cores) {
  print("Saving the all-mighty mable")
  qs::qsave(x = models,
            file = paste0(directory_out, attributes(models)$model, "_horizon_",
                          max_forecasting_horizon, ".qs"),
            nthreads = max_multisession_cores)
  # saveRDS(object = models, file = paste0(directory_out, "ts_models_horizon_", max_forecasting_horizon, ".rds"))
  print("End workflow")
}

In my plan this is used as:

  prophet_multiplicative = trainModels(input_data = processed_data,
                                       max_forecast_horizon = argument_parser$horizon,
                                       max_multisession_cores = 6,
                                       model_type = "prophet_multiplicative"),
  accuracy_prophet_multiplicative = accuracy_explorer(type = "train", models = prophet_multiplicative,
                                                      max_forecast_horizon = argument_parser$horizon,
                                                      directory_out = "/data1/my_folder/"),
  saving_prophet_multiplicative = saveModels(models = prophet_multiplicative,
                                             directory_out = "/data1/my_folder/",
                                             max_forecasting_horizon = argument_parser$horizon,
                                             max_multisession_cores = 6)

My make() call after your suggestion is as follows:

make(plan = plan, verbose = 2, 
     log_progress = TRUE,
     recover = TRUE,
     lock_envir = FALSE,
     garbage_collection = TRUE,
     memory_strategy = "autoclean")

Any suggestions?

BR

/E

  • Let me know if you still run into memory issues even after autoclean. You might also look at the "none" memory strategy if you really want to take full manual control. One of the advantages of `drake` is that it abstracts files as R objects and manages storage for you. So if you feel like it, one alternative to custom qsave() calls is `drake_plan(target(your_target, your_command(), format = "qs"))`. – landau Jul 04 '20 at 19:58
  • Then again, if you continue to run into memory issues, a completely opposite tactic from `target(format = "qs")` is to use [dynamic files](https://books.ropensci.org/drake/plans.html#dynamic-files) for everything. With dynamic files, `drake` only keeps the file path in memory, not the object itself, but it is up to you to manually read the object into memory for each target that uses it. – landau Jul 04 '20 at 19:59
  • Hi landau, for some reason my last answer was not posted after your comment. Now I am having a new problem I think that the function train models is not evaluating correctly after adding auto_clean. – tfkLSTM Jul 05 '20 at 13:48
  • Does the function behave normally outside drake? If so, would you post a reproducible example to a different thread? Sounds like an issue I will need to run myself in order to troubleshoot. – landau Jul 05 '20 at 13:53
  • Hi Landau, Yes the function works perfectly outside drake. I will add the example. – tfkLSTM Jul 05 '20 at 14:18
  • Would you open a brand new Stack Overflow question or GitHub issue for it? I do not get notified when you post a new answer/solution as a reply on this page. – landau Jul 05 '20 at 14:54

Thank you for the quick answers, I really appreciate it. Now I am facing another problem: I let the script run overnight via nohup and found the following in the logs:

[1] "DB PROD Connected"
[1] "DB PROD Connected"
[1] "Getting RAW data"
[1] "Maximum forecasting horizon is 52, fetching weekly data"
[1] "Removing duplicates if we have them"
[1] "Original data has 1860590 rows"
[1] "Data without duplicates has 1837995 rows"
`summarise()` regrouping output by 'A', 'B' (override with `.groups` argument)
[1] "Removing non active customers"
[1] "Data without duplicates and without active customers has 1654483 rows"
0.398 sec elapsed
[1] "Removing customers with last data older than 1.5 years"
[1] "Data without duplicates, customers that are not active and old customers has 1268610 rows"
0.223 sec elapsed
[1] "Augmenting data"
12.103 sec elapsed
[1] "Creating tsibble"
7.185 sec elapsed
[1] "Filling gaps for not breaking groups"
9.568 sec elapsed
[1] "Training theta models for forecasting horizon 52"
[1] "Using 12 sessions from as future::plan()"
Repacking large object
[1] "Training auto_arima models for forecasting horizon 52"
[1] "Using 12 sessions from as future::plan()"
Error: target auto_arima failed.
diagnose(auto_arima)error$message:
  object 'ts_models' not found
diagnose(auto_arima)error$calls:
  1. └─global::trainModels(...)
In addition: Warning message:
9 errors (2 unique) encountered for theta
[3] function cannot be evaluated at initial parameters
[6] Not enough data to estimate this ETS model.

Execution halted
            

The object ts_models is created in my training scripts and is basically what my function trainModels returns. It seems to me that maybe the input data parameter is being cleaned up, and that's the reason why it fails?

Another question: for some reason my model does not get saved after training the theta models. Is there any way to tell drake not to jump to the next model until it has calculated the accuracy of one and saved the .qs file?

My training function is as follows:

trainModels <- function(input_data, max_forecast_horizon, model_type, max_multisession_cores) {

  options(future.globals.maxSize = 1500000000)
  future::plan(multisession, workers = max_multisession_cores) #breaking infrastructure once again ;)
  set.seed(666) # reproducibility
  
    if(max_forecast_horizon <= 104) {
      
      print(paste0("Training ", model_type, " models for forecasting horizon ", max_forecast_horizon))
      print(paste0("Using ", max_multisession_cores, " sessions from as future::plan()"))
      
      if(model_type == "prophet_multiplicative") {
        
        ts_models <- input_data %>% model(prophet = fable.prophet::prophet(snsr_val_clean ~ season("week", 2, type = "multiplicative") + 
                                                                             season("month", 2, type = "multiplicative")))
        
      } else if(model_type == "prophet_additive") {
        
        ts_models <- input_data %>% model(prophet = fable.prophet::prophet(snsr_val_clean ~ season("week", 2, type = "additive") + 
                                                                             season("month", 2, type = "additive")))
        
      } else if(model_type == "auto.arima") {
        
        ts_models <- input_data %>% model(auto_arima = ARIMA(snsr_val_clean))
        
      } else if(model_type == "arima_with_yearly_fourier_components") {
        
        ts_models <- input_data %>% model(auto_arima_yf = ARIMA(snsr_val_clean ~ fourier("year", K = 2)))
        
      } else if(model_type == "arima_with_monthly_fourier_components") {
        
        ts_models <- input_data %>% model(auto_arima_mf = ARIMA(snsr_val_clean ~ fourier("month", K=2)))
        
      } else if(model_type == "regression_with_arima_errors") {
        
        ts_models <- input_data %>% model(auto_arima_mf_reg = ARIMA(snsr_val_clean ~ month + year  + quarter + qday + yday + week))
        
      } else if(model_type == "tslm") {
    
        ts_models <- input_data %>% model(tslm_reg_all = TSLM(snsr_val_clean ~ year  + quarter + month + day + qday + yday + week + trend()))
     
      } else if(model_type == "theta") {
        
        ts_models <- input_data %>% model(theta = THETA(snsr_val_clean ~ season()))
        
      } else if(model_type == "ensemble") {
        
        ts_models <- input_data %>% model(ensemble = combination_model(
          ARIMA(snsr_val_clean),
          ARIMA(snsr_val_clean ~ fourier("month", K = 2)),
          fable.prophet::prophet(snsr_val_clean ~ season("week", 2, type = "multiplicative") +
                                   season("month", 2, type = "multiplicative")),
          THETA(snsr_val_clean ~ season()),
          TSLM(snsr_val_clean ~ year + quarter + month + day + qday + yday + week + trend())
        ))
        
      }
      
    } 
  
    else if(max_forecast_horizon > 104) {
      
        print(paste0("Training ", model_type, " models for forecasting horizon ", max_forecast_horizon))
        print(paste0("Using ", max_multisession_cores, " sessions from as future::plan()"))
        
        
        if(model_type == "prophet_multiplicative") {
          
          ts_models <- input_data %>% model(prophet = fable.prophet::prophet(snsr_val_clean ~ season("month", 2, type = "multiplicative") + 
                                                                               season("year", 2, type = "multiplicative")))
          
        } else if(model_type == "prophet_additive") {
          
          ts_models <- input_data %>% model(prophet = fable.prophet::prophet(snsr_val_clean ~ season("month", 2, type = "additive") + 
                                                                               season("year", 2, type = "additive")))
          
        } else if(model_type == "auto.arima") {
          
          ts_models <- input_data %>% model(auto_arima = ARIMA(snsr_val_clean))
          
        } else if(model_type == "arima_with_yearly_fourier_components") {
          
          ts_models <- input_data %>% model(auto_arima_yf = ARIMA(snsr_val_clean ~ fourier("year", K = 2)))
          
        } else if(model_type == "arima_with_monthly_fourier_components") {
          
          ts_models <- input_data %>% model(auto_arima_mf = ARIMA(snsr_val_clean ~ fourier("month", K=2)))
          
        } else if(model_type == "regression_with_arima_errors") {
          
          ts_models <- input_data %>% model(auto_arima_mf_reg = ARIMA(snsr_val_clean ~ month + year  + quarter + qday + yday))
          
        } else if(model_type == "tslm") {
          
          ts_models <- input_data %>% model(tslm_reg_all = TSLM(snsr_val_clean ~ year  + quarter + month + day + qday + yday + trend()))
          
        } else if(model_type == "theta") {
          
          ts_models <- input_data %>% model(theta = THETA(snsr_val_clean ~ season()))
          
        } else if(model_type == "ensemble") {
          
          ts_models <- input_data %>% model(ensemble = combination_model(
            ARIMA(snsr_val_clean),
            ARIMA(snsr_val_clean ~ fourier("month", K = 2)),
            fable.prophet::prophet(snsr_val_clean ~ season("month", 2, type = "multiplicative") +
                                     season("year", 2, type = "multiplicative")),
            THETA(snsr_val_clean ~ season()),
            TSLM(snsr_val_clean ~ year + quarter + month + day + qday + yday + trend())
          ))
          
        }
    }
  
  return(ts_models)
}
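As an aside, the long if/else chain above could be condensed into a lookup table of model specifications. A hedged sketch below, assuming fable, fable.prophet, and magrittr are loaded and the same column names as in trainModels(); model_specs and train_one are hypothetical helper names, not part of the original code:

```r
# Hedged sketch: dispatch on model_type via a named list instead of if/else.
# Assumes fable/fable.prophet are loaded and the columns used in trainModels();
# model_specs and train_one are placeholder names.
model_specs <- list(
  auto.arima = function(data) data %>% model(auto_arima = ARIMA(snsr_val_clean)),
  theta      = function(data) data %>% model(theta = THETA(snsr_val_clean ~ season()))
)

train_one <- function(input_data, model_type) {
  spec <- model_specs[[model_type]]
  if (is.null(spec)) stop("Unknown model_type: ", model_type)
  spec(input_data)
}
```

Adding a new technique then means adding one entry to the list rather than two parallel if/else branches.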

BR /E

  • Just seeing this now. To troubleshoot, it would really help to see a scaled down version of the entire project which reproduces the error. The function looks good at a glance, but I also need to see the plan and other contextual code and be able to run the whole thing myself. – landau Jul 05 '20 at 19:00
  • Hi Landau: https://github.com/ropensci/drake/issues/1293, plan in detail there – tfkLSTM Jul 05 '20 at 19:12
  • Thanks, I will take a look. – landau Jul 05 '20 at 19:22
  • Thanks to you, really awesome support, very happy to use drake! – tfkLSTM Jul 05 '20 at 19:24