
I am attempting to run a time series cross-validation ML tuning process on a Spark cluster (sparklyr on Databricks), but am getting an error. The packages I'm using are tidymodels with modeltime. The code works perfectly fine on a local machine, but fails at the 'workflow_map()' call when running on Spark. The purpose of this function is to train each model on several time series 'folds', which are defined by the time_series_cv() function. I cannot debug this for the life of me because the Spark error trace is uninformative. Does anyone know why this would work locally but not on Spark? I'm somewhat new to working with clusters, so I could be overlooking something simple.

If it is a package limitation, does anyone know if there is an alternative way to do the 'resampling' CV on Spark, where you can train each model on several non-overlapping time 'slices' in the series? (A rough sketch of the kind of manual approach I have in mind is after the error trace below.) Thank you in advance.

# Define CV Schema
cv_folds <- time_series_cv(
  data        = train_tbl,
  assess      = "6 months",
  initial     = "4 years",
  skip        = "1 months",
  slice_limit = 20
)
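
For reference, the fold layout can be inspected locally with the timetk helpers (dv here is the string holding my target column name, same as in the recipe formula below):

# Sanity check of the CV plan (run locally)
library(timetk)

cv_folds %>%
  tk_time_series_cv_plan() %>%
  plot_time_series_cv_plan(date, !!sym(dv), .interactive = FALSE)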

# Create Preprocessing recipe
recipe_spec_lag <- recipe(formula(paste0(dv, ' ~ .')), data = train_tbl) %>%
  step_dummy(all_nominal()) %>%
  step_rm(date) %>%
  step_zv(all_predictors())
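
In case it's useful, the recipe itself can be verified locally with a quick prep/bake:

# Local check that the recipe preps and bakes cleanly on the training data
recipe_spec_lag %>%
  prep() %>%
  bake(new_data = NULL) %>%
  glimpse()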

# Create hyperparameter grid
grid_tbl_xgb <- grid_regular(
  learn_rate(),
  trees(),
  levels = 3
) 

grid_tbl_xgb <- grid_tbl_xgb %>%
  create_model_grid( 
    f_model_spec = boost_tree,
    engine_name  = "spark", #also tried this with engine_name = 'xgboost'
    mode         = "regression",
    engine_params = list(max_depth = 5)
  )
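
For reference, create_model_grid() attaches one parsnip spec per hyperparameter row in a .models column (9 specs with levels = 3), which is what gets passed to workflow_set() below:

# Each grid row now carries one boost_tree spec in .models
grid_tbl_xgb
grid_tbl_xgb$.models[[1]]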

# Define workflow
model_wfset <- workflow_set(
  preproc = list(
    recipe_spec_lag
  ),
  models = grid_tbl_xgb$.models, 
  cross = TRUE
) 
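
The set crosses the single recipe with the 9 specs, giving 9 workflows, which is where the "1 of 9" in the log below comes from:

# One recipe x 9 model specs = 9 workflows
model_wfset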

# Error here (works locally but fails on Databricks)
# Train models across grid and CV folds
test <- workflow_map(model_wfset, fn = "fit_resamples", resamples = cv_folds)
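
If it helps with suggestions, more per-resample logging can be requested by passing a control object through workflow_map() to fit_resamples() (control_resamples() is from tune), e.g.:

# Same call, with verbose resampling control passed through to fit_resamples()
test <- workflow_map(
  model_wfset,
  fn        = "fit_resamples",
  resamples = cv_folds,
  control   = control_resamples(verbose = TRUE)
)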

I get the following error:

i 1 of 9 resampling: recipe_boost_tree_1

✖ 1 of 9 resampling: recipe_boost_tree_1 failed with: org.apache.spark.SparkException: Job aborted due to stage failure: Task 6 in stage 211.0 failed 4 times, most recent failure: Lost task 6.3 in stage 211.0 (TID 1510) (192.18.29.13 executor 0): java.lang.Exception: sparklyr worker rscript failure with status 255, check worker logs for details.
    at sparklyr.Rscript.init(rscript.scala:83)
    at sparklyr.WorkerApply$$anon$2.run(workerapply.scala:138)
Driver stacktrace:
    at org.apache.spark.scheduler.DAGScheduler.failJobAndIndependentStages(DAGScheduler.scala:2984)
    at org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$2(DAGScheduler.scala:2931)
    at org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$2$adapted(DAGScheduler.scala:2925)
    at scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:62)
    at scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:55)
    at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:49)
    at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:2925)
    at org.apache.spark.scheduler.DAGScheduler.$anonfun$handleTaskSetFailed$1(DAGScheduler.scala:1345)
    at org.apache.spark.scheduler.DAGScheduler.$anonfun$handleTaskSetFailed$1$adapted(DAGScheduler.scala:1345)
    at scala.Option.foreach(Option.scala:407)
    at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:1345)
    at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:3193)
    at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:3134)
    at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:3122)
    at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:49)
Caused by: java.lang.Exception: sparklyr worker rscript failure with status 255, check worker logs for details.
    at sparklyr.Rscript.init(rscript.scala:83)
    at sparklyr.WorkerApply$$anon$2.run(workerapply.scala:138)
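
To clarify what I mean by an alternative: if the workflow_map() route is simply not supported on Spark, I'm imagining a manual loop over the slices, pushing each one up to Spark and fitting/scoring per slice, roughly like the sketch below. The connection name sc and the table names cv_train_i/cv_test_i are placeholders, the tuning grid is dropped for brevity, and I haven't verified the predict/collect step on Databricks, so treat this as a sketch of the idea rather than working code.

library(sparklyr)
library(tidymodels)

# sc = existing Databricks Spark connection (placeholder name)
# Single spec for illustration; the tuning grid is dropped to keep the sketch short
spark_boost_spec <- boost_tree(mode = "regression") %>%
  set_engine("spark")

slice_results <- purrr::map_dfr(seq_along(cv_folds$splits), function(i) {
  split <- cv_folds$splits[[i]]

  # Prep the recipe on this slice's analysis set only, then bake both sets
  prepped     <- prep(recipe_spec_lag, training = rsample::analysis(split))
  train_slice <- bake(prepped, new_data = rsample::analysis(split))
  test_slice  <- bake(prepped, new_data = rsample::assessment(split))

  # Push each slice up to Spark and fit there
  train_sdf <- copy_to(sc, train_slice, name = paste0("cv_train_", i), overwrite = TRUE)
  test_sdf  <- copy_to(sc, test_slice,  name = paste0("cv_test_", i),  overwrite = TRUE)

  fit_slice <- fit(spark_boost_spec, formula(paste0(dv, " ~ .")), data = train_sdf)

  # I believe predictions come back as a Spark table; collect and score locally
  # (assumes row order is preserved through Spark, which may need checking)
  preds <- predict(fit_slice, new_data = test_sdf) %>% collect()

  tibble(
    slice = i,
    rmse  = yardstick::rmse_vec(
      truth    = test_slice[[dv]],
      estimate = preds[[1]]  # prediction column name may be pred rather than .pred on Spark
    )
  )
})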

  • Currently, resampling in tidymodels isn't supported for Spark, mainly because finding the number of rows is not a simple task for a Spark dataframe. You may already be familiar with this, but [this is a good resource](https://spark.rstudio.com/guides/tidymodels.html) for what is possible in Spark right now. – Julia Silge Apr 14 '22 at 22:59
  • Hi @skklogw7, did you have any success with multiple timeseries forecasting with modeltime and spark? I am researching using this approach but it looks as though there's a roadblock here maybe? – TheGoat Nov 03 '22 at 00:08
