
I have the following code for creating a tidymodels workflow with a lightgbm model. However, I run into problems when I save the fitted workflow to an .rds file, read it back, and try to predict.

library(AmesHousing)
library(treesnip)
library(lightgbm)
library(tidymodels)
tidymodels_prefer()

### Model ###

# data
data <- make_ames() %>%
  janitor::clean_names()

data <- subset(data, select = c(sale_price, bedroom_abv_gr, bsmt_full_bath, bsmt_half_bath, enclosed_porch, fireplaces,
                                full_bath, half_bath, kitchen_abv_gr, garage_area, garage_cars, gr_liv_area, lot_area,
                                lot_frontage, year_built, year_remod_add, year_sold))

data$id <- c(1:nrow(data))

data <- data %>%
  mutate(id = as.character(id)) %>%
  select(id, everything())

# model specification

lgbm_model <- boost_tree(
  mtry = 7,
  trees = 347,
  min_n = 10,
  tree_depth = 12,
  learn_rate = 0.0106430579211173,
  loss_reduction = 0.000337948798058139,
) %>%
  set_mode("regression") %>%
  set_engine("lightgbm", objective = "regression")

# recipe and workflow

lgbm_recipe <- recipe(sale_price ~., data = data) %>%
  update_role(id, new_role = "ID") %>%
  # no prep() here: the workflow trains the recipe itself when fit() is called
  step_corr(all_predictors(), threshold = 0.7)

lgbm_workflow <- workflow() %>% 
  add_recipe(lgbm_recipe) %>%
  add_model(lgbm_model)  
  
# fit workflow

fit_lgbm_workflow <- lgbm_workflow %>%
  fit(data = data)

# predict

data_predict <- subset(data, select = -c(sale_price))
predict(fit_lgbm_workflow, new_data = data_predict)


### CASE 1: Save the workflow with saveRDS()

saveRDS(object = fit_lgbm_workflow, file = "lgbm_workflow.rds")
new_lgbm_workflow <- readRDS(file = "lgbm_workflow.rds")

# Predict - error: Attempting to use a Booster which no longer exists

predict(new_lgbm_workflow, new_data = data_predict)



### CASE 2: Save the workflow and the fitted model separately

fitted_model <- (fit_lgbm_workflow %>% extract_fit_parsnip())$fit
saveRDS(object = fit_lgbm_workflow, file = "lgbm_workflow.rds")
lightgbm::saveRDS.lgb.Booster(object = fitted_model, file = "lgbm_model.rds")


new_lgbm_workflow <- readRDS(file = "lgbm_workflow.rds")
new_lgbm_model <- lightgbm::readRDS.lgb.Booster(file = "lgbm_model.rds")
new_lgbm_workflow$fit$fit <- new_lgbm_model


# Predict - error: cannot predict on data of class ‘tbl_df’‘tbl’‘data.frame’

predict(new_lgbm_workflow, new_data = data_predict)

Only workflows with a lightgbm model seem to have this problem. For other types of models (random forest, xgboost, glm, etc.), I can save the fitted workflow with saveRDS(), read it with readRDS(), and predict on new data just fine.

For Case 2, the underlying predict function apparently becomes predict.lgb.Booster(), which takes a matrix as input. But my id variable is a character column, whereas all columns in a matrix must have the same type.
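A minimal base R sketch of the coercion described above, using a hypothetical two-row data frame (`toy`) rather than the Ames data. It only illustrates why a character `id` column is a problem once the data has to become a matrix; it does not call lightgbm.

```r
# Hypothetical toy data: a character ID next to a numeric predictor
toy <- data.frame(
  id = c("1", "2"),
  gr_liv_area = c(1656, 896),
  stringsAsFactors = FALSE
)

# A matrix holds a single type, so as.matrix() falls back to character
mixed <- as.matrix(toy)
typeof(mixed)

# Dropping the ID column before conversion keeps the matrix numeric
numeric_only <- as.matrix(toy[, "gr_liv_area", drop = FALSE])
typeof(numeric_only)
```

Inside a fitted workflow the recipe handles this, because a column with the "ID" role is not passed to the model as a predictor.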

Is there a way to save the entire workflow for future use?

  • anecdotally, I've never run into issues when using `readr::write_rds()` to save workflow objects - maybe try giving that function a shot – Mark Rieke May 01 '22 at 17:05
  • I haven't had much luck with models from the treesnip package, unfortunately. – Julia Silge May 03 '22 at 20:52
  • @griffinwings Did you ever solve this? I am running into the exact same issue. It's a shame, because this model type is so much faster and more accurate than XGBoost. – nate-m Jul 07 '22 at 20:56
  • @JuliaSilge do you all think you'll do a write-up on best practices with LightGBM via tidymodels/bonsai? – nate-m Jul 07 '22 at 20:57
  • @MarkRieke I was hoping moving to the bonsai package from treesnip would solve this and allow us to use write_rds natively, but no luck. I can write out no problem, the problem lies when you try to read it back in. – nate-m Jul 07 '22 at 20:57
  • @nate-m Sadly, I still haven't found a way around this problem. – griffinwings Aug 01 '22 at 09:54
  • @griffinwings thanks for the response - that is very frustrating. – nate-m Aug 03 '22 at 18:51

2 Answers


After much digging, I found the solution in this closed issue.

library(tidymodels)
library(bonsai)
library(lightgbm)
#> Warning: package 'lightgbm' was built under R version 4.2.1
#> Loading required package: R6
#> 
#> Attaching package: 'lightgbm'
#> The following object is masked from 'package:dplyr':
#> 
#>     slice

# data

data <- modeldata::ames %>%
  janitor::clean_names()

data <- subset(data, select = c(sale_price, bedroom_abv_gr, bsmt_full_bath, bsmt_half_bath, enclosed_porch, fireplaces,
                                full_bath, half_bath, kitchen_abv_gr, garage_area, garage_cars, gr_liv_area, lot_area,
                                lot_frontage, year_built, year_remod_add, year_sold))

data$id <- c(1:nrow(data))

data <- data %>%
  mutate(id = as.character(id)) %>%
  select(id, everything())

# model specification

lgbm_model <- boost_tree(
  mtry = 7,
  trees = 347,
  min_n = 10,
  tree_depth = 12,
  learn_rate = 0.0106430579211173,
  loss_reduction = 0.000337948798058139,
) %>%
  set_mode("regression") %>%
  set_engine("lightgbm", objective = "regression")

# recipe and workflow

lgbm_recipe <- recipe(sale_price ~., data = data) %>%
  update_role(id, new_role = "ID") %>%
  step_corr(all_predictors(), threshold = 0.7)

lgbm_workflow <- workflow(preprocessor = lgbm_recipe,
                          spec = lgbm_model)

# fit workflow

fit_lgbm_workflow <- lgbm_workflow %>%
  fit(data = data)

# predict

data_predict <- subset(data, select = -c(sale_price))
predict(fit_lgbm_workflow, new_data = data_predict)
#> # A tibble: 2,930 × 1
#>      .pred
#>      <dbl>
#>  1 201911.
#>  2 124695.
#>  3 138983.
#>  4 221095.
#>  5 198972.
#>  6 188613.
#>  7 198730.
#>  8 170893.
#>  9 243899.
#> 10 196875.
#> # … with 2,920 more rows

# save the trained workflow and lgb.booster object separately

saveRDS(fit_lgbm_workflow, "lgbm_wflw.rds")
saveRDS.lgb.Booster(extract_fit_engine(fit_lgbm_workflow), "lgbm_booster.rds")

# load trained workflow and merge it with lgb.booster

new_lgbm_wflow <- readRDS("lgbm_wflw.rds")
new_lgbm_wflow$fit$fit$fit <- readRDS.lgb.Booster("lgbm_booster.rds")

predict(new_lgbm_wflow, data_predict)
#> # A tibble: 2,930 × 1
#>      .pred
#>      <dbl>
#>  1 201911.
#>  2 124695.
#>  3 138983.
#>  4 221095.
#>  5 198972.
#>  6 188613.
#>  7 198730.
#>  8 170893.
#>  9 243899.
#> 10 196875.
#> # … with 2,920 more rows

Created on 2022-09-07 with reprex v2.0.2

In my reprex above, I've used a workflow to fit. If you're using a parsnip object to fit, use this approach instead:


saveRDS(bonsai_fit, path1)
saveRDS.lgb.Booster(extract_fit_engine(bonsai_fit), path2)
bonsai_fit_read <- readRDS(path1)
bonsai_fit_engine_read <- readRDS.lgb.Booster(path2)
bonsai_fit_read$fit <- bonsai_fit_engine_read

Refer to this comment for more details.
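The two assignments above differ only in how deeply the booster is nested. The sketch below uses plain lists as mock stand-ins (all `mock_*` names are hypothetical, not real tidymodels classes) to show the layout as I understand it: a fitted workflow keeps the parsnip fit at `$fit$fit`, and the parsnip fit keeps the engine object at `$fit`.

```r
# Mock objects only: plain lists standing in for the real classes
mock_booster <- list(kind = "lgb.Booster")

# A parsnip fit wraps the engine object in its own $fit slot
mock_parsnip_fit <- list(spec = "boost_tree", fit = mock_booster)

# A fitted workflow stores the parsnip fit one level down, at $fit$fit
mock_workflow <- list(pre = list(), fit = list(fit = mock_parsnip_fit))

# Workflow: the booster sits three levels down
mock_workflow$fit$fit$fit$kind

# Parsnip fit: the booster sits one level down
mock_parsnip_fit$fit$kind

# Assigning the booster at $fit$fit replaces the parsnip wrapper itself
broken <- mock_workflow
broken$fit$fit <- mock_booster
is.null(broken$fit$fit$spec)
```

If this layout holds for the real objects, it would also explain Case 2 in the question: assigning to `$fit$fit` swaps the parsnip fit for a bare booster, so `predict()` reaches `predict.lgb.Booster()` with a tibble.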

The silver lining is:

Just want to add to this conversation that since December 2021, {lightgbm}'s development version has supported using readRDS() / saveRDS() directly for {lightgbm} models

Desmond
  • Awesome! I think this is mostly it. A couple of minor tweaks: `saveRDS.lgb.Booster` has been deprecated, so we need to use `lightgbm::lgb.save` and `lightgbm::lgb.load`. For my workflow (select_best after tuning > finalize_workflow > last_fit), the booster lives here: `new_lgbm_wflow$.workflow[[1]]$fit$fit$fit`. With these tweaks and your logic I was able to load the model and predict without a problem! – nate-m Sep 09 '22 at 23:54

I figured out a way to save a lightgbm model for future use. It doesn't use the tidymodels framework; instead, you have to convert the fit into lightgbm's own model format first. The same is true if you want to evaluate variable importance.

Based on the above code:

# Convert to lightgbm booster model
lgb_model <- parsnip::extract_fit_engine(fit_lgbm_workflow) 

# If you want you can now evaluate variable importance. 
# Tidymodels does not support variable importance of lgb via bonsai currently

loss_varimp <- lgb_model %>%
    lgb.importance(.) 

# Save the booster out
lightgbm::lgb.save(lgb_model, filename_x)

# Read the booster back in (assign it, otherwise the loaded model is discarded)
lgb_model_loaded <- lightgbm::lgb.load(filename_x)

I haven't figured out whether you can merge the loaded lightgbm booster back into a tidymodels object, but now you can at least predict, use, and evaluate the model without having to re-fit it each time. Hope this helps, and please post if you find a cleaner or more current solution!
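As a rough check that the round trip preserves the model, here is a small self-contained sketch that uses lightgbm directly, without tidymodels. The synthetic data, the temporary file path, and the variable names are all assumptions for illustration; only `lgb.Dataset()`, `lgb.train()`, `lgb.save()`, `lgb.load()` and `predict()` come from lightgbm.

```r
library(lightgbm)

# Synthetic regression data (illustrative only, not the Ames data)
set.seed(1)
x <- matrix(rnorm(500 * 4), ncol = 4)
y <- 2 * x[, 1] + rnorm(500)

# Train a small booster directly with lightgbm
dtrain <- lgb.Dataset(x, label = y)
booster <- lgb.train(
  params = list(objective = "regression"),
  data = dtrain,
  nrounds = 20L,
  verbose = -1L
)

# Save to lightgbm's text model format, then load it back
model_path <- tempfile(fileext = ".txt")
lgb.save(booster, model_path)
booster_loaded <- lgb.load(filename = model_path)

# Compare predictions from the original and the reloaded booster
pred_before <- predict(booster, x)
pred_after <- predict(booster_loaded, x)
all.equal(pred_before, pred_after)
```

Note that the reloaded booster expects a numeric matrix with the same columns it was trained on, so any recipe preprocessing has to be applied to new data separately.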

nate-m
  • Thanks for sharing this solution. Saving the model to lgb format, however, requires transformation of data when predicting like so - https://github.com/tidymodels/bonsai/issues/45#:~:text=%3D%20penguins)-,new_data,-%3C%2D%0A%20%20%20%20penguins_subset_numeric%20%25. And even then, I'm getting thrown new errors after this transformation. ([LightGBM] [Fatal] The number of features in data (509) is not the same as it was in training data (488).) – Desmond Sep 07 '22 at 01:56
  • Related issues raised: https://github.com/tidymodels/bonsai/issues/44 and https://github.com/tidymodels/stacks/issues/145 – Desmond Sep 07 '22 at 07:56