I have trained a churn model with tidymodels on customer data (more than 200 columns). I got fairly good metrics using xgboost, but the issue arises when trying to predict on new data.
The predict function asks for the target variable (churn), which confuses me: this variable is not supposed to be present in real-world data, since it is the variable I want to predict.
Sample code is below; maybe I have misunderstood the procedure. Some questions arose:
Should I execute prep() at the end of the recipe?
Should I apply the recipe to my new data before calling predict()?
Why does removing the recipe lines that involve the target variable make predict() work?
Why does predict() ask for my target variable?
churn_recipe <-
  recipes::recipe(churn ~ ., data = churn_train) %>%
  recipes::step_naomit(everything(), skip = TRUE) %>%
  recipes::step_rm(c(v1, v2, v3, v4, v5, v6)) %>%
  # removing/commenting the next 2 lines makes predict() work
  recipes::step_string2factor(churn) %>%
  themis::step_downsample(churn) %>%
  recipes::step_dummy(all_nominal_predictors()) %>%
  recipes::step_novel(all_nominal(), -all_outcomes()) ### %>% prep()

xgboost_model <-
  parsnip::boost_tree(
    mode = "classification",
    trees = 100
  ) %>%
  set_engine("xgboost") %>%
  set_mode("classification")

xgboost_workflow <-
  workflows::workflow() %>%
  add_recipe(churn_recipe) %>%
  add_model(xgboost_model)

my_fit <- last_fit(xgboost_workflow, churn_split)
collect_metrics(my_fit)

churn_wf_model <- my_fit$.workflow[[1]]
predict(churn_wf_model, new_data[1,])

Error: Can't subset columns that don't exist.
x Column `churn` doesn't exist.
I am pretty sure there are some misconceptions on my side, but I have been unable to solve this issue.
I am stuck moving my model into production, and the tidymodels documentation on this topic is severely lacking.