I recently started using tidymodels to build a machine learning workflow. When I use the fitted workflow to predict on the test set, it raises the error "Missing data in columns", but I am sure that neither the training set nor the test set has missing data. Here is my code and an example:

# Information about the data: Primary_type in the test set has several novel levels
str(train_sample)
tibble [500,000 x 3] (S3: tbl_df/tbl/data.frame)
 $ ID          : num [1:500000] 6590508 2902772 6162081 7777470 7134849 ...
 $ Primary_type: Factor w/ 29 levels "ARSON","ASSAULT",..: 16 8 3 3 28 7 3 4 25 15 ...
 $ Arrest      : Factor w/ 2 levels "FALSE","TRUE": 2 1 1 1 1 2 1 1 1 1 ...

str(test_sample)
tibble [300,000 x 3] (S3: tbl_df/tbl/data.frame)
 $ ID          : num [1:300000] 8876633 9868538 9210518 9279377 8707153 ...
 $ Primary_type: Factor w/ 32 levels "ARSON","ASSAULT",..: 3 7 31 7 2 8 7 2 31 18 ...
 $ Arrest      : Factor w/ 2 levels "FALSE","TRUE": 1 1 1 1 2 1 1 1 2 2 ...

# set the recipe
rec <- recipe(Arrest ~ ., data = train_sample) %>% 
  update_role(ID, new_role = "ID") %>% 
  step_novel(Primary_type)

# set the model
rf_model <- rand_forest(trees = 10) %>%
  set_engine("ranger", seed = 100, num.threads = 12, verbose = TRUE) %>%
  set_mode("classification")

# set the workflow
wf <- workflow() %>% 
  add_recipe(rec) %>% 
  add_model(rf_model)

# fit the train data
wf_fit <- wf %>% fit(train_sample)

# predict the test data
wf_pred <- wf_fit %>% predict(test_sample)

The prediction raises the following error:

ERROR:Missing data in columns: Primary_type.

However, when I run the steps separately using prep() and bake(), the prediction does not raise an error:

# build the workflow steps separately
train_prep <- prep(rec, training = train_sample)
train_bake <- bake(train_prep, new_data = NULL)
test_bake <- bake(train_prep, new_data = test_sample)

# fit the baked train data
rf_model_fit <- rf_model %>% fit(Arrest ~ Primary_type, train_bake)

# predict the baked test data
rf_model_pred <- rf_model_fit %>% predict(test_bake) # No missing data error

I find that the levels of Primary_type in both baked datasets are identical, which means step_novel() works.

# compare the levels between the baked data sets
identical(levels(train_bake$Primary_type), levels(test_bake$Primary_type))
[1] TRUE
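
To see which levels are actually novel, the raw (unbaked) factor levels can be compared directly. This is only a minimal sketch: `train_type` and `test_type` are made-up stand-ins for `train_sample$Primary_type` and `test_sample$Primary_type`, not the real data.

```r
# Hypothetical stand-ins for the Primary_type columns of the two data sets
train_type <- factor(c("ARSON", "ASSAULT", "ARSON"))
test_type  <- factor(c("ARSON", "STALKING", "ASSAULT"))

# Levels present in the test factor but absent from the training factor
novel_levels <- setdiff(levels(test_type), levels(train_type))
novel_levels
```

Any level listed here was never seen during training, so it is a candidate for being turned into a missing value before the model sees it.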

So why does the prediction fail in the workflow but succeed when I do the steps separately? And where does the missing data come from? Thanks a lot.

Kim.L

1 Answer

I recommend that you check out this advice on "Ordering of Steps", especially the section on handling levels in categorical data. You should use step_novel() before other factor handling operations.

Julia Silge
  • Thanks for your advice! I read the section and changed the order so that `step_novel()` comes right after `update_role()`, but the error is still raised when I use the workflow to predict. Meanwhile, when I split the workflow up and use `prep()` and `bake()` to process the datasets and then predict, there is no error, and I still don't know why. I have rewritten my question more clearly after testing. Thanks! – Kim.L Jun 18 '21 at 04:07
  • Ah OK, check out [this issue](https://github.com/tidymodels/recipes/issues/627) and add `allow_novel_levels = TRUE` to your recipe blueprint, if you want to predict on new factor levels you didn't see during training. Be sure to think through what that means, though, since you didn't see that in training at all. – Julia Silge Jun 18 '21 at 04:34
  • The issue solves my problem! After setting `allow_novel_levels = TRUE` in `add_recipe()`, the missing data error disappears! Thanks a lot! I will keep digging deeper into tidymodels. As you said, thinking through the novel levels in new data is important, but in real-world production some novel levels always appear in new data, and I cannot retrain the model every time they appear, so functions like `step_novel()`, `step_unknown()` and `step_other()` in tidymodels meet my needs well! – Kim.L Jun 18 '21 at 08:27
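
For reference, the fix discussed in the comments can be sketched roughly as follows. This is a hedged illustration, not the asker's exact code: the toy data (`train_toy`, `test_toy`) and the object names are assumptions made up for the example, and it assumes the `ranger` package is installed. The blueprint is built with `hardhat::default_recipe_blueprint()` and passed through the `blueprint` argument of `add_recipe()`.

```r
library(tidymodels)

set.seed(100)

# Toy stand-ins for train_sample / test_sample (hypothetical data).
# The test set contains the level "STALKING", which never occurs in training.
train_toy <- tibble(
  ID = 1:200,
  Primary_type = factor(sample(c("ARSON", "ASSAULT", "THEFT"), 200, replace = TRUE)),
  Arrest = factor(sample(c("FALSE", "TRUE"), 200, replace = TRUE))
)
test_toy <- tibble(
  ID = 201:260,
  Primary_type = factor(sample(c("ARSON", "ASSAULT", "THEFT", "STALKING"), 60, replace = TRUE)),
  Arrest = factor(sample(c("FALSE", "TRUE"), 60, replace = TRUE))
)

# Same recipe and model structure as in the question
rec_toy <- recipe(Arrest ~ ., data = train_toy) %>%
  update_role(ID, new_role = "ID") %>%
  step_novel(Primary_type)

rf_toy <- rand_forest(trees = 10) %>%
  set_engine("ranger", seed = 100) %>%
  set_mode("classification")

# Blueprint that lets unseen factor levels reach the recipe,
# so that step_novel() can map them to its "new" level
bp <- hardhat::default_recipe_blueprint(allow_novel_levels = TRUE)

wf_toy <- workflow() %>%
  add_recipe(rec_toy, blueprint = bp) %>%
  add_model(rf_toy)

wf_toy_fit <- fit(wf_toy, data = train_toy)
wf_toy_pred <- predict(wf_toy_fit, new_data = test_toy)
```

With the blueprint in place, the workflow should behave like the manual `prep()` / `bake()` path in the question. Rows with an unseen level are still predicted from a level the model never trained on, so those predictions deserve extra scrutiny.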