I am trying to retain an ID on the row when predicting using a Random Forest model to merge back on to the original dataframe. I am using step_naomit in the recipe that removes the rows with missing data when I bake the training data, but also removes the records with missing data on the testing data. Unfortunately, I don't have an ID to easily know which records were removed so I can accurately merge back on the predictions.
I have tried to add an ID column to the original data, but bake will remove any variable not included in the formula (and I don't want to include ID in the formula). I also thought I may be able to retain the row.names from the original table to merge on, but it appears the row.name is reset upon baking as well.
I realize I can remove the NA values prior to the recipe to solve this problem, but then what is the point of step_naomit in the recipe? I also tried skip=TRUE in the step_naomit, but then I get an error for missing data when fitting the model (only for random forest). I feel I am missing something here in tidymodels that would allow me to retain all the rows prior to baking?
See example:
## R 3.6.1 ON WINDOWS 10 MACHINE
require(tidyverse)
require(tidymodels)
require(ranger)
set.seed(123)
temp <- iris %>%
dplyr::mutate(Petal.Width = case_when(
round(Sepal.Width) %% 2 == 0 ~ NA_real_, ## INTRODUCE NA VALUES
TRUE ~ Petal.Width))
mySplit <- rsample::initial_split(temp, prop = 0.8)
myRecipe <- function(dataFrame) {
recipes::recipe(Petal.Width ~ ., data = dataFrame) %>%
step_naomit(all_numeric()) %>%
prep(data = dataFrame)
}
myPred <- function(mySplit,myRecipe) {
train_set <- training(mySplit)
test_set <- testing(mySplit)
train_prep <- myRecipe(train_set)
analysis_processed <- bake(train_prep, new_data = train_set)
model <- rand_forest(
mode = "regression",
mtry = 3,
trees = 50) %>%
set_engine("ranger", importance = 'impurity') %>%
fit(Sepal.Width ~ ., data=analysis_processed)
test_processed <- bake(train_prep, new_data = test_set)
test_processed %>%
bind_cols(myPrediction = unlist(predict(model,new_data=test_processed)))
}
getPredictions <- myPred(mySplit,myRecipe)
nrow(getPredictions)
## 21 ROWS
max(as.numeric(row.names(getPredictions)))
## 21
nrow(testing(mySplit))
## 29 ROWS
max(as.numeric(row.names(testing(mySplit))))
## 150