I have trained and tested a random forest model in R using tidymodels. Now I want to use the same model to make predictions on a completely new dataset (not the training dataset).
For example, Julia Silge explained the steps to train, test, and evaluate a model in this blog post: Julia Silge's Palmer penguins. I want to apply this model to a completely new dataset with the same columns, except the outcome column (here, sex).
Can anyone help me with the code for predicting on a new dataset?
I can explain what I have tried with a sample dataset:
library(tidymodels)      # provides %>%, filter(), select(), initial_split(), workflow(), etc.
library(palmerpenguins)
penguins <- penguins %>%
  filter(!is.na(sex)) %>%
  select(-year, -island)
# Select the first 233 rows for training and testing
penguins_train_test <- penguins[1:233, ]
# Split the remaining rows off from the parent data and treat them as the new
# dataset that needs predictions (not testing). For that reason, the "sex"
# column, which the model should predict, is removed.
penguins_newdata <- penguins[234:333, -6]
set.seed(123)
penguin_split <- initial_split(penguins_train_test, strata = sex)
penguin_train <- training(penguin_split)
penguin_test <- testing(penguin_split)
Creating the model specification and the workflow:
rf_spec <- rand_forest() %>%
  set_mode("classification") %>%
  set_engine("ranger")
penguin_wf <- workflow() %>%
  add_formula(sex ~ .)
Applying it to the test data:
penguin_final <- penguin_wf %>%
  add_model(rf_spec) %>%
  last_fit(penguin_split)
collect_metrics(penguin_final)
Similarly, I tried applying it to the new dataset, "penguins_newdata":
penguins_newdata
penguin_wf %>%
  add_model(rf_spec) %>%
  fit(penguins_newdata)
The result I got is the following error:
Error: The following outcomes were not found in `data`: 'sex'.
I also tried it this way:
fit(penguin_wf, penguins_newdata)
This is the error I got:
Error: The workflow must have a model. Provide one with `add_model()`.
Thank you in advance.