I am new to R and a bit confused about the steps to follow for a classification
task using tidymodels.
I am using the Kaggle dataset from https://www.kaggle.com/c/home-credit-default-risk to perform classification on its TARGET
variable.
The dataset has both missing and negative values, and a mix of numeric and categorical columns.
ISSUE: I get an error from collect_metrics() after fitting the model in step 4.
Steps I followed:
library(tidyverse)
library(caret)
library(tidymodels)
After some EDA, I followed the steps below:
1. Data Partition - Train / test
set.seed(1234)
data_split <- initial_split(dt1, strata = TARGET)
dt1_train <- training(data_split)
dt1_test <- testing(data_split)
dim(dt1_train)
dim(dt1_test)
########## output ############
[1] 230634 122
[1] 76877 122
2. Recipes
rec2 <- recipe(TARGET ~ ., data = dt1_train) %>%
  step_rm(contains("SK_ID_CURR")) %>%           # removing id var
  step_medianimpute(all_numeric()) %>%
  step_modeimpute(all_nominal()) %>%
  step_dummy(all_nominal(), -all_outcomes()) %>%
  step_range(all_numeric()) %>%                 # to convert negative numbers into range 0 to 1
  step_BoxCox(all_numeric()) %>%                # as boxcox transformation works on positive numbers only
  step_normalize(all_numeric()) %>%
  step_zv(all_numeric()) %>%
  step_nzv(all_numeric()) %>%
  step_corr(all_numeric())
3. RFE
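Sampling - to reduce data size for faster RFE results:
prepd_rec2 <- prep(rec2, training = dt1_train)             # estimate the recipe on the training set
dt1_train_baked <- bake(prepd_rec2, new_data = dt1_train)  # apply the prepped recipe
dt1_train_baked_sample <- dt1_train_baked %>% sample_frac(0.05)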
dim(dt1_train_baked_sample)
######## output ##########
[1] 11532 72
control <- rfeControl(functions = rfFuncs, method = "cv", verbose = FALSE)
system.time(
  RFE_res <- rfe(
    x = subset(dt1_train_baked_sample, select = -TARGET),
    y = dt1_train_baked_sample$TARGET,
    sizes = c(7, 15, 20),
    rfeControl = control
  )
)
RFE_res$optVariables[1:15]
######## output ##########
[1] "EXT_SOURCE_2" "EXT_SOURCE_1"
[3] "DAYS_BIRTH" "AMT_INCOME_TOTAL"
[5] "CODE_GENDER_M" "DAYS_ID_PUBLISH"
[7] "AMT_CREDIT" "REG_CITY_NOT_WORK_CITY"
[9] "CNT_FAM_MEMBERS" "AMT_ANNUITY"
[11] "NAME_EDUCATION_TYPE_Higher.education" "REGION_POPULATION_RELATIVE"
[13] "NAME_EDUCATION_TYPE_Secondary...secondary.special" "REG_CITY_NOT_LIVE_CITY"
[15] "DAYS_REGISTRATION"
4. Model Building
knn_Spec <- nearest_neighbor() %>%
  set_engine("kknn") %>%
  set_mode("classification")
knn_Spec
knn_fit <- knn_Spec %>%
  fit(TARGET ~ EXT_SOURCE_2 + EXT_SOURCE_1 + DAYS_BIRTH + AMT_INCOME_TOTAL +
        CODE_GENDER_M + DAYS_ID_PUBLISH + AMT_CREDIT + REG_CITY_NOT_WORK_CITY +
        CNT_FAM_MEMBERS + AMT_ANNUITY + NAME_EDUCATION_TYPE_Higher.education +
        REGION_POPULATION_RELATIVE +
        NAME_EDUCATION_TYPE_Secondary...secondary.special +
        REG_CITY_NOT_LIVE_CITY + DAYS_REGISTRATION,
      data = dt1_train_baked)
knn_fit
knn_fit %>% collect_metrics()
Error: No `collect_metric()` exists for this type of object
I am not sure how to get results such as accuracy, specificity, sensitivity, and predictions/class probabilities from this fitted model.
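To make concrete which quantities I mean, here is a minimal base-R sketch on made-up toy vectors (not values from this dataset). It builds a confusion table and derives accuracy, sensitivity, and specificity from it. The names truth, pred, and conf_mat, and the choice of "1" as the positive class, are assumptions for illustration only, and this is not meant as the tidymodels way of doing it.

```r
# Toy data only: hypothetical labels, not taken from the Home Credit dataset
truth <- factor(c(1, 0, 1, 1, 0, 0, 1, 0), levels = c(0, 1))
pred  <- factor(c(1, 0, 0, 1, 0, 1, 1, 0), levels = c(0, 1))

# Confusion table: rows = predicted class, columns = actual class
conf_mat <- table(predicted = pred, actual = truth)

# Treat "1" as the positive class (an assumption for this sketch)
tp <- conf_mat["1", "1"]  # predicted 1, actually 1
tn <- conf_mat["0", "0"]  # predicted 0, actually 0
fp <- conf_mat["1", "0"]  # predicted 1, actually 0
fn <- conf_mat["0", "1"]  # predicted 0, actually 1

accuracy    <- (tp + tn) / sum(conf_mat)  # share of all cases classified correctly
sensitivity <- tp / (tp + fn)             # share of actual positives found
specificity <- tn / (tn + fp)             # share of actual negatives found

print(c(accuracy = accuracy, sensitivity = sensitivity, specificity = specificity))
```

What I am looking for is the tidymodels equivalent of these numbers for knn_fit, ideally computed on the test set.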
I also tried the code below, but it gives an error as well:
knn_workflow <- workflow() %>%
  add_recipe(rec2) %>%
  add_model(knn_fit)
Error: `spec` must be a `model_spec`
knn_workflow <- workflow() %>%
  add_recipe(rec2) %>%
  add_model(knn_Spec)
knn_workflow %>% collect_metrics()
Error: No `collect_metric()` exists for this type of object.