I am new to R and a bit confused about the steps to follow for a classification task using tidymodels.

I am using the Kaggle dataset from https://www.kaggle.com/c/home-credit-default-risk to perform classification on the TARGET variable.

This dataset has both missing and negative values, and a mix of numeric and categorical data.

ISSUE: I get an error from collect_metrics() after fitting the model in step 4.

Steps I followed:

library(tidyverse)
library(caret)
library(tidymodels)

After some EDA, I followed the steps below:

1. Data Partition - Train / test

set.seed(1234)
data_split <- initial_split(dt1, strata = TARGET)

dt1_train <- training(data_split)
dt1_test <- testing(data_split)
dim(dt1_train)
dim(dt1_test)

########## output ############
[1] 230634    122
[1] 76877   122

2. Recipes

rec2 <- recipe(TARGET ~ ., data = dt1_train) %>% 
    step_rm(contains("SK_ID_CURR")) %>% # removing id var
    step_medianimpute(all_numeric()) %>% 
    step_modeimpute(all_nominal()) %>% 
    step_dummy(all_nominal(), - all_outcomes()) %>% 
    step_range(all_numeric()) %>% # to convert negative numbers into range 0 to 1 
    step_BoxCox(all_numeric()) %>% # as boxcox transformation works on positive numbers only
    step_normalize(all_numeric()) %>% 
    step_zv(all_numeric()) %>% 
    step_nzv(all_numeric()) %>%
    step_corr(all_numeric())
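
A recipe only describes the preprocessing; it has to be trained with `prep()` before `bake()` can apply it. The `prepd_rec2` object used in step 3 presumably comes from a call of this kind. A minimal sketch of that pattern, assuming the `two_class_dat` data from the modeldata package as a stand-in for the Home Credit data (the object names `rec`, `prepd_rec`, `baked_train` and `baked_test` are illustrative only):

```r
library(tidymodels)

# Stand-in data (assumption): two numeric predictors A, B and a factor outcome Class
data(two_class_dat, package = "modeldata")

set.seed(1234)
split <- initial_split(two_class_dat, strata = Class)
train <- training(split)
test  <- testing(split)

# Define the preprocessing steps (nothing is computed yet)
rec <- recipe(Class ~ ., data = train) %>%
  step_normalize(all_predictors())

# prep() estimates the means and standard deviations from the training set
prepd_rec <- prep(rec, training = train)

# bake() applies the trained steps; new_data = NULL returns the processed training set
baked_train <- bake(prepd_rec, new_data = NULL)
baked_test  <- bake(prepd_rec, new_data = test)
```

The test set is baked with the statistics estimated on the training set, so no information from the test rows leaks into the preprocessing.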

3. RFE

Sampling, to reduce the data size for faster RFE results:

dt1_train_baked_sample <- bake(prepd_rec2, new_data = dt1_train) %>% sample_frac(0.05)

dim(dt1_train_baked_sample)

######## output ##########
[1] 11532    72

control <- rfeControl(functions = rfFuncs, method = "cv", verbose = FALSE)

system.time(
  RFE_res <- rfe(x = subset(dt1_train_baked_sample, select = -TARGET),
                 y = dt1_train_baked_sample$TARGET,
                 sizes = c(7, 15, 20),
                 rfeControl = control)
)

RFE_res$optVariables[1:15]

######## output ##########
 [1] "EXT_SOURCE_2"                                      "EXT_SOURCE_1"                                     
 [3] "DAYS_BIRTH"                                        "AMT_INCOME_TOTAL"                                 
 [5] "CODE_GENDER_M"                                     "DAYS_ID_PUBLISH"                                  
 [7] "AMT_CREDIT"                                        "REG_CITY_NOT_WORK_CITY"                           
 [9] "CNT_FAM_MEMBERS"                                   "AMT_ANNUITY"                                      
[11] "NAME_EDUCATION_TYPE_Higher.education"              "REGION_POPULATION_RELATIVE"                       
[13] "NAME_EDUCATION_TYPE_Secondary...secondary.special" "REG_CITY_NOT_LIVE_CITY"                           
[15] "DAYS_REGISTRATION" 
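
The selected names do not have to be typed into a model formula by hand. A small sketch using base R's `reformulate()`, assuming a character vector like the one returned by `RFE_res$optVariables` (only the first three names are repeated here, and `selected_vars` and `rfe_formula` are illustrative names):

```r
# Stand-in for RFE_res$optVariables[1:15]; shortened to three names for illustration
selected_vars <- c("EXT_SOURCE_2", "EXT_SOURCE_1", "DAYS_BIRTH")

# Builds the formula TARGET ~ EXT_SOURCE_2 + EXT_SOURCE_1 + DAYS_BIRTH
rfe_formula <- reformulate(selected_vars, response = "TARGET")
rfe_formula
```

The resulting formula object can be passed wherever a literal formula is accepted, for example as the formula argument of `fit()` on a parsnip model specification.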

4. Model Building

knn_Spec <- nearest_neighbor() %>% 
  set_engine("kknn") %>% 
  set_mode("classification")

knn_Spec

knn_fit <- knn_Spec %>%
  fit(TARGET ~ EXT_SOURCE_2 + EXT_SOURCE_1 + DAYS_BIRTH + AMT_INCOME_TOTAL +
        CODE_GENDER_M + DAYS_ID_PUBLISH + AMT_CREDIT + REG_CITY_NOT_WORK_CITY +
        CNT_FAM_MEMBERS + AMT_ANNUITY + NAME_EDUCATION_TYPE_Higher.education +
        REGION_POPULATION_RELATIVE +
        NAME_EDUCATION_TYPE_Secondary...secondary.special +
        REG_CITY_NOT_LIVE_CITY + DAYS_REGISTRATION,
      data = dt1_train_baked)

knn_fit
knn_fit %>% collect_metrics()

Error: No `collect_metric()` exists for this type of object

I am not sure how to get results like accuracy, specificity, sensitivity, and predictions/probabilities from this.
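
For a model fitted directly with `fit()`, one option is to call `predict()` on the fitted object and compute the metrics with yardstick. The following is a minimal sketch, not the actual Home Credit pipeline: it assumes the `two_class_dat` data from the modeldata package as a stand-in, requires the kknn package to be installed, and uses illustrative object names (`results`, `class_metrics`).

```r
library(tidymodels)

# Stand-in data (assumption): factor outcome Class with levels Class1 / Class2
data(two_class_dat, package = "modeldata")

set.seed(1234)
split <- initial_split(two_class_dat, strata = Class)
train <- training(split)
test  <- testing(split)

knn_spec <- nearest_neighbor() %>%
  set_engine("kknn") %>%
  set_mode("classification")

knn_fit <- fit(knn_spec, Class ~ ., data = train)

# Combine the true outcome with class predictions and class probabilities
results <- bind_cols(
  select(test, Class),
  predict(knn_fit, new_data = test),                 # adds .pred_class
  predict(knn_fit, new_data = test, type = "prob")   # adds .pred_Class1, .pred_Class2
)

# Accuracy, sensitivity and specificity on the hard class predictions
class_metrics <- metric_set(accuracy, sensitivity, specificity)
class_metrics(results, truth = Class, estimate = .pred_class)

# ROC AUC on the predicted probability of the first factor level
roc_auc(results, truth = Class, .pred_Class1)
```

The full names `sensitivity` and `specificity` are used instead of `sens` and `spec` to avoid a name clash with `readr::spec()` when tidyverse is also attached.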

I also tried the code below, but it gives an error:

knn_workflow <- workflow() %>% 
  add_recipe(rec2) %>% 
  add_model(knn_fit)

Error: `spec` must be a `model_spec`

knn_workflow <- workflow() %>% 
  add_recipe(rec2) %>% 
  add_model(knn_Spec)

knn_workflow %>% collect_metrics()

Error: No `collect_metric()` exists for this type of object.
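
The `collect_*()` functions work on the output of resampling or tuning functions rather than on a workflow or a single fitted model. A minimal sketch of that route, again assuming the `two_class_dat` stand-in data from the modeldata package, an installed kknn package, and illustrative object names (`folds`, `knn_res`, `final_res`):

```r
library(tidymodels)

# Stand-in data (assumption): factor outcome Class, numeric predictors A and B
data(two_class_dat, package = "modeldata")

set.seed(1234)
split <- initial_split(two_class_dat, strata = Class)
train <- training(split)
folds <- vfold_cv(train, v = 5, strata = Class)

rec <- recipe(Class ~ ., data = train) %>%
  step_normalize(all_predictors())

knn_spec <- nearest_neighbor() %>%
  set_engine("kknn") %>%
  set_mode("classification")

# The workflow takes the unfitted model specification, not a fitted model
knn_workflow <- workflow() %>%
  add_recipe(rec) %>%
  add_model(knn_spec)

# Resampling returns an object that collect_metrics() understands
knn_res <- fit_resamples(
  knn_workflow,
  resamples = folds,
  metrics = metric_set(accuracy, sensitivity, specificity, roc_auc),
  control = control_resamples(save_pred = TRUE)
)

collect_metrics(knn_res)       # metrics averaged over the folds
collect_predictions(knn_res)   # out-of-fold predictions

# Fit once on the training set and evaluate on the held-out test set
final_res <- last_fit(knn_workflow, split)
collect_metrics(final_res)
```

Because the recipe is part of the workflow, it is prepped inside each resample, so the data does not need to be baked beforehand.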
desertnaut
ViSa
  • I don't think it makes sense to combine `rfe()` with random forest fitting functions together with a k-nearest-neighbor model; you should probably pick one or the other. – Julia Silge Dec 07 '20 at 04:44
  • Hi @JuliaSilge, I was just practicing `tidymodels` by going through all the steps on a real dataset, reducing 200+ variables (after one-hot encoding) to as few as possible using `rfe`, and I will be implementing a set of other algorithms as well, such as `glm`, `boosting`, `svm`, etc. **Issue**: I am unable to collect result metrics or predictions after fitting any model. For example, `glm_fit %>% collect_metrics()` gives an error. I also referred to a video from your channel, https://youtu.be/s3TkvZM60iU?t=2371, but it also demonstrates collecting results only after cross-validation with `fit_resamples()`. – ViSa Dec 07 '20 at 10:08
  • I think `predict(glm_fit, new_data = juiced_dt1_train)` and `predict(glm_fit, new_data = juiced_dt1_train, type = "prob")` will work in this case. Ref: https://parsnip.tidymodels.org/reference/predict.model_fit.html – ViSa Dec 07 '20 at 10:50
  • If you [check out the documentation for those `collect_` functions](https://tune.tidymodels.org/reference/collect_predictions.html), you'll notice that they are only built to work on the output of tuning functions like `tune_grid()` and `fit_resamples()`. – Julia Silge Dec 08 '20 at 04:05
  • yes @JuliaSilge, thanks for sharing the link !! – ViSa Dec 08 '20 at 05:49
