I am using RStudio 2021.09.0 "Ghost Orchid" Release for macOS.

I am learning to use the C5.0 algorithm in R. For this I am following 'Machine Learning with R' by Brett Lantz. The dataset I am using is a modified version of one relating to loans obtained from a credit agency in Germany.

The data has no missing values and no empty factor levels (empty levels caused the same error in other posts I have viewed). I have split the data into training and test tibbles using the initial_split() function in the rsample package. The structure of the training data is:

str(credit_train)

tibble [900 × 21] (S3: tbl_df/tbl/data.frame)
 $ checking_balance    : Factor w/ 4 levels "< 0 DM","> 200 DM",..: 4 1 4 3 3 4 3 4 1 1 ...
 $ months_loan_duration: Factor w/ 33 levels "4","5","6","7",..: 18 22 18 16 30 18 9 9 14 9 ...
 $ credit_history      : Factor w/ 5 levels "critical","delayed",..: 1 1 1 5 4 2 5 5 5 5 ...
 $ purpose             : Factor w/ 10 levels "business","car (new)",..: 8 3 3 1 1 1 8 1 2 2 ...
 $ amount              : num [1:900] 2611 6187 2197 2767 6416 ...
 $ savings_balance     : Factor w/ 5 levels "< 100 DM","> 1000 DM",..: 1 3 5 3 1 1 1 1 3 1 ...
 $ employment_length   : Factor w/ 5 levels "> 7 yrs","0 - 1 yrs",..: 1 4 4 1 1 3 1 4 4 3 ...
 $ installment_rate    : Factor w/ 4 levels "1","2","3","4": 4 1 4 4 4 1 3 2 4 4 ...
 $ personal_status     : Factor w/ 4 levels "divorced male",..: 3 3 4 1 2 4 3 4 4 2 ...
 $ other_debtors       : Factor w/ 3 levels "co-applicant",..: 1 3 3 3 3 3 2 3 3 2 ...
 $ residence_history   : Factor w/ 4 levels "1","2","3","4": 3 4 4 2 3 2 3 4 3 4 ...
 $ property            : Factor w/ 4 levels "building society savings",..: 3 2 2 2 4 4 3 2 2 1 ...
 $ age                 : num [1:900] 46 24 43 61 59 32 40 36 30 29 ...
 $ installment_plan    : Factor w/ 3 levels "bank","none",..: 2 2 2 1 2 2 1 2 2 2 ...
 $ housing             : Factor w/ 3 levels "for free","own",..: 2 3 2 3 3 1 2 2 2 2 ...
 $ existing_credits    : Factor w/ 4 levels "1","2","3","4": 2 2 2 2 1 1 2 1 1 1 ...
 $ default             : Factor w/ 2 levels "paid","default": 1 1 1 2 2 1 1 1 1 1 ...
 $ dependents          : Factor w/ 2 levels "1","2": 1 1 2 1 1 1 1 1 2 1 ...
 $ telephone           : Factor w/ 2 levels "none","yes": 1 1 2 1 1 1 1 2 2 2 ...
 $ foreign_worker      : Factor w/ 2 levels "no","yes": 2 2 2 2 2 2 2 2 2 2 ...
 $ job                 : Factor w/ 4 levels "management self-employed",..: 2 2 2 4 2 2 4 2 1 2 ...

My issue arises specifically when I try to fit a model using a cost matrix; without the cost matrix, the model fits without error. This is how I have created the cost matrix:

error_cost <- matrix(nrow = 2, 
                     ncol = 2,
                     dimnames = list(c('predict_paid','predict_default'), #rows
                                     c('actual_paid','actual_default')), #columns
                     data = c(0, 1, 4, 0))  

I must also point out that I have tried several ways to create this matrix, including literally copying the exact method given in the Lantz book, and they all result in this same error.

Here is the code I am using to try and fit the model.

c5_boostTree <- C5.0(default ~.,
                     credit_train,
                     trials = 3,
                     costs = error_cost)

However, the same thing happens if I use the x/y interface (x = credit_train %>% select(-default), y = credit_train$default) rather than the formula approach, and with any similar approach I can find or think of. I am at a complete loss as to why I am getting this error:

c50 code called exit with value 1

Does anyone have any ideas?

====================================

In response to a request for dput(credit_train), here is the output of dput(head(credit_train)); the full output is too large to post:

structure(list(checking_balance = structure(c(4L, 1L, 4L, 3L, 
3L, 4L), .Label = c("< 0 DM", "> 200 DM", "1 - 200 DM", "unknown"
), class = "factor"), months_loan_duration = structure(c(18L, 
22L, 18L, 16L, 30L, 18L), .Label = c("4", "5", "6", "7", "8", 
"9", "10", "11", "12", "13", "14", "15", "16", "18", "20", "21", 
"22", "24", "26", "27", "28", "30", "33", "36", "39", "40", "42", 
"45", "47", "48", "54", "60", "72"), class = "factor"), credit_history = structure(c(1L, 
1L, 1L, 5L, 4L, 2L), .Label = c("critical", "delayed", "fully repaid", 
"fully repaid this bank", "repaid"), class = "factor"), purpose = structure(c(8L, 
3L, 3L, 1L, 1L, 1L), .Label = c("business", "car (new)", "car (used)", 
"domestic appliances", "education", "furniture", "others", "radio/tv", 
"repairs", "retraining"), class = "factor"), amount = c(2611, 
6187, 2197, 2767, 6416, 3863), savings_balance = structure(c(1L, 
3L, 5L, 3L, 1L, 1L), .Label = c("< 100 DM", "> 1000 DM", "101 - 500 DM", 
"501 - 1000 DM", "unknown"), class = "factor"), employment_length = structure(c(1L, 
4L, 4L, 1L, 1L, 3L), .Label = c("> 7 yrs", "0 - 1 yrs", "1 - 4 yrs", 
"4 - 7 yrs", "unemployed"), class = "factor"), installment_rate = structure(c(4L, 
1L, 4L, 4L, 4L, 1L), .Label = c("1", "2", "3", "4"), class = "factor"), 
    personal_status = structure(c(3L, 3L, 4L, 1L, 2L, 4L), .Label = c("divorced male", 
    "female", "married male", "single male"), class = "factor"), 
    other_debtors = structure(c(1L, 3L, 3L, 3L, 3L, 3L), .Label = c("co-applicant", 
    "guarantor", "none"), class = "factor"), residence_history = structure(c(3L, 
    4L, 4L, 2L, 3L, 2L), .Label = c("1", "2", "3", "4"), class = "factor"), 
    property = structure(c(3L, 2L, 2L, 2L, 4L, 4L), .Label = c("building society savings", 
    "other", "real estate", "unknown/none"), class = "factor"), 
    age = c(46, 24, 43, 61, 59, 32), installment_plan = structure(c(2L, 
    2L, 2L, 1L, 2L, 2L), .Label = c("bank", "none", "stores"), class = "factor"), 
    housing = structure(c(2L, 3L, 2L, 3L, 3L, 1L), .Label = c("for free", 
    "own", "rent"), class = "factor"), existing_credits = structure(c(2L, 
    2L, 2L, 2L, 1L, 1L), .Label = c("1", "2", "3", "4"), class = "factor"), 
    default = structure(c(1L, 1L, 1L, 2L, 2L, 1L), .Label = c("paid", 
    "default"), class = "factor"), dependents = structure(c(1L, 
    1L, 2L, 1L, 1L, 1L), .Label = c("1", "2"), class = "factor"), 
    telephone = structure(c(1L, 1L, 2L, 1L, 1L, 1L), .Label = c("none", 
    "yes"), class = "factor"), foreign_worker = structure(c(2L, 
    2L, 2L, 2L, 2L, 2L), .Label = c("no", "yes"), class = "factor"), 
    job = structure(c(2L, 2L, 2L, 4L, 2L, 2L), .Label = c("management self-employed", 
    "skilled employee", "unemployed non-resident", "unskilled resident"
    ), class = "factor")), row.names = c(NA, -6L), class = c("tbl_df", 
"tbl", "data.frame"))

1 Answer

So after fighting with this issue for days I worked it out. The answer was very simple and staring me in the face the whole time. I had not set up my cost matrix correctly.

I had given the matrix descriptive dimension names ('predict_paid', 'actual_default' and so on) because I found them more readable than the factor levels they represented. However, the dimnames of the cost matrix must be exactly the levels of the outcome factor it refers to, which here are 'paid' and 'default'.

I had mistakenly thought that C5.0() transmogrified this under the hood somewhere. My mistake!
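
For anyone hitting the same error, here is a minimal sketch of the corrected matrix. It assumes the outcome factor has the levels "paid" and "default" in that order, as shown in the str() output in the question; the stand-in factor outcome and the name lvls are only there so the snippet runs on its own. With real data, take the levels from credit_train$default instead.

```r
# Stand-in for credit_train$default so the snippet is self-contained
# (assumption: same levels, same order, as in the question's str() output).
outcome <- factor(c("paid", "paid", "default"), levels = c("paid", "default"))
lvls <- levels(outcome)

# Rows are the predicted class, columns are the actual class.
# matrix() fills column by column, so c(0, 1, 4, 0) gives:
#   predicted "default" when actually "paid"    -> cost 1
#   predicted "paid"    when actually "default" -> cost 4
error_cost <- matrix(c(0, 1, 4, 0),
                     nrow = 2,
                     ncol = 2,
                     dimnames = list(lvls,   # rows: predicted
                                     lvls))  # columns: actual

print(error_cost)

# The matrix would then be passed to the same call as in the question:
# c5_boostTree <- C5.0(default ~ ., credit_train, trials = 3, costs = error_cost)
```

Building the dimnames from levels() rather than typing them by hand keeps the matrix in step with the factor if the levels are ever renamed or reordered.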