
I have already created my XGBoost classifier in R, as shown in the code below:

#importing the dataset
XGBoostDataSet_Hr_Admin_8 <- read.csv("CompletedDataImputed_HR_Admin.csv")

#Use factor function to convert categorical data to numerical data
XGBoostDataSet_Hr_Admin_8$Salary = as.numeric(factor(XGBoostDataSet_Hr_Admin_8$Salary, levels =c('L','M', 'H', 'V'), labels =c(1,2,3,4)))
XGBoostDataSet_Hr_Admin_8$Rude_Behavior = as.numeric(factor(XGBoostDataSet_Hr_Admin_8$Rude_Behavior, levels=c('Y', 'M', 'N'), labels =c(1,2,3)))
XGBoostDataSet_Hr_Admin_8$Feeling_undervalued =as.numeric(factor(XGBoostDataSet_Hr_Admin_8$Feeling_undervalued, levels=c('Y', 'M', 'N'), labels =c(1,2,3)))
XGBoostDataSet_Hr_Admin_8$Overall_satisfaction = as.numeric(factor(XGBoostDataSet_Hr_Admin_8$Overall_satisfaction, levels=c('Y', 'M', 'N'), labels =c(1,2,3)))
XGBoostDataSet_Hr_Admin_8$Raises_frozen = as.numeric(factor(XGBoostDataSet_Hr_Admin_8$Raises_frozen, levels=c('Y', 'M', 'N'), labels =c(1,2,3)))
XGBoostDataSet_Hr_Admin_8$Poor_Conditions = as.numeric(factor(XGBoostDataSet_Hr_Admin_8$Poor_Conditions, levels=c('Y', 'M', 'N'), labels =c(1,2,3)))
XGBoostDataSet_Hr_Admin_8$Growth_not_available = as.numeric(factor(XGBoostDataSet_Hr_Admin_8$Growth_not_available, levels=c('Y', 'M', 'N'), labels =c(1,2,3)))
XGBoostDataSet_Hr_Admin_8$Workplace_Conflict = as.numeric(factor(XGBoostDataSet_Hr_Admin_8$Workplace_Conflict, levels=c('Y', 'M', 'N'), labels =c(1,2,3)))
XGBoostDataSet_Hr_Admin_8$Employee_Turnover = as.numeric(factor(XGBoostDataSet_Hr_Admin_8$Employee_Turnover, levels=c('Y', 'N'), labels =c(1,0)))

#split the data into training and test datasets
library(caTools)
split = sample.split(XGBoostDataSet_Hr_Admin_8$Employee_Turnover,SplitRatio = 0.7)
training_set8 = subset(XGBoostDataSet_Hr_Admin_8, split==TRUE)
test_set8 = subset(XGBoostDataSet_Hr_Admin_8, split==FALSE)

#fitting XGBoost to the Training Set
library(xgboost)
classifier9 = xgboost(data = as.matrix(training_set8[-10]), label = training_set8$Employee_Turnover, nrounds = 10)

Now I need to create a confusion matrix for the XGBoost model.

I have searched online but unfortunately can't find a solution.

Can anyone please help me out?

Thanks in advance.

– ibocus

3 Answers


You can use the caret::confusionMatrix() function, but it needs some work on your output. You need a vector of the actual outcomes (from the test dataset) to compare against the predicted results:

library(xgboost)

#importing the dataset (as in the question)
XGBoostDataSet_Hr_Admin_8 <- read.csv("CompletedDataImputed_HR_Admin.csv")

#Use factor function to convert categorical data to numerical data
XGBoostDataSet_Hr_Admin_8$Salary = as.numeric(factor(XGBoostDataSet_Hr_Admin_8$Salary, levels =c('L','M', 'H', 'V'), labels =c(1,2,3,4)))
XGBoostDataSet_Hr_Admin_8$Rude_Behavior = as.numeric(factor(XGBoostDataSet_Hr_Admin_8$Rude_Behavior, levels=c('Y', 'M', 'N'), labels =c(1,2,3)))
XGBoostDataSet_Hr_Admin_8$Feeling_undervalued =as.numeric(factor(XGBoostDataSet_Hr_Admin_8$Feeling_undervalued, levels=c('Y', 'M', 'N'), labels =c(1,2,3)))
XGBoostDataSet_Hr_Admin_8$Overall_satisfaction = as.numeric(factor(XGBoostDataSet_Hr_Admin_8$Overall_satisfaction, levels=c('Y', 'M', 'N'), labels =c(1,2,3)))
XGBoostDataSet_Hr_Admin_8$Raises_frozen = as.numeric(factor(XGBoostDataSet_Hr_Admin_8$Raises_frozen, levels=c('Y', 'M', 'N'), labels =c(1,2,3)))
XGBoostDataSet_Hr_Admin_8$Poor_Conditions = as.numeric(factor(XGBoostDataSet_Hr_Admin_8$Poor_Conditions, levels=c('Y', 'M', 'N'), labels =c(1,2,3)))
XGBoostDataSet_Hr_Admin_8$Growth_not_available = as.numeric(factor(XGBoostDataSet_Hr_Admin_8$Growth_not_available, levels=c('Y', 'M', 'N'), labels =c(1,2,3)))
XGBoostDataSet_Hr_Admin_8$Workplace_Conflict = as.numeric(factor(XGBoostDataSet_Hr_Admin_8$Workplace_Conflict, levels=c('Y', 'M', 'N'), labels =c(1,2,3)))
XGBoostDataSet_Hr_Admin_8$Employee_Turnover = as.numeric(factor(XGBoostDataSet_Hr_Admin_8$Employee_Turnover, levels=c('Y', 'N'), labels =c(1,0)))

# recode the label to a 0/1 variable: as.numeric() on the factor above returns its integer codes (1/2), not the labels (1/0)
XGBoostDataSet_Hr_Admin_8$Employee_Turnover = ifelse(XGBoostDataSet_Hr_Admin_8$Employee_Turnover == 1,0,1)

library(caTools)


split = sample.split(XGBoostDataSet_Hr_Admin_8$Employee_Turnover,SplitRatio = 0.7)
training_set8 = subset(XGBoostDataSet_Hr_Admin_8, split==TRUE)
test_set8 = subset(XGBoostDataSet_Hr_Admin_8, split==FALSE)

bst <- xgboost(data = as.matrix(training_set8[,-10]), label = training_set8$Employee_Turnover, max_depth = 2,
               eta = 0.5, nthread = 2, nrounds = 5, objective = "binary:logistic")  

# predict on the test set
pred <- predict(bst, as.matrix(test_set8[,-10]))

# convert the predicted probabilities to a 0/1 variable; you can choose the cutoff for predicting 1 (here 0.5)
pred <-  as.numeric(pred > 0.5)

library(caret)
confusionMatrix(factor(pred),factor(test_set8$Employee_Turnover))

Confusion Matrix and Statistics

          Reference
Prediction  0  1
         0 67  2
         1  0 16

               Accuracy : 0.9765          
                 95% CI : (0.9176, 0.9971)
    No Information Rate : 0.7882          
    P-Value [Acc > NIR] : 4.626e-07       

                  Kappa : 0.9265          

 Mcnemar's Test P-Value : 0.4795          

            Sensitivity : 1.0000          
            Specificity : 0.8889          
         Pos Pred Value : 0.9710          
         Neg Pred Value : 1.0000          
             Prevalence : 0.7882          
         Detection Rate : 0.7882          
   Detection Prevalence : 0.8118          
      Balanced Accuracy : 0.9444          

       'Positive' Class : 0   
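
If you need the individual numbers programmatically, the object returned by caret::confusionMatrix() can be stored and indexed. A minimal sketch, reusing the pred and test_set8 objects from above (cm is just a local name introduced here):

cm <- confusionMatrix(factor(pred), factor(test_set8$Employee_Turnover))
cm$table                   # the 2x2 confusion matrix
cm$overall["Accuracy"]     # overall accuracy
cm$byClass["Sensitivity"]  # per-class statistics, e.g. sensitivity
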
– s__
  • Hello s_t, am getting this error when trying this part pred <- predict(classifier8, test_set8$Employee_Turnover ,type = "response") Error in xgb.DMatrix(newdata, missing = missing) : xgb.DMatrix does not support construction from double can you please guide me from where I have already generate my classifier as follows classifier8 = xgboost(data = as.matrix(training_set8[-10]), label = training_set8$Employee_Turnover, nrounds = 10) – ibocus Nov 13 '19 at 14:04
  • hey s_t and ibocus, I just saw your answer, i think @ibocus did not use xgb.DMatrix like the example, but just fed xgboost a matrix and labels (which I am not sure of) – StupidWolf Nov 13 '19 at 14:09
  • i have updated the full script in my question, can you please have a look and see how from there i can create the confusion matrix – ibocus Nov 13 '19 at 14:23
  • Unluckily without your data is useless (try to post if you can train and test, using `dput()`function, or some fake one similar to yours). However have you tried: `pred <- predict(classifier8, test_set8 ,type = "response") `? – s__ Nov 13 '19 at 14:25
  • try your code pred <- predict(classifier8, test_set8 ,type = "response"), however i got this error, "Error in xgb.DMatrix(newdata, missing = missing) : xgb.DMatrix does not support construction from list" – ibocus Nov 13 '19 at 14:29
  • @s_t, can i know what is the 400 means in the code, i have 284 records in my dataset, should i change this from 400 to 284 and secondly i have 10 columns in my dataset and why there are 9 when converting is because one column is percentage_increment and it values range from 0 to 100, what i suppose to do with this column? – ibocus Nov 13 '19 at 15:28
  • @Ibocus these are **fake** data to simulate yours to demonstrate that this code works, edited with data similar to yours. – s__ Nov 13 '19 at 15:42
  • @s_t, it works, thanking you very much :), i owe you this – ibocus Nov 13 '19 at 15:53
  • @ibocus you're welcome, I've edited with the real data published above, if you need it. – s__ Nov 13 '19 at 15:56
  • @s_t, is it normal that when adding up the values of the confusion matrix, the total is 85 and though i have 284 records in my dataset – ibocus Nov 13 '19 at 16:01
  • Yes because you're predicting only the test dataset, which: `length(pred); nrow(test_set8);` are equal to 85, so you're predicting for each row of test, training the model on the training dataset. – s__ Nov 13 '19 at 16:08
  • @s_t, you are absolutely right, i just got confused.. thanks a lot for making it crystal clear for me.. thanks :) – ibocus Nov 13 '19 at 16:16
  • @s_t, i was just wondering where did you accumulate this huge amount of knowledge, is it possible to share your story – ibocus Nov 13 '19 at 16:18

Please provide some example data when asking questions in the future.

The code below creates a confusion matrix using the example from predict.xgb.Booster:

library("xgboost")
data(agaricus.train, package='xgboost')
data(agaricus.test, package='xgboost')
train <- agaricus.train
test <- agaricus.test

bst <- xgboost(data = train$data, label = train$label, max_depth = 2,
               eta = 0.5, nthread = 2, nrounds = 5, objective = "binary:logistic")
## Predict class probability for new data
pred <- predict(bst, test$data)
## Use arbitrary cutoff of 0.5 for classifier
table(test$label, as.numeric(pred > 0.5))
#        0    1
#  0   825   10
#  1     9  767
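
If you want caret's full set of statistics instead of the plain table, the same predictions can be passed to confusionMatrix(). A minimal sketch along the lines of the answer above, reusing pred and test from this example:

library(caret)
# same 0.5 cutoff, but with caret's additional statistics
confusionMatrix(factor(as.numeric(pred > 0.5)), factor(test$label))
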
– alan ocallaghan

First, note that you need to convert your training_set8$Employee_Turnover to 0s and 1s. Hopefully you have done that; if not, see my example below.

Second, you need to specify objective = "binary:logistic" when calling xgboost(); this is what makes it do classification.

So starting with what you have:

library(caTools)
library(xgboost)
library(caret)
set.seed(12345) # for reproducible results

XGBoostDataSet_Hr_Admin_8 <- read.csv("CompletedDataImputed_HR_Admin.csv")

#Use factor function to convert categorical data to numerical data
XGBoostDataSet_Hr_Admin_8$Salary = as.numeric(factor(XGBoostDataSet_Hr_Admin_8$Salary, levels =c('L','M', 'H', 'V'), labels =c(1,2,3,4)))
XGBoostDataSet_Hr_Admin_8$Rude_Behavior = as.numeric(factor(XGBoostDataSet_Hr_Admin_8$Rude_Behavior, levels=c('Y', 'M', 'N'), labels =c(1,2,3)))
XGBoostDataSet_Hr_Admin_8$Feeling_undervalued =as.numeric(factor(XGBoostDataSet_Hr_Admin_8$Feeling_undervalued, levels=c('Y', 'M', 'N'), labels =c(1,2,3)))
XGBoostDataSet_Hr_Admin_8$Overall_satisfaction = as.numeric(factor(XGBoostDataSet_Hr_Admin_8$Overall_satisfaction, levels=c('Y', 'M', 'N'), labels =c(1,2,3)))
XGBoostDataSet_Hr_Admin_8$Raises_frozen = as.numeric(factor(XGBoostDataSet_Hr_Admin_8$Raises_frozen, levels=c('Y', 'M', 'N'), labels =c(1,2,3)))
XGBoostDataSet_Hr_Admin_8$Poor_Conditions = as.numeric(factor(XGBoostDataSet_Hr_Admin_8$Poor_Conditions, levels=c('Y', 'M', 'N'), labels =c(1,2,3)))
XGBoostDataSet_Hr_Admin_8$Growth_not_available = as.numeric(factor(XGBoostDataSet_Hr_Admin_8$Growth_not_available, levels=c('Y', 'M', 'N'), labels =c(1,2,3)))
XGBoostDataSet_Hr_Admin_8$Workplace_Conflict = as.numeric(factor(XGBoostDataSet_Hr_Admin_8$Workplace_Conflict, levels=c('Y', 'M', 'N'), labels =c(1,2,3)))

For this part, we set the labels correctly as 0 and 1:

#set levels
lvl = c('N', 'Y')
# done in a few steps, because the one-line factor conversion is too long to read
lb = as.character(XGBoostDataSet_Hr_Admin_8$Employee_Turnover)
lb = as.numeric(factor(lb,levels=lvl))-1
XGBoostDataSet_Hr_Admin_8$Employee_Turnover = lb
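
A quick sanity check that the recoding worked (this should show only 0s and 1s, with 0 = 'N' and 1 = 'Y'):

table(XGBoostDataSet_Hr_Admin_8$Employee_Turnover)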

And we do the split into train + test as you have:

#split the data into training and test datasets
split = sample.split(XGBoostDataSet_Hr_Admin_8$Employee_Turnover,SplitRatio = 0.7)
training_set8 = subset(XGBoostDataSet_Hr_Admin_8, split==TRUE)
test_set8 = subset(XGBoostDataSet_Hr_Admin_8, split==FALSE)

Do the fit:

#fitting XGBoost to the Training Set
#(as noted above, you can also pass objective = "binary:logistic" here to make the objective explicit)
classifier9 = xgboost(data = as.matrix(training_set8[-10]),
                      label = training_set8$Employee_Turnover, nrounds = 10)

Now we get the predictions as probabilities and convert them to labels:

# note: these are predictions on the training set (the test set can be handled the same way; see below)
pred <- predict(classifier9, as.matrix(training_set8[-10]))
# we convert the probabilities to predicted labels
pred_label <- lvl[as.numeric(pred>0.5)+1]
# we get the observed labels
actual_label <- lvl[as.numeric(training_set8$Employee_Turnover)+1]

Finally, the confusion matrix:

# confusion matrix
table(pred_label,actual_label)
          actual_label
pred_label   N   Y
         N  41   0
         Y   0 158

Or using caret:

confusionMatrix(factor(pred_label, levels=lvl),
                factor(actual_label, levels=lvl))
    Confusion Matrix and Statistics

              Reference
    Prediction   N   Y
             N  41   0
             Y   0 158
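
As discussed in the comments below, the matrices above are computed on the training set. To see how the model holds up on unseen data, repeat the same steps on the held-out test set (a minimal sketch reusing the objects created above):

# predict on the test set instead of the training set
pred_test <- predict(classifier9, as.matrix(test_set8[-10]))
pred_test_label <- lvl[as.numeric(pred_test > 0.5) + 1]
actual_test_label <- lvl[as.numeric(test_set8$Employee_Turnover) + 1]
table(pred_test_label, actual_test_label)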

This is the actual data (kindly provided by OP):

structure(list(Salary = structure(c(2L, 3L, 2L, 3L, 2L, 3L, 2L, 
2L, 3L, 2L, 2L, 2L, 2L, 2L, 2L, 3L, 2L, 2L, 3L, 3L, 2L, 2L, 2L, 
3L, 2L, 3L, 2L, 2L, 3L, 1L, 2L, 2L, 3L, 3L, 2L, 3L, 2L, 2L, 2L, 
2L, 2L, 2L, 2L, 2L, 3L, 3L, 3L, 3L, 2L, 3L, 2L, 2L, 2L, 2L, 2L, 
3L, 1L, 3L, 2L, 2L, 2L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 2L, 3L, 2L, 
2L, 3L, 2L, 3L, 2L, 2L, 3L, 2L, 3L, 2L, 2L, 2L, 2L, 2L, 2L, 3L, 
2L, 2L, 3L, 3L, 2L, 3L, 2L, 3L, 2L, 2L, 3L, 3L, 2L, 3L, 2L, 3L, 
3L, 2L, 2L, 2L, 2L, 2L, 3L, 2L, 2L, 3L, 2L, 3L, 2L, 3L, 3L, 2L, 
2L, 3L, 2L, 2L, 2L, 3L, 3L, 2L, 3L, 2L, 3L, 2L, 2L, 3L, 3L, 3L, 
2L, 3L, 2L, 2L, 2L, 3L, 2L, 2L, 3L, 3L, 3L, 2L, 2L, 3L, 2L, 2L, 
2L, 2L, 3L, 2L, 3L, 2L, 3L, 3L, 2L, 2L, 2L, 2L, 2L, 1L, 2L, 2L, 
2L, 2L, 3L, 2L, 3L, 2L, 3L, 1L, 2L, 2L, 2L, 2L, 2L, 3L, 3L, 3L, 
2L, 2L, 2L, 2L, 2L, 2L, 3L, 2L, 3L, 2L, 3L, 3L, 2L, 2L, 2L, 3L, 
3L, 2L, 2L, 3L, 2L, 3L, 2L, 3L, 3L, 3L, 1L, 2L, 2L, 2L, 3L, 3L, 
3L, 3L, 1L, 3L, 2L, 1L, 3L, 3L, 2L, 1L, 3L, 3L, 1L, 3L, 3L, 2L, 
3L, 3L, 3L, 2L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 2L, 
3L, 3L, 4L, 3L, 3L, 3L, 3L, 2L, 3L, 2L, 3L, 3L, 3L, 3L, 2L, 3L, 
3L, 2L, 2L, 3L, 2L, 3L, 3L, 3L, 2L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 
2L, 3L, 3L, 2L, 3L), .Label = c("H", "L", "M", "V"), class = "factor"), 
    Percentage_Increment = c(5, 10, 7, 7, 5, 7, 5, 5, 10, 5, 
    5, 5, 5, 5, 5, 10, 5, 5, 10, 10, 5, 5, 5, 5, 5, 5, 5, 5, 
    5, 10, 5, 5, 5, 5, 5, 10, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 7, 
    5, 5, 10, 7, 5, 5, 5, 5, 10, 10, 10, 5, 5, 5, 7, 10, 5, 5, 
    5, 7, 10, 5, 7, 5, 5, 10, 10, 10, 5, 5, 10, 5, 5, 5, 5, 5, 
    5, 5, 5, 10, 5, 5, 7, 7, 5, 10, 5, 5, 5, 5, 5, 7, 5, 10, 
    5, 5, 5, 5, 5, 5, 5, 5, 7, 5, 5, 5, 5, 5, 5, 5, 10, 5, 5, 
    5, 5, 5, 5, 5, 7, 5, 5, 5, 5, 5, 5, 5, 5, 10, 5, 10, 5, 5, 
    5, 7, 5, 7, 10, 7, 10, 5, 10, 10, 5, 7, 5, 5, 10, 5, 5, 5, 
    10, 5, 7, 5, 5, 5, 5, 10, 3, 5, 5, 10, 10, 5, 5, 7, 10, 5, 
    5, 5, 5, 5, 5, 5, 10, 5, 7, 5, 5, 5, 5, 5, 7, 5, 7, 5, 5, 
    5, 5, 5, 5, 5, 5, 5, 5, 7, 5, 7, 5, 5, 5, 10, 10, 5, 5, 5, 
    10, 5, 10, 10, 10, 10, 7, 5, 7, 5, 5, 10, 1, 10, 30, 1, 0.02, 
    5, 1, 11, 1, 3, 10, 1, 11, 1, 5, 10, 2.2, 18, 4, 10, 8, 1, 
    5, 9, 5, 4, 15, 15, 4, 10, 12, 1, 9, 3, 2.5, 5, 20, 30, 10, 
    5, 100, 10, 1, 1, 8, 1, 1, 2, 1, 5, 10, 1, 50, 50, 2, 3, 
    25, 1, 1), Rude_Behavior = structure(c(3L, 3L, 3L, 3L, 3L, 
    3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 
    3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 1L, 
    3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 
    3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 1L, 3L, 
    3L, 1L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 1L, 3L, 3L, 
    3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 
    3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 
    3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 1L, 3L, 3L, 
    3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 
    3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 1L, 3L, 3L, 2L, 3L, 3L, 3L, 
    3L, 1L, 3L, 3L, 1L, 3L, 3L, 3L, 3L, 3L, 3L, 1L, 3L, 3L, 3L, 
    3L, 3L, 3L, 3L, 1L, 3L, 3L, 3L, 3L, 1L, 3L, 3L, 3L, 3L, 3L, 
    3L, 3L, 1L, 3L, 2L, 1L, 3L, 3L, 3L, 3L, 3L, 2L, 3L, 3L, 2L, 
    3L, 2L, 3L, 2L, 3L, 3L, 3L, 3L, 1L, 2L, 3L, 3L, 3L, 2L, 3L, 
    3L, 1L, 2L, 3L, 3L, 1L, 3L, 3L, 3L, 1L, 2L, 3L, 2L, 1L, 1L, 
    2L, 1L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 1L, 2L, 
    1L, 2L, 1L, 2L, 2L, 2L, 1L, 3L, 2L, 3L, 3L, 3L, 2L, 2L, 3L, 
    3L, 2L, 2L, 2L, 2L, 2L, 2L, 3L, 2L, 3L, 2L, 2L, 3L, 2L, 3L, 
    2L, 2L, 2L, 3L, 3L, 2L, 2L, 3L, 1L), .Label = c("M", "N", 
    "Y"), class = "factor"), Feeling_undervalued = structure(c(1L, 
    2L, 3L, 1L, 3L, 3L, 1L, 3L, 3L, 1L, 3L, 1L, 3L, 3L, 1L, 3L, 
    3L, 2L, 3L, 2L, 3L, 3L, 2L, 3L, 3L, 2L, 3L, 3L, 3L, 2L, 3L, 
    3L, 1L, 3L, 3L, 2L, 3L, 3L, 1L, 3L, 3L, 2L, 3L, 3L, 1L, 2L, 
    3L, 3L, 3L, 2L, 3L, 2L, 3L, 3L, 3L, 3L, 2L, 1L, 3L, 3L, 3L, 
    2L, 3L, 1L, 3L, 3L, 3L, 2L, 1L, 3L, 3L, 3L, 3L, 3L, 3L, 2L, 
    3L, 3L, 3L, 1L, 3L, 3L, 2L, 3L, 3L, 2L, 3L, 3L, 1L, 3L, 3L, 
    3L, 2L, 3L, 3L, 1L, 3L, 3L, 2L, 1L, 3L, 3L, 3L, 3L, 3L, 3L, 
    3L, 2L, 3L, 2L, 3L, 3L, 2L, 1L, 3L, 3L, 3L, 3L, 1L, 3L, 3L, 
    3L, 3L, 3L, 3L, 2L, 3L, 1L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 
    2L, 3L, 3L, 3L, 2L, 3L, 3L, 3L, 3L, 3L, 3L, 2L, 3L, 1L, 3L, 
    3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 1L, 3L, 3L, 3L, 
    2L, 3L, 3L, 3L, 3L, 3L, 3L, 1L, 3L, 3L, 3L, 3L, 2L, 3L, 1L, 
    3L, 3L, 1L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 
    3L, 3L, 3L, 2L, 3L, 3L, 3L, 3L, 2L, 3L, 3L, 3L, 3L, 2L, 3L, 
    3L, 3L, 3L, 3L, 2L, 3L, 3L, 3L, 1L, 3L, 3L, 2L, 1L, 2L, 1L, 
    3L, 2L, 2L, 2L, 1L, 3L, 2L, 2L, 3L, 2L, 2L, 2L, 2L, 2L, 1L, 
    2L, 2L, 1L, 3L, 1L, 2L, 3L, 1L, 3L, 1L, 1L, 2L, 3L, 3L, 1L, 
    2L, 1L, 3L, 1L, 2L, 1L, 2L, 1L, 2L, 1L, 2L, 3L, 2L, 1L, 3L, 
    2L, 3L, 3L, 2L, 2L, 2L, 2L, 2L, 3L, 1L, 2L, 3L, 2L), .Label = c("M", 
    "N", "Y"), class = "factor"), Overall_satisfaction = structure(c(2L, 
    3L, 3L, 3L, 2L, 3L, 3L, 3L, 1L, 3L, 2L, 3L, 3L, 2L, 3L, 2L, 
    3L, 3L, 3L, 1L, 3L, 3L, 3L, 1L, 3L, 3L, 3L, 1L, 3L, 3L, 3L, 
    3L, 2L, 1L, 3L, 3L, 3L, 2L, 3L, 3L, 1L, 2L, 3L, 3L, 3L, 3L, 
    2L, 3L, 2L, 3L, 3L, 3L, 1L, 3L, 3L, 3L, 3L, 1L, 1L, 2L, 3L, 
    3L, 3L, 3L, 3L, 3L, 2L, 3L, 3L, 3L, 3L, 2L, 3L, 3L, 3L, 3L, 
    3L, 3L, 1L, 2L, 3L, 3L, 2L, 3L, 1L, 1L, 3L, 3L, 3L, 3L, 2L, 
    1L, 3L, 3L, 3L, 3L, 3L, 2L, 3L, 1L, 3L, 3L, 3L, 2L, 3L, 3L, 
    3L, 3L, 3L, 3L, 3L, 3L, 2L, 3L, 3L, 3L, 2L, 3L, 3L, 3L, 3L, 
    3L, 1L, 3L, 3L, 2L, 3L, 3L, 3L, 3L, 2L, 3L, 3L, 3L, 3L, 3L, 
    3L, 3L, 3L, 3L, 1L, 3L, 3L, 3L, 2L, 3L, 3L, 3L, 3L, 3L, 3L, 
    2L, 3L, 3L, 2L, 3L, 3L, 1L, 3L, 3L, 2L, 3L, 3L, 3L, 3L, 2L, 
    3L, 3L, 3L, 3L, 1L, 3L, 3L, 2L, 3L, 3L, 3L, 3L, 2L, 3L, 3L, 
    3L, 3L, 3L, 2L, 3L, 3L, 3L, 1L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 
    3L, 3L, 1L, 2L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 1L, 3L, 3L, 3L, 
    1L, 2L, 3L, 3L, 3L, 3L, 2L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 2L, 
    3L, 2L, 2L, 2L, 2L, 3L, 2L, 2L, 2L, 2L, 2L, 1L, 1L, 2L, 2L, 
    2L, 2L, 2L, 2L, 2L, 2L, 1L, 2L, 3L, 1L, 2L, 3L, 3L, 3L, 2L, 
    2L, 2L, 2L, 2L, 1L, 1L, 2L, 2L, 2L, 1L, 2L, 3L, 2L, 1L, 1L, 
    2L, 3L, 1L, 2L, 1L, 2L, 2L, 2L, 3L, 1L, 2L, 3L, 1L), .Label = c("M", 
    "N", "Y"), class = "factor"), Poor_Conditions = structure(c(3L, 
    1L, 3L, 2L, 3L, 3L, 3L, 1L, 2L, 3L, 1L, 3L, 3L, 1L, 2L, 3L, 
    3L, 3L, 3L, 1L, 3L, 2L, 3L, 3L, 2L, 3L, 3L, 3L, 3L, 3L, 2L, 
    3L, 1L, 3L, 3L, 2L, 3L, 1L, 3L, 3L, 3L, 1L, 3L, 2L, 3L, 1L, 
    3L, 2L, 3L, 3L, 2L, 3L, 3L, 1L, 3L, 1L, 3L, 3L, 2L, 3L, 3L, 
    3L, 2L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 2L, 3L, 3L, 2L, 3L, 3L, 
    3L, 2L, 3L, 1L, 3L, 3L, 1L, 2L, 3L, 3L, 3L, 3L, 2L, 3L, 3L, 
    1L, 1L, 3L, 1L, 3L, 3L, 2L, 3L, 3L, 1L, 3L, 3L, 3L, 3L, 2L, 
    3L, 3L, 3L, 3L, 3L, 3L, 3L, 2L, 3L, 3L, 1L, 2L, 3L, 3L, 3L, 
    3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 2L, 3L, 1L, 3L, 3L, 1L, 
    3L, 2L, 3L, 3L, 3L, 3L, 3L, 1L, 2L, 3L, 3L, 3L, 3L, 2L, 3L, 
    3L, 3L, 3L, 2L, 3L, 3L, 3L, 3L, 2L, 3L, 3L, 3L, 3L, 1L, 3L, 
    3L, 1L, 2L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 2L, 
    3L, 3L, 3L, 3L, 3L, 3L, 2L, 3L, 3L, 3L, 1L, 3L, 3L, 3L, 2L, 
    3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 2L, 3L, 3L, 3L, 3L, 1L, 3L, 
    2L, 2L, 2L, 3L, 3L, 3L, 3L, 3L, 2L, 3L, 3L, 3L, 3L, 1L, 3L, 
    3L, 1L, 3L, 1L, 2L, 3L, 3L, 2L, 2L, 1L, 2L, 2L, 1L, 2L, 1L, 
    2L, 2L, 1L, 1L, 1L, 2L, 1L, 2L, 2L, 2L, 3L, 3L, 1L, 3L, 3L, 
    1L, 2L, 1L, 2L, 1L, 1L, 2L, 1L, 1L, 1L, 1L, 3L, 1L, 2L, 3L, 
    3L, 1L, 1L, 2L, 2L, 2L, 2L, 2L, 3L, 2L, 2L, 2L, 2L), .Label = c("M", 
    "N", "Y"), class = "factor"), Raises_frozen = structure(c(2L, 
    3L, 3L, 2L, 2L, 3L, 3L, 3L, 2L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 
    3L, 3L, 2L, 3L, 3L, 2L, 3L, 2L, 3L, 3L, 2L, 2L, 2L, 3L, 3L, 
    3L, 2L, 2L, 3L, 2L, 3L, 3L, 3L, 3L, 3L, 2L, 3L, 3L, 3L, 2L, 
    3L, 3L, 3L, 3L, 2L, 3L, 2L, 2L, 3L, 3L, 3L, 2L, 3L, 3L, 2L, 
    2L, 2L, 2L, 3L, 3L, 3L, 3L, 3L, 2L, 3L, 3L, 2L, 2L, 2L, 3L, 
    2L, 2L, 3L, 3L, 2L, 2L, 3L, 2L, 3L, 2L, 2L, 2L, 2L, 3L, 3L, 
    2L, 3L, 2L, 3L, 3L, 3L, 2L, 2L, 3L, 2L, 2L, 3L, 2L, 3L, 2L, 
    3L, 2L, 3L, 2L, 3L, 3L, 3L, 2L, 3L, 3L, 2L, 2L, 3L, 3L, 3L, 
    3L, 3L, 3L, 3L, 2L, 3L, 3L, 3L, 3L, 3L, 3L, 2L, 3L, 2L, 3L, 
    2L, 3L, 3L, 2L, 2L, 3L, 2L, 3L, 2L, 3L, 3L, 2L, 3L, 3L, 3L, 
    3L, 3L, 2L, 3L, 2L, 3L, 2L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 
    3L, 3L, 2L, 3L, 3L, 2L, 3L, 2L, 3L, 3L, 3L, 3L, 3L, 3L, 2L, 
    2L, 3L, 2L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 
    3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 
    3L, 3L, 3L, 3L, 3L, 2L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 
    2L, 3L, 3L, 3L, 1L, 2L, 3L, 3L, 3L, 1L, 3L, 3L, 3L, 2L, 3L, 
    3L, 3L, 3L, 1L, 1L, 3L, 3L, 3L, 3L, 1L, 2L, 3L, 3L, 1L, 3L, 
    1L, 1L, 1L, 3L, 3L, 1L, 3L, 1L, 3L, 3L, 1L, 2L, 3L, 2L, 1L, 
    3L, 2L, 2L, 2L, 3L, 3L, 3L, 3L, 2L, 3L, 1L, 2L, 1L), .Label = c("M", 
    "N", "Y"), class = "factor"), Growth_not_available = structure(c(1L, 
    3L, 1L, 3L, 2L, 2L, 3L, 3L, 3L, 3L, 1L, 3L, 3L, 1L, 3L, 3L, 
    2L, 3L, 1L, 3L, 3L, 3L, 3L, 1L, 3L, 2L, 3L, 2L, 3L, 2L, 3L, 
    2L, 3L, 2L, 3L, 3L, 2L, 3L, 3L, 2L, 3L, 3L, 3L, 2L, 3L, 3L, 
    3L, 3L, 1L, 3L, 3L, 2L, 3L, 3L, 3L, 2L, 3L, 1L, 3L, 2L, 1L, 
    3L, 3L, 3L, 1L, 3L, 2L, 3L, 3L, 2L, 3L, 3L, 1L, 3L, 3L, 3L, 
    1L, 3L, 2L, 3L, 3L, 3L, 3L, 2L, 3L, 3L, 3L, 2L, 1L, 3L, 3L, 
    3L, 2L, 3L, 3L, 3L, 2L, 3L, 3L, 3L, 1L, 3L, 2L, 3L, 3L, 1L, 
    3L, 3L, 1L, 2L, 3L, 1L, 3L, 3L, 3L, 3L, 2L, 1L, 3L, 2L, 3L, 
    3L, 1L, 2L, 2L, 2L, 3L, 3L, 3L, 2L, 1L, 3L, 3L, 2L, 3L, 3L, 
    3L, 3L, 3L, 3L, 1L, 3L, 2L, 3L, 1L, 3L, 2L, 3L, 3L, 3L, 3L, 
    3L, 3L, 3L, 3L, 1L, 2L, 3L, 2L, 3L, 3L, 3L, 3L, 2L, 3L, 3L, 
    3L, 3L, 3L, 3L, 3L, 1L, 3L, 2L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 
    3L, 3L, 1L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 2L, 3L, 
    3L, 2L, 3L, 3L, 3L, 3L, 2L, 3L, 1L, 3L, 3L, 3L, 3L, 2L, 3L, 
    3L, 1L, 3L, 3L, 3L, 2L, 3L, 3L, 1L, 3L, 2L, 3L, 3L, 3L, 2L, 
    2L, 3L, 2L, 3L, 3L, 1L, 3L, 3L, 1L, 3L, 3L, 3L, 3L, 3L, 3L, 
    1L, 3L, 2L, 2L, 1L, 1L, 2L, 3L, 3L, 1L, 3L, 2L, 1L, 2L, 2L, 
    1L, 2L, 1L, 3L, 1L, 3L, 3L, 2L, 3L, 3L, 3L, 2L, 3L, 2L, 2L, 
    3L, 2L, 1L, 3L, 1L, 3L, 3L, 3L, 3L, 1L, 1L, 2L, 3L), .Label = c("M", 
    "N", "Y"), class = "factor"), Workplace_Conflict = structure(c(3L, 
    3L, 3L, 3L, 3L, 3L, 1L, 3L, 2L, 3L, 1L, 3L, 3L, 3L, 2L, 3L, 
    3L, 2L, 3L, 1L, 3L, 3L, 2L, 3L, 1L, 3L, 3L, 1L, 3L, 2L, 3L, 
    3L, 3L, 3L, 2L, 3L, 3L, 2L, 1L, 3L, 3L, 2L, 1L, 3L, 3L, 3L, 
    2L, 3L, 1L, 3L, 2L, 3L, 3L, 3L, 3L, 2L, 3L, 3L, 2L, 3L, 1L, 
    3L, 3L, 3L, 2L, 3L, 3L, 3L, 1L, 2L, 3L, 3L, 3L, 2L, 3L, 3L, 
    2L, 3L, 3L, 1L, 3L, 2L, 3L, 3L, 1L, 3L, 2L, 3L, 3L, 2L, 3L, 
    3L, 1L, 3L, 3L, 1L, 3L, 3L, 3L, 2L, 3L, 3L, 3L, 1L, 3L, 3L, 
    3L, 3L, 3L, 3L, 3L, 1L, 3L, 3L, 2L, 3L, 3L, 3L, 3L, 3L, 1L, 
    3L, 3L, 3L, 3L, 2L, 1L, 1L, 3L, 3L, 3L, 3L, 1L, 3L, 3L, 3L, 
    3L, 1L, 3L, 3L, 2L, 3L, 3L, 3L, 3L, 2L, 3L, 3L, 3L, 2L, 3L, 
    3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 
    3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 2L, 3L, 3L, 
    3L, 3L, 3L, 3L, 3L, 3L, 1L, 3L, 3L, 3L, 2L, 3L, 3L, 3L, 3L, 
    3L, 2L, 3L, 3L, 1L, 2L, 3L, 3L, 2L, 3L, 3L, 3L, 3L, 2L, 3L, 
    3L, 2L, 3L, 3L, 3L, 1L, 3L, 2L, 3L, 1L, 3L, 3L, 2L, 3L, 1L, 
    3L, 2L, 2L, 3L, 2L, 1L, 2L, 2L, 1L, 2L, 2L, 2L, 2L, 3L, 2L, 
    2L, 2L, 1L, 1L, 1L, 2L, 3L, 3L, 2L, 2L, 2L, 3L, 3L, 3L, 3L, 
    3L, 1L, 2L, 2L, 1L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 3L, 1L, 
    1L, 3L, 3L, 3L, 1L, 2L, 2L, 1L, 3L, 2L, 3L, 3L, 2L), .Label = c("M", 
    "N", "Y"), class = "factor"), Employee_Turnover = structure(c(2L, 
    2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 
    2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 
    2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 
    2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 
    2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 
    2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 
    2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 
    2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 
    2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 
    2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 
    2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 
    2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 
    2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 
    2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 
    2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 1L, 
    1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 
    1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 
    1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 
    1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L), .Label = c("N", 
    "Y"), class = "factor")), class = "data.frame", row.names = c(NA, 
-284L))
– StupidWolf
  • Here what i've done XGBoostDataSet_Hr_Admin_8 <- read.csv("CompletedDataImputed_HR_Admin.csv") XGBoostDataSet_Hr_Admin_8$Salary = as.numeric(factor(XGBoostDataSet_Hr_Admin_8$Salary, levels =c('L','M', 'H', 'V'), labels =c(1,2,3,4))) library(caTools) split = sample.split(XGBoostDataSet_Hr_Admin_8$Employee_Turnover,SplitRatio = 0.7) training_set8 = subset(XGBoostDataSet_Hr_Admin_8, split==TRUE) test_set8 = subset(XGBoostDataSet_Hr_Admin_8, split==FALSE) library(xgboost) classifier9 = xgboost(data = as.matrix(training_set8[-10]), label = training_set8$Employee_Turnover, nrounds = 10) – ibocus Nov 13 '19 at 14:17
  • can you include this in the question? It's very hard for me to read it in comments – StupidWolf Nov 13 '19 at 14:19
  • @ibocus, the problem lies with the label, it's 1s and 2s, not 0s and 1s. I will edit my answer, with some code to try and hopefully solve it – StupidWolf Nov 13 '19 at 14:35
  • hey @ibocus, can you try it? – StupidWolf Nov 13 '19 at 14:42
  • thanks very much and waiting for your updated code. Between is there any solution regarding converting categorical to numerical labels to fit the xgboost classifier without getting my issue of 1s and 2s – ibocus Nov 13 '19 at 14:43
  • when fitting XGBoost to the Training Test, am getting this error Error in xgb.DMatrix(data, label = label, missing = missing) : 'data' has class 'character' and length 1791. 'data' accepts either a numeric matrix or a single filename. – ibocus Nov 13 '19 at 14:49
  • Hmmm quite weird and I have no clue honestly how you end up with that when previous times work.. Can you share a link to your data or dput it ? It's very hard for all of us to troubleshoot as you can see – StupidWolf Nov 13 '19 at 14:52
  • just let me know a platform where i can share the dataset with you – ibocus Nov 13 '19 at 15:01
  • share a google link or something like that? Haha sorry I seldom share links on SO.. or you do this, dput(head(XGBoostDataSet_Hr_Admin_8,20)), if the output is not too big, paste it in your post – StupidWolf Nov 13 '19 at 15:02
  • https://drive.google.com/open?id=11AliKO3E5KoHuv0sXZNPXVeiOSTl9yHR – ibocus Nov 13 '19 at 15:07
  • try this link...https://drive.google.com/drive/folders/11AliKO3E5KoHuv0sXZNPXVeiOSTl9yHR?usp=sharing – ibocus Nov 13 '19 at 15:10
  • @ibocus, done.. it wasn't so bad. Next time do dput like I have done, and I think your problem would have been solved in < 30min..Lol – StupidWolf Nov 13 '19 at 15:36
  • it works, thanking you very much :), i owe you this – ibocus Nov 13 '19 at 15:52
  • I'm glad it worked too :) Next time really put the data used for the model. I once made the same mistake like you, so I am always wary of it. – StupidWolf Nov 13 '19 at 15:56
  • is it normal that when adding up the numbers of the confusion matrix i get 199 when infact i have 285 records in my dataset? – ibocus Nov 13 '19 at 16:02
  • @ibocus, you ran the model on the training set, and did the confusion matrix on the training set right. Training set has 199 – StupidWolf Nov 13 '19 at 16:07
  • You have to predict your test set and see whether it holds. – StupidWolf Nov 13 '19 at 16:07
  • you are absolutely right, i just got confused.. thanks a lot for making it crystal clear for me.. thanks :) – ibocus Nov 13 '19 at 16:15
  • i was just wondering where did you accumulate this huge amount of knowledge, is it possible to share your story – ibocus Nov 13 '19 at 16:19