
I use the xgboost function in R, and I get the following error message:

bst <- xgboost(data = germanvar, label = train$Creditability, max.depth = 2, eta = 1,nround = 2, objective = "binary:logistic")

Error in xgb.get.DMatrix(data, label, missing, weight) : 
  xgboost only support numerical matrix input,
           use 'data.matrix' to transform the data.
In addition: Warning message:
In xgb.get.DMatrix(data, label, missing, weight) :
  xgboost: label will be ignored.

The following is my full code:

credit<-read.csv("http://freakonometrics.free.fr/german_credit.csv", header=TRUE)
library(caret)
set.seed(1000)
intrain<-createDataPartition(y=credit$Creditability, p=0.7, list=FALSE) 
train<-credit[intrain, ]
test<-credit[-intrain, ]


germanvar<-train[,2:21]
str(germanvar)
bst <- xgboost(data = germanvar, label = train$Creditability, max.depth = 2, eta = 1,
               nround = 2, objective = "binary:logistic")

The data has a mixture of continuous and categorical variables.

However, because of the error message that only continuous variables can be used, all the variables were recognized as continuous, but the error message reappears.

How can I solve this problem?

  • *"because of the error message that only continuous variables can be used, all the variables were recognized as continuous, but the error message reappears"* is incorrect. This is what happens: because you have non-continuous variables, you get an error telling you to only use continuous variables. This stops the program. – Gregor Thomas Jul 11 '17 at 05:28
  • The error message is very nice: it tells you you can't have non-continuous variables. The solution is to code your categorical variables as numeric. The most common way to do this is called "one-hot encoding" or "dummy variables". `model.matrix()` is a function that helps you do this - you can find many examples on Stack Overflow searching for "[r] dummy variables" or in the help at `?model.matrix` (see the sketch after this comment thread). – Gregor Thomas Jul 11 '17 at 05:29
  • @Gregor Oh, thank you. But do I have to change them to dummy variables? Is there a problem if I just force the factor-type variables to be treated as int? – 신익수 Jul 11 '17 at 06:03
  • xgboost uses decision trees, which look for cut points in continuous data. If your factor is ordered, that is A < B < C < D, then that can make sense because a node in a tree will pick a single cut point. If your factor is not ordered, then dummy variables will make more efficient use of the information. – Gregor Thomas Jul 11 '17 at 06:10
  • @Gregor Thank you for the good information :). Have a nice day! – 신익수 Jul 11 '17 at 06:14
  • @Gregor Sorry to bother you again. If the factor data is not ordinal, is it okay to force it to be treated as an int variable? Of course, it is more effective to use dummy variables as you said. – 신익수 Jul 11 '17 at 06:24
  • It's not great. Say level B is special and needs to be separated from the rest: it will take one node to do that using dummy variables, but a minimum of two nodes using an int, if it is recognized at all. This is the efficiency loss. – Gregor Thomas Jul 11 '17 at 06:27
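
A minimal sketch of the dummy-variable ("one-hot") approach described in the comments above, using model.matrix() on the same train split; which columns are actually categorical (cat_idx below) is an assumption and should be adjusted to the real data:

library(xgboost)
germanfac <- train[, 2:21]
cat_idx <- c(1, 4, 10)                                    # hypothetical indices of categorical columns
germanfac[cat_idx] <- lapply(germanfac[cat_idx], factor)  # convert them to factors
X <- model.matrix(~ . - 1, data = germanfac)              # expand factors into 0/1 dummy columns
bst_onehot <- xgboost(data = X, label = as.numeric(train$Creditability),
                      max.depth = 2, eta = 1, nround = 2,
                      objective = "binary:logistic")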

1 Answer


If you have categorical variables that are represented as numbers, it is not an ideal representation, but with deep enough trees you can get away with it: the trees will partition them eventually. I don't prefer that approach, but it keeps your columns minimal and can succeed given the right setup.

Note that xgboost takes a numeric matrix as data, and a numeric vector as label.

NOT INTEGERS :)

The following code will train with the inputs cast properly:

credit<-read.csv("http://freakonometrics.free.fr/german_credit.csv", header=TRUE)
library(caret)
set.seed(1000)
intrain<-createDataPartition(y=credit$Creditability, p=0.7, list=FALSE) 
train<-credit[intrain, ]
test<-credit[-intrain, ]


germanvar<-train[,2:21]
label <- as.numeric(train$Creditability) ## make it a numeric NOT integer
data <-  as.matrix(germanvar)  # to matrix
mode(data) <- 'double'  # to numeric i.e double precision


bst <- xgboost(data = data, label = label, max.depth = 2, eta = 1,
               nround = 2, objective = "binary:logistic")
  • In conclusion, can I just convert categorical variables to numeric? Is there a problem with that? One of the biggest advantages of boosting is that categorical variables can be used directly, so gbm was able to use factor-type variables. xgboost seems to have a drawback in this respect, right? – 신익수 Jul 12 '17 at 01:00
  • Yes -- xgboost only deals with numeric data. You can leave your categoricals as numeric and probably get OK results, or dummy encode them. Depending on the data, the results will vary. – T. Scharf Jul 12 '17 at 06:12
  • The model worked well. So how do I measure accuracy using the test data? (See the sketch after this thread.) – 신익수 Jul 12 '17 at 11:20
  • Please accept the answer above if you found it helpful, then possibly open a new question if you would like! Glad to help. – T. Scharf Jul 12 '17 at 16:37
  • FYI: the fix in github.com/dmlc/xgboost/pull/2237 allows integer matrices to be used directly as xgb.DMatrix input. However, it's not on CRAN yet. – Vadim Khotilovich Jul 13 '17 at 18:33
  • Thanks @Vadim -- didn't know this – T. Scharf Jul 13 '17 at 19:54
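
Following up on the accuracy question in the comments, a minimal sketch of scoring the held-out test set, assuming bst was trained exactly as in the answer above; the 0.5 classification cutoff is an assumption:

test_data <- as.matrix(test[, 2:21])                  # same columns used for training
mode(test_data) <- 'double'                           # cast to numeric, as for the training matrix
pred_prob  <- predict(bst, test_data)                 # binary:logistic returns predicted probabilities
pred_class <- as.numeric(pred_prob > 0.5)             # classify with an assumed 0.5 cutoff
accuracy   <- mean(pred_class == test$Creditability)  # proportion of correct predictions
accuracy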