
I want to fit an xgboost model in R using the `xgb.train` function.

As I understand it, `xgb.train` requires the input data to be converted with the `xgb.DMatrix` function first.

But when I applied this function to my data set, I got an error message:

Error in xgb.DMatrix(data = as.matrix(train)) : 
  [09:01:01] amalgamation/../dmlc-core/src/io/local_filesys.cc:66: LocalFileSystem.GetPathInfo 1 Error:No such file or directory

My full R code follows. How should I transform the input data so that `xgb.DMatrix` accepts it?

credit <- read.csv("http://freakonometrics.free.fr/german_credit.csv", header = TRUE)
# Columns to treat as categorical
factor_cols <- c(1,2,4,5,7,8,9,10,11,12,13,15,16,17,18,19,20,21)
for (i in factor_cols) credit[, i] <- as.factor(credit[, i])
str(credit)

library(caret)
library(xgboost)
set.seed(1000)
# 70/30 train/test split
intrain <- createDataPartition(y = credit$Creditability, p = 0.7, list = FALSE)
train <- credit[intrain, ]
test <- credit[-intrain, ]

# This is the call that raises the error shown above
d_train <- xgb.DMatrix(data = as.matrix(train))
이순우
  • Apparently, this error comes from having non-numeric variables in `train` (see this [question](https://stackoverflow.com/questions/38186478/peculiar-installation-warning-causing-packages-to-malfunction)). You can add `read.csv(..,colClasses="numeric")` and remove the lines where you turn some variables into factors, and it should work. – Lamia Aug 11 '17 at 00:39
  • @Lamia Should I use only numeric variables? Then how do I use factor-type variables? – 이순우 Aug 11 '17 at 02:08
  • Yes, `xgb.DMatrix` takes only numeric variables as input, so you shouldn't transform them into factors. – Lamia Aug 11 '17 at 02:09
  • If you intend to use all variables, create dummy variables for your categorical variables after pulling in the data. Use the `dummy.data.frame` function from the `dummies` package in R. – Learner_seeker Aug 11 '17 at 03:38
  • @Pb89 So I cannot use the raw factor-type variables and have to do one-hot encoding? – 이순우 Aug 11 '17 at 04:08
  • Yes, one-hot encoding. From a modelling standpoint it's also better: you'll be able to see the relevance or predictive power of these variables at the level of individual categories. – Learner_seeker Aug 11 '17 at 04:47
  • All models, whether linear or tree-based (random forests, gradient boosting, etc.), require an all-numeric X matrix and a numeric Y vector to do the computations. In most other R functions, the formula syntax converts your data frame into an X matrix and Y vector automatically. xgboost, however, requires that you pass it an X matrix and Y vector directly, so you have to do the conversion yourself (one-hot encoding, a.k.a. creating dummies). – Mark Nielsen Oct 08 '17 at 05:10
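
The diagnosis in the comments can be checked in base R without xgboost. The sketch below uses a small hypothetical data frame (`df`, with made-up values standing in for the credit data) to show that `as.matrix()` returns a character matrix as soon as one column is a factor. The "No such file or directory" text in the error suggests that `xgb.DMatrix` then treated that character input as a file path, although the exact behavior may depend on the xgboost version.

```r
# Hypothetical toy data frame standing in for the credit data
df <- data.frame(
  Creditability = factor(c(1, 0, 1, 0)),
  Duration      = c(18, 9, 12, 12),
  Purpose       = factor(c("car", "tv", "car", "repair"))
)

# With factor columns present, as.matrix() falls back to a character matrix
m_raw <- as.matrix(df)
is.character(m_raw)   # TRUE

# With only numeric columns, the result is a numeric matrix
m_num <- as.matrix(df["Duration"])
is.numeric(m_num)     # TRUE
```

Checking `is.numeric(as.matrix(train))` before calling `xgb.DMatrix` is a quick way to catch this situation.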

1 Answer


If you still want to use factors, you can use the `model.matrix()` function to convert them to dummy variables.

For example:

my.dat <- mtcars[c("mpg","cyl","disp")]
my.dat$cyl <- as.factor(my.dat$cyl)
# Convert data frame to X matrix
x.train <- model.matrix(mpg~.,data=my.dat)
head(x.train)

Output:

                  (Intercept) cyl6 cyl8 disp
Mazda RX4                   1    1    0  160
Mazda RX4 Wag               1    1    0  160
Datsun 710                  1    0    0  108
Hornet 4 Drive              1    1    0  258
Hornet Sportabout           1    0    1  360
Valiant                     1    1    0  225

This creates the dummy variables `cyl6` and `cyl8`, with 4-cylinder vehicles as the base group (rows where `cyl6 = 0` and `cyl8 = 0`).

Then you can pass this matrix into the xgb.DMatrix function:

library(xgboost)
# The label is the response variable used in the formula above
d_train <- xgb.DMatrix(x.train, label = my.dat$mpg)
Mark Nielsen
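
As a minimal sketch of how the same idea might carry over to a train/test split like the one in the question, the example below encodes the factors once and splits afterwards, so both matrices share the same columns. The data frame `credit_toy` and its values are hypothetical stand-ins, not the real German credit data, and the label is assumed to be kept as numeric 0/1 rather than converted to a factor. The xgboost calls are guarded so the encoding part runs even where the package is not installed.

```r
# Hypothetical toy data standing in for the credit data (values are made up)
n <- 20
credit_toy <- data.frame(
  Creditability = rep(c(1, 0, 1, 1), length.out = n),  # numeric 0/1 label
  Duration      = seq(6, by = 2, length.out = n),
  Purpose       = factor(rep(c("car", "tv", "repair"), length.out = n))
)

# One-hot encode before splitting; "- 1" drops the intercept column
x_all <- model.matrix(Creditability ~ . - 1, data = credit_toy)
y_all <- credit_toy$Creditability

# Simple 70/30 split on row indices
set.seed(1000)
idx     <- sample(n, size = 14)
x_train <- x_all[idx, , drop = FALSE]
x_test  <- x_all[-idx, , drop = FALSE]
y_train <- y_all[idx]
y_test  <- y_all[-idx]

# Build the DMatrix objects only if xgboost is available
if (requireNamespace("xgboost", quietly = TRUE)) {
  d_train <- xgboost::xgb.DMatrix(data = x_train, label = y_train)
  d_test  <- xgboost::xgb.DMatrix(data = x_test, label = y_test)
}
```

Encoding before the split avoids a mismatch when a rare factor level appears in only one of the two subsets.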