
I'm using xgboost for a regression problem, but I'm getting an error about the response variable, which is output sales. It is numeric to begin with, yet xgboost raises an error. I want the output in numeric form only.

labels <- train$Item_Outlet_Sales   # train label
ts_label <- test$Item_Outlet_Sales  # test label

# converted into matrix ( one hot encoding )
new_tr <- model.matrix(~.+0,data = train[,-c("Item_Outlet_Sales"),with=F])
new_ts <- model.matrix(~.+0,data = test[,-c("Item_Outlet_Sales"),with=F])

## checking class
class(labels)
[1] "numeric"

I created the label (response variable) in the test set with test$Item_Outlet_Sales <- NA

class(test$Item_Outlet_Sales)
[1] "logical"

# converting `ts_label` to numeric, as it is initially logical
ts_label <- as.numeric(ts_label)-1
class(ts_label)
[1] "numeric"

Now I build the DMatrix objects and train the model:

 dtrain1 <- xgb.DMatrix(data = new_tr,label = labels) 
 dtest1 <- xgb.DMatrix(data = new_ts,label= ts_label)

 xgbmodel1 = xgb.train(data=dtrain1, nround=150, max_depth=5, eta=0.1,  subsample=0.9, 
                       objective="reg:logistic", booster="gbtree", eval_metric="rmse")

Error:

Error in xgb.iter.update(bst$handle, dtrain, iteration - 1, obj) : 
  [14:08:41] amalgamation/../src/objective/regression_obj.cc:108: 
  label must be in [0,1] for logistic regression

I then tried this:

xgbmodel1 = xgb.train(data=dtrain1, nround=150, max_depth=5, eta=0.1,  subsample=0.9, 
                      objective="reg:linear", booster="gbtree", eval_metric="rmse")

I got all values of the response variable equal to -1, and my RMSE score is infinite.

Please tell me how to apply xgboost properly in this case, even with default settings, so that no error occurs.

I have 4 categorical variables in this dataset.

Here is a subset of the train dataset:

r <- train[1:3,]
r

   Item_Identifier Item_Fat_Content   Item_Type Item_MRP Outlet_Identifier
1:           FDA15          Low Fat       Dairy 249.8092            OUT049
2:           DRC01          Regular Soft Drinks  48.2692            OUT018
3:           FDN15          Low Fat        Meat 141.6180            OUT049
   Outlet_Establishment_Year Outlet_Location_Type       Outlet_Type Item_Outlet_Sales
1:                      1999               Tier 1 Supermarket Type1         3735.1380
2:                      2009               Tier 3 Supermarket Type2          443.4228
3:                      1999               Tier 1 Supermarket Type1         2097.2700
   Item_Weight Item_Visibility Outlet_Size
1:        9.30      0.01604730           2
2:        5.92      0.01927822           2
3:       17.50      0.01676007           2

jatin singh

1 Answer


I see two problems here:

  1. With objective="reg:logistic", the algorithm expects labels in [0, 1] (see the error message). Your code instead sets them to 0 or -1. Correct the line that defines the ts_label variable as follows:

    ts_label <- as.numeric(ts_label)
    
  2. You have a binary target and categorical predictors. Why do you want to do logistic regression? I feel "binary:logistic" may be a better objective here. "reg:linear" makes no sense, and your loss function should be based on accuracy rather than RMSE.
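
If the target is in fact continuous (the question reports class "numeric" for the training labels), a regression objective would be the usual choice instead. The following is only a minimal sketch under stated assumptions: the toy data frame and its values are hypothetical stand-ins for the real train/test sets, and "reg:squarederror" is the name used by recent xgboost releases for the squared-error objective (older releases call it "reg:linear"). The test DMatrix is built without a label, since labels are not needed for prediction.

```r
library(xgboost)

set.seed(42)

# Hypothetical toy data standing in for the real dataset (not the asker's data)
n <- 200
toy <- data.frame(
  Item_MRP    = runif(n, 30, 270),
  Outlet_Type = factor(sample(c("Grocery", "Supermarket"), n, replace = TRUE))
)
toy$Item_Outlet_Sales <- 15 * toy$Item_MRP *
  ifelse(toy$Outlet_Type == "Grocery", 0.2, 1) + rnorm(n, sd = 50)

train_idx <- 1:150

# One-hot encode the predictors only; the target stays out of the matrix
X <- model.matrix(~ . + 0, data = toy[, c("Item_MRP", "Outlet_Type")])

# Continuous sales values go in unchanged as the training label
dtrain <- xgb.DMatrix(data = X[train_idx, ],
                      label = toy$Item_Outlet_Sales[train_idx])

# No label for the rows to be predicted
dtest <- xgb.DMatrix(data = X[-train_idx, ])

params <- list(
  objective   = "reg:squarederror",  # assumption: recent xgboost; older: "reg:linear"
  eval_metric = "rmse",
  max_depth   = 5,
  eta         = 0.1,
  subsample   = 0.9
)

model <- xgb.train(params = params, data = dtrain, nrounds = 150)

# Predictions are on the same scale as the training label
pred <- predict(model, dtest)
```

With this setup the predictions come back as plain numeric values, so no conversion of a placeholder NA column is involved.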

gung - Reinstate Monica
Damiano Fantini
  • I don't have a binary target. It is a regression problem, and I have to predict the total output sales for each item. I am applying xgboost for the first time to increase accuracy. Please tell me what changes to make to implement it for regression. – jatin singh Aug 26 '17 at 12:50
  • This is a practice problem and the evaluation metric is RMSE, so I wanted to set it accordingly to see what effect parameter tuning can have. – jatin singh Aug 26 '17 at 12:53
  • I mean, you mentioned that your "label" variable is logical, which means TRUE and FALSE. I don't see how that can be used for regression. – Damiano Fantini Aug 26 '17 at 13:00
  • At the top I mention the error, which asks for the label to be 0 or 1. I only added a response column to the test data, as it was not there, with test$Item_Outlet_Sales <- NA. I think the error is in the one-hot encoding lines, as there are not enough examples to apply regression. – jatin singh Aug 26 '17 at 13:04
  • Can you share your data or a part of it? – Damiano Fantini Aug 26 '17 at 13:31
  • Sure, r <- train[1:3,] – jatin singh Aug 26 '17 at 15:00