
I'm having a lot of trouble figuring out how to correctly set num_class for xgboost.

I've got an example using the Iris data:

df <- iris

# Use Species as the label; relabel its factor levels as 1, 2, 3
y <- df$Species
num.class <- length(levels(y))
levels(y) <- 1:num.class
head(y)

# Keep only the four numeric feature columns
df <- df[, 1:4]

y  <- as.matrix(y)
df <- as.matrix(df)

param <- list("objective" = "multi:softprob",    
          "num_class" = 3,    
          "eval_metric" = "mlogloss",    
          "nthread" = 8,   
          "max_depth" = 16,   
          "eta" = 0.3,    
          "gamma" = 0,    
          "subsample" = 1,   
          "colsample_bytree" = 1,  
          "min_child_weight" = 12)

model <- xgboost(param=param, data=df, label=y, nrounds=20)

This returns an error

Error in xgb.iter.update(bst$handle, dtrain, i - 1, obj) : 
SoftmaxMultiClassObj: label must be in [0, num_class), num_class=3 but found 3 in label

If I change num_class to 2 I get the same error. If I increase num_class to 4 then the model runs, but I get 600 predicted probabilities back (150 rows × 4 classes), which makes sense for 4 classes.

I'm not sure if I'm making an error or whether I'm failing to understand how xgboost works. Any help would be appreciated.

House

4 Answers


label must be in [0, num_class), i.e. the labels have to start at 0. In your script, add y <- y - 1 before the model <- ... line.
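
A minimal sketch of the complete fix, folding in the comments below (in the question's script, y is a character matrix at this point, so it has to be converted to numeric as well):

# Shift the labels from "1","2","3" to 0, 1, 2 so they lie in [0, num_class)
y <- as.numeric(y) - 1

model <- xgboost(param = param, data = df, label = y, nrounds = 20)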

RustamA
  • I tried adding `y – House Mar 20 '16 at 03:00
  • If you have 3 classes, num_class=3 and the classes start from 0. – RustamA Mar 20 '16 at 06:24
  • You get the error because y in your script is character. Use y <- as.numeric(y) - 1. – RustamA Mar 20 '16 at 06:33
  • I can't believe that solved it. From this fix it looks like xgboost only accepts labels that are sequentially numbered and start from 0, so if there are n levels then the last label will always be n-1. – House Mar 21 '16 at 08:59
  • This worked for me in that it got my xgb model to run. As it's running, I'm wondering whether this implies I have to make a similar change to the actual predictions. I'm predicting across 100 levels of the target class. Now that I used y <- y - 1, do I have to add 1 to my predicted class? So if my first prediction is class 10, should I change it to 11? (See the sketch after these comments.) – Doug Fir Sep 05 '17 at 12:09
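
A sketch of the mapping Doug Fir asks about (hypothetical; it assumes the objective multi:softmax, which returns 0-based class indices, rather than multi:softprob):

# Training labels were shifted down by 1, so shift the predicted
# class indices back up by 1 to recover the original numbering
pred <- predict(model, df)   # values in 0 .. num_class - 1
pred_original <- pred + 1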

I ran into this rather weird problem as well. In my case it seemed to be the result of not properly encoding the labels.

First, using a string vector with N classes as the labels, I could only get the algorithm to run by setting num_class = N + 1. However, this result was useless, because I only had N actual classes and N+1 buckets of predicted probabilities.

I re-encoded the labels as integers and then num_class worked fine when set to N.

# Convert classes to integers for xgboost (t1 is the training data.table)
library(data.table)
class <- data.table(interest_level = c("low", "medium", "high"), class = c(0, 1, 2))
t1    <- merge(t1, class, by = "interest_level", all.x = TRUE, sort = FALSE)

and

param <- list(booster = "gbtree",
              objective = "multi:softprob",
              eval_metric = "mlogloss",
              # nthread = 13,
              num_class = 3,
              eta_decay = .99,
              eta = .005,
              gamma = 1,
              max_depth = 4,
              min_child_weight = .9,
              subsample = .7,
              colsample_bytree = .5)

For example.
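
To round this out, a hedged sketch of how training could look with those re-encoded labels (t1 is carried over from the snippet above; feat.cols, naming the feature columns, is hypothetical):

library(xgboost)

# feat.cols (hypothetical) names the feature columns of t1;
# t1$class holds the 0-based integer labels created above
dtrain <- xgb.DMatrix(data  = as.matrix(t1[, feat.cols, with = FALSE]),
                      label = t1$class)
model  <- xgb.train(params = param, data = dtrain, nrounds = 100)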

Hack-R

I was seeing the same error; my issue was that I was using an eval_metric that is only meant for multiclass labels when my data had binary labels. See eval_metric in the Learning Task Parameters section of the XGBoost docs for a list of all of the options.
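
For illustration, a minimal sketch of the distinction (the parameter values are arbitrary):

# Binary labels (0/1): binary objective and metric, no num_class
param_binary <- list(objective   = "binary:logistic",
                     eval_metric = "logloss")

# K classes labelled 0 .. K-1: multiclass objective, metric, and num_class = K
param_multi  <- list(objective   = "multi:softprob",
                     eval_metric = "mlogloss",
                     num_class   = 3)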

Vito

I had this problem, and it turned out that I was trying to subtract 1 from my response, which was already coded as 0 and 1. Probably a novice mistake, but in case anyone else runs into this with a binary response variable that is already 0 and 1, it is something to make note of.

The tutorial said:

label = as.integer(iris$Species) - 1

What worked for me (the response is high_end, already coded as 0/1):

label = as.integer(high_end)
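
A quick sanity check along these lines (a sketch; label stands for whatever vector is passed to xgboost):

# Labels should already span 0 .. num_class - 1 (here 0 and 1);
# only subtract 1 if they start at 1 instead
table(label)
stopifnot(min(label) == 0)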
Dulaj Kulathunga