library(adabag)

df1 <- data.frame(ID = c(1, 2, 3, 4, 5),
                  var1 = c('a', 'b', 'c', 'd', 'e'),
                  var2 = c(1, 1, 0, 0, 1))

ada <- boosting(formula = var1 ~ ., data = df1)

Error in cbind(yval2, yprob, nodeprob) : 
  the number of rows of the matrices must match (see arg 2)

Hi everyone, I'm trying to use the boosting function from the adabag package, but it tells me that the number of rows of the matrices must match. This is not my original data, but it seems to throw the same error.

Could you help me?

Thank you.

2 Answers


You should not use ID as an explanatory variable.
Unfortunately your df1 dataset is too small, so it is not possible to tell whether ID is the source of your problem.
Below I generate a bigger data set:

library(adabag)
set.seed(1)
n <- 100
df1 <- data.frame(ID = 1:n,
                  var1 = sample(letters[1:5], n, replace=T),
                  var2 = sample(c(0,1), n, replace=T))
head(df1)
#   ID var1 var2
# 1  1    b    1
# 2  2    b    0
# 3  3    c    0
# 4  4    e    1
# 5  5    b    1
# 6  6    e    0

ada <- boosting(var1~var2, data=df1)

ada.pred <- predict.boosting(ada, newdata=df1)
ada.pred$confusion
#                Observed Class
# Predicted Class  a  b  c  d  e
#               b  5 20  2  7 11
#               c  2  2 10  2  2
#               d  6  3  7 17  4
Marco Sandri
  • My original data has 14 observations with 113 variables (I tried with all of them, and also with only 10 of them), all numeric except for Class, which is a factor with levels (-1, 1), and it still crashes. The function doesn't return any clue beyond the error quoted above, so I don't know how to proceed from here, but thank you a lot for your testing. – Pablo Lozano Feb 13 '18 at 11:26
  • Could you share your 14 x 113 dataset? – Marco Sandri Feb 13 '18 at 11:34

Pablo, if we take a closer look at your sample data, we notice a property that makes it impossible for the classification algorithm to handle. Your dataset consists of five samples, each with a unique label for the dependent variable: a, b, c, d, e. The dataset has only one feature (the independent variable var2, since ID should be excluded from the feature list), and it takes just two values: 0 and 1. This means several labels of the dependent variable correspond to the same value of the single predictor, so the classes cannot be separated. When the algorithm tries to build a model on such data, it runs into this degenerate structure and throws the error (number of rows of the matrices must match (see arg 2)).
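One way to see the degeneracy is to fit the underlying tree directly with rpart instead of going through boosting. This is a minimal sketch, assuming the same toy df1 as in the question; depending on your rpart version it may either throw a similar error or simply return a root-only tree with no splits:

```r
library(rpart)

# Same toy data as in the question: five rows, five distinct labels,
# one binary predictor.
df1 <- data.frame(ID   = c(1, 2, 3, 4, 5),
                  var1 = c('a', 'b', 'c', 'd', 'e'),
                  var2 = c(1, 1, 0, 0, 1))

# With five classes, five rows and a single two-valued feature, rpart has
# no useful split available, so the fitted tree degenerates.
tree <- rpart(var1 ~ var2, data = df1, method = "class")
print(tree)
```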

Marco's data, instead, has some healthy diversity: in the first six rows shown, there are only three labels (b, c, e) and two classes (0, 1), and the full dataset has 100 samples. The data set is diverse and large enough for the algorithm to handle it.

So, in order to use adabag's boosting (which grows classification trees with rpart under the hood), you should make your data more diverse and reliable. Good luck!
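If collecting more data is not an option, one thing you might try is loosening the tree-growing constraints that boosting forwards to rpart via its control argument. This is only a sketch on synthetic data similar to Marco's example, not a guaranteed fix for your 14 x 113 dataset:

```r
library(adabag)
library(rpart)

set.seed(1)
n <- 100
df1 <- data.frame(var1 = factor(sample(letters[1:5], n, replace = TRUE)),
                  var2 = sample(c(0, 1), n, replace = TRUE))

# boosting() passes `control` to the underlying rpart calls; minsplit = 2
# and a negative cp allow splits even on very small node sizes.
ada <- boosting(var1 ~ var2, data = df1,
                mfinal = 10,
                control = rpart.control(minsplit = 2, cp = -1))
```

Whether relaxing minsplit/cp helps depends on the data; with only 14 observations and 113 predictors, a degenerate fit remains likely.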

Alex