
I have an unbalanced training data set and want to put extra weight on my minority class ("bad"), which is the class to be predicted, and then pass those weights to the rpart command.

My data frame looks something like this:

> head(train)
   case     V1      V2         V3        V4
1  bad        a      LL         AUT       1
2 good        b      LL         AUT       3
3 good        b      LL         AUT       2
4 good        b      LL         MAN       1
5 good        c      RL         AUT       2
6 good        b      LL         AUT       3

Now I try to put weight on my "bad" cases:

caseweights <- train$case[train$case == "bad"]
> tree <- rpart(train$case ~ ., train, 
+               method = "class", 
+               minsplit =1, minbucket=1, maxdepth=3, 
+               parms = list(split = "gini"), 
+               cp=-1, weight = caseweights)

But it gives me this error:

Error in model.frame.default(formula = train$case ~ ., data = train, : Variablenlängen sind unterschiedlich (gefunden für '(weights)')

The message is in German and basically says that the variable lengths differ (found for '(weights)').

So I checked how long my data sets are:

> nrow(train)
[1] 11525
> nrow(caseweights)
NULL                       #  <---------- Why NULL?

When I look at caseweights, I see a vector with about 420 entries of "bad". Where is my reasoning going wrong?

Bhargav Rao
pineapple
  • 2
    The `weight` argument should be a vector with the same length as the number of rows of your data. So, there should be a weight for *every* observation. You've selected a vector whose length only corresponds to the "bad" cases. It also doesn't appear to be numeric, which would be a natural thing for weights to be. – joran Oct 09 '18 at 14:50
  • 1
    Your vector (`caseweights <- train$case[train$case == "bad"]`) doesn't have rows. You should use `length(caseweights)` instead. Also, your formula should be `case ~ .`, as you've already provided train dataset to build your model. – AntoniosK Oct 09 '18 at 14:55
  • 2
    An example of how you might generate a weight vector to upweight the "bad" observations might be `caseweights <- ifelse(train$case == "bad",0.75,0.25)`, where I choose the weights arbitrarily. – joran Oct 09 '18 at 14:56
  • Thank you both for your answers! @joran just one question: you put a weight of 0.75 on the "bad" observations? – pineapple Oct 10 '18 at 06:49
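
The suggestions in the comments above can be put together as a minimal sketch. It assumes the rpart package is installed and uses a small made-up data frame in place of the real `train` (the toy values and the 20/180 class split are hypothetical). The weights 0.75 and 0.25 are the arbitrary values from joran's comment, not a recommendation. The formula is written as `case ~ .` as AntoniosK suggests, and the tuning parameters from the question are passed through `rpart.control`.

```r
library(rpart)

# Hypothetical toy data standing in for `train` (not the asker's real data)
set.seed(1)
n_bad  <- 20
n_good <- 180
n <- n_bad + n_good
train <- data.frame(
  case = factor(rep(c("bad", "good"), times = c(n_bad, n_good))),
  V1   = factor(sample(c("a", "b", "c"), n, replace = TRUE)),
  V2   = factor(sample(c("LL", "RL"), n, replace = TRUE)),
  V3   = factor(sample(c("AUT", "MAN"), n, replace = TRUE)),
  V4   = sample(1:3, n, replace = TRUE)
)

# One numeric weight per row of train; 0.75 / 0.25 are arbitrary values
caseweights <- ifelse(train$case == "bad", 0.75, 0.25)

# A plain vector has no rows, so nrow() gives NULL; length() is the right check
is.null(nrow(caseweights))
length(caseweights) == nrow(train)

# Fit the tree with the response named by column, not as train$case
tree <- rpart(case ~ ., data = train,
              weights = caseweights,
              method = "class",
              parms = list(split = "gini"),
              control = rpart.control(minsplit = 1, minbucket = 1,
                                      maxdepth = 3, cp = -1))
```

Because `caseweights` now has one entry per observation, `model.frame` no longer complains about differing variable lengths. Only the ratio between the two weights should matter for the fitted splits, so 3 and 1 would be an equivalent choice here.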

0 Answers