
I'm new to R and I'm trying to build a decision tree. I've already used the package party for `ctree` and rpart for `rpart`.

However, since I need to cross-validate my model, I started using the caret package, which lets me do that through the `train()` function and the method I want to use.

library(caret)
cvCtrl <- trainControl(method = "repeatedcv", repeats = 2,
                       classProbs = TRUE)

ctree.installed <- train(TARGET ~ OPENING_BALANCE + MONTHS_SINCE_EXPEDITION +
                           RS_DESC + SAP_STATUS + ACTIVATION_STATUS + ROTUL_STATUS +
                           SIM_STATUS + RATE_PLAN_SEGMENT_NORM,
                         data = trainSet,
                         method = "ctree",
                         trControl = cvCtrl)

However, my variables OPENING_BALANCE and MONTHS_SINCE_EXPEDITION have some missing values, and the function fails because of that. I don't understand why this happens, since I'm building a tree. The problem doesn't occur when I use the other packages.

This is the error:

Error in na.fail.default(list(TARGET = c(3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L,  : 
missing values in object

I didn't want to use `na.action = na.pass` since I really don't want to discard those observations.

Am I doing something wrong? Why is this happening? Do you have any suggestions for this?

Sam Mason
  • `na.action = na.pass` doesn't discard na's, it passes them on which means that if you use a `predict` function that doesn't support `NA` it will fail. `na.action = na.omit` _would_ discard those observations. – Janna Maas Apr 27 '17 at 11:42
  • Did you find this? It may be helpful too: https://stats.stackexchange.com/questions/144922/r-caret-and-nas – Janna Maas Apr 27 '17 at 11:43
  • Thank you for your answer. The problem I find is that when I use the `predict` function, the result contains far fewer observations than the test set I gave it. Let's assume the test set has 30000 observations: I only receive predictions for 20000, since 10000 of them have missing values in the input variables. – Carolina Leana Santos Apr 29 '17 at 09:34
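
To make the distinction in the comments above concrete, here is a minimal base R sketch on a made-up data frame (the names `toy`, `x1` and `x2` are purely illustrative). It only assumes `model.frame()` and the standard `na.*` functions. The `na.fail` case reproduces the error message from the question:

```r
# Toy data (hypothetical): two predictors, each with one missing value
toy <- data.frame(
  y  = factor(c("a", "b", "a", "b")),
  x1 = c(1.0, NA, 3.0, 4.0),
  x2 = c(10, 20, NA, 40)
)

# na.pass hands the data on unchanged, NAs included
passed <- model.frame(y ~ x1 + x2, data = toy, na.action = na.pass)

# na.omit drops every row that contains at least one NA
omitted <- model.frame(y ~ x1 + x2, data = toy, na.action = na.omit)

# na.fail stops with an error as soon as it meets an NA
failed <- tryCatch(
  model.frame(y ~ x1 + x2, data = toy, na.action = na.fail),
  error = function(e) conditionMessage(e)
)

nrow(passed)   # all 4 rows kept
nrow(omitted)  # only the 2 complete rows remain
failed         # "missing values in object"
```

So `na.pass` is the option that keeps the incomplete observations. Whether the model can then use them depends on the fitting and prediction functions downstream.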

1 Answer


I'll use the PimaIndiansDiabetes2 dataset from the mlbench package, which has some missing values.

data(PimaIndiansDiabetes2, package = "mlbench")
head(PimaIndiansDiabetes2)

  pregnant glucose pressure triceps insulin mass pedigree age diabetes
1        6     148       72      35      NA 33.6    0.627  50      pos
2        1      85       66      29      NA 26.6    0.351  31      neg
3        8     183       64      NA      NA 23.3    0.672  32      pos
4        1      89       66      23      94 28.1    0.167  21      neg
5        0     137       40      35     168 43.1    2.288  33      pos
6        5     116       74      NA      NA 25.6    0.201  30      neg

In `train` I set `na.action` to `na.pass` (which passes the data on unchanged, missing values included) and then set the `maxsurrogate` parameter of `ctree` through `controls`:

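library(caret)
library(party)  # provides ctree_control()

cvCtrl <- trainControl(method = "repeatedcv", repeats = 2, classProbs = TRUE)

set.seed(1234)
ctree1 <- train(diabetes ~ ., data = PimaIndiansDiabetes2,
                method    = "ctree",
                na.action = na.pass,   # keep rows with NAs instead of failing
                trControl = cvCtrl,
                # extra arguments are passed on to ctree()
                controls  = ctree_control(maxsurrogate = 2))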

The result is:

print(ctree1)
Conditional Inference Tree 

392 samples
  8 predictor
  2 classes: 'neg', 'pos' 

No pre-processing
Resampling: Cross-Validated (10 fold, repeated 2 times) 
Summary of sample sizes: 691, 692, 691, 691, 691, 691, ... 
Resampling results across tuning parameters:

  mincriterion  Accuracy   Kappa    
  0.01          0.7349111  0.4044195
  0.50          0.7485731  0.4412557
  0.99          0.7323906  0.3921662

Accuracy was used to select the optimal model using  the largest value.
The final value used for the model was mincriterion = 0.5.
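
The same `na.action` choice matters at prediction time. As far as I know, `predict()` on a `train` object defaults to `na.action = na.omit`, which would explain getting fewer predictions than test rows. The sketch below is a hedged illustration, not a definitive recipe. It assumes the caret, party and mlbench packages are installed, uses plain 5-fold cross-validation only to keep it quick, and predicts on the training data only for demonstration:

```r
library(caret)

data(PimaIndiansDiabetes2, package = "mlbench")

# Assumption: simple 5-fold CV to keep the example fast
ctrl <- trainControl(method = "cv", number = 5)

set.seed(1234)
fit <- train(diabetes ~ ., data = PimaIndiansDiabetes2,
             method    = "ctree",
             na.action = na.pass,
             trControl = ctrl,
             controls  = party::ctree_control(maxsurrogate = 2))

# Default behaviour: rows with missing predictors are dropped before predicting
pred_default <- predict(fit, newdata = PimaIndiansDiabetes2)

# Passing na.pass keeps every row of newdata
pred_all <- predict(fit, newdata = PimaIndiansDiabetes2, na.action = na.pass)

length(pred_default)  # number of complete rows only
length(pred_all)      # one prediction per row of newdata
```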
Marco Sandri
  • Hi, thanks for your answer :) why did you set the maxsurrogate parameter? – Carolina Leana Santos Apr 27 '17 at 16:40
  • My aim was to show how to pass `ctree` parameters inside `train`. In addition, `maxsurrogate` is an important parameter when there are missing values (it must be set to a positive value). – Marco Sandri Apr 27 '17 at 17:00
  • why? sorry i'm a newbie xD – Carolina Leana Santos Apr 28 '17 at 08:17
  • Section 5.2 of this document https://cran.r-project.org/web/packages/rpart/vignettes/longintro.pdf gives a brief but clear explanation of how surrogate variables and surrogate splits are used to handle missing values in CART. (If my suggestions help solve your problem, please consider upvoting my answer above: http://stackoverflow.com/help/privileges/vote-up ) – Marco Sandri Apr 28 '17 at 11:19
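
As a rough illustration of the surrogate-split idea from the rpart vignette, the sketch below fits an `rpart` tree on the built-in `airquality` data, whose predictors `Ozone` and `Solar.R` contain missing values. The choice of dataset and formula is an assumption made only for this example. With surrogates enabled, every row gets a prediction even when its primary split variable is missing:

```r
library(rpart)

# airquality ships with R; Ozone and Solar.R contain NAs
colSums(is.na(airquality))

# rpart keeps rows with missing predictors by default and records
# up to maxsurrogate backup splits per node
fit <- rpart(
  Temp ~ Ozone + Solar.R + Wind,
  data    = airquality,
  control = rpart.control(maxsurrogate = 2)
)

# Rows with a missing primary split variable are routed via surrogates
pred <- predict(fit, newdata = airquality)

length(pred) == nrow(airquality)  # one prediction per row
anyNA(pred)                       # no missing predictions
```

`summary(fit)` lists the surrogate splits chosen at each node, which is a convenient way to see which variables stand in for a missing one.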