How can I use dummy vars in caret without destroying my target variable?

library(caret)

set.seed(5)
data <- ISLR::OJ
data <- na.omit(data)

dummies <- dummyVars(Purchase ~ ., data = data)
data2 <- predict(dummies, newdata = data)
split_factor <- 0.5
n_samples <- nrow(data2)
train_idx <- sample(seq_len(n_samples), size = floor(split_factor * n_samples))
train <- data2[train_idx, ]
test <- data2[-train_idx, ]
modelFit <- train(Purchase ~ ., method = 'lda', preProcess = c('scale', 'center'), data = train)

will fail, because the Purchase variable is missing from data2. If I instead convert it beforehand with data$Purchase <- ifelse(data$Purchase == "CH", 1, 0), caret complains that this is no longer a classification problem but a regression problem.
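Continuing from the code above, a quick check (my addition, not part of the original question) makes the first failure visible: the formula interface of dummyVars drops the response, so Purchase never reaches data2.

"Purchase" %in% colnames(data2)
# FALSE - predict(dummies, ...) returned only the encoded predictors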

Georg Heiler
  • you can just do `data2$Purchase <- data$Purchase` afterwards, can't you? – mtoto Nov 18 '16 at 13:26
  • I tried that, but it seems to distort the resulting matrix. Is it possible to pass the dummyVars from caret directly into train, as a pipeline? – Georg Heiler Nov 18 '16 at 13:28

1 Answer


The example code has a few additional issues, which are flagged in the comments in the code below. To answer your questions:

  • The result of ifelse is a numeric vector, not a factor, so train() defaults to regression
  • Passing the dummyVars output directly to train() is done with the train(x = ..., y = ..., ...) interface instead of a formula

To avoid these problems, check the class of your objects carefully.
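For instance, a quick class check along these lines (my own sketch, continuing from the question's data) makes the classification/regression switch visible:

class(data$Purchase)                        # "factor"  -> train() runs classification
class(ifelse(data$Purchase == "CH", 1, 0))  # "numeric" -> train() runs regression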

Be aware that the preProcess option in train() will apply the preprocessing to all numeric variables, including the dummies. Option 2 below avoids this by standardizing the data before calling train().
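A small illustration of that caveat (my own sketch, not from the original answer): preProcess() treats a 0/1 dummy as just another numeric column, so centering and scaling replaces the 0/1 coding with z-scores.

d <- data.frame(dummy = c(1, 0, 0, 1))
predict(preProcess(d, method = c('center', 'scale')), d)
# the 0/1 values become roughly -0.87/+0.87, which is rarely what you want for dummies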

set.seed(5)
data <- ISLR::OJ
data <- na.omit(data)

# Make sure that all variables that should be a factor are defined as such
newFactorIndex <- c("StoreID","SpecialCH","SpecialMM","STORE")
data[, newFactorIndex] <- lapply(data[,newFactorIndex], factor)

library(caret)
# See the help for dummyVars. The function does not take a dependent variable, and predict() will give an error.
# I don't include the target variable here, so predicting dummies on new data will drop unknown
# columns, including the target variable
dummies <- dummyVars(~ ., data = data[, -1])
# I don't change the data yet to apply standardization to the numeric variables, 
# before turning the categorical variables into dummies

split_factor <- 0.5
n_samples <- nrow(data)
train_idx <- sample(seq_len(n_samples), size = floor(split_factor * n_samples))

# Option 1 (as asked): Specify independent and dependent variables separately
# Note that dummy variables will be standardized by preProcess as per the original code

# Turn the categorical variables into (unstandardized) dummies
# The output of predict is a matrix; change it to a data frame
data2 <- data.frame(predict(dummies, newdata = data))

modelFit <- train(y = data[train_idx, "Purchase"], x = data2[train_idx, ], method = 'lda', preProcess = c('scale', 'center'))

# Option 2: Append the dependent variable to the independent variables (needs to be a data frame to allow mixing factors and numerics)
# Note that I also shift the preprocessing away from train() to
# avoid standardizing the dummy variables

train <- data[train_idx, ]
test <- data[-train_idx, ]

preprocessor <- preProcess(train[!sapply(train, is.factor)], method = c('center', 'scale'))
train <- predict(preprocessor, train)
test <- predict(preprocessor, test)

# Turn the categorical variables into (unstandardized) dummies
# The output of predict is a matrix; change it to a data frame
train <- data.frame(predict(dummies, newdata = train))
test <- data.frame(predict(dummies, newdata = test))

# Reattach the target variable to the training data, since it has been
# dropped by predict(dummies, ...)
train$Purchase <- data$Purchase[train_idx]
modelFit <- train(Purchase ~ ., data = train, method = 'lda')
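Not part of the original answer, but a hedged sketch of how the Option 2 fit could be evaluated on the held-out half (the target has to be reattached to the test set in the same way):

test$Purchase <- data$Purchase[-train_idx]
confusionMatrix(predict(modelFit, newdata = test), test$Purchase)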
Johannes
  • are you sure that the preprocessing would not also be applied to the categorical variables (which are now 0/1 dummy variables)? – PepitoDeMallorca Apr 02 '19 at 16:29
  • @PepitoDeMallorca That's a valid concern, although not part of the OP's problem. I've updated Option 2 to provide a solution that avoids this – Johannes Apr 03 '19 at 17:55