
My question is related to this one about categorical data (factors, in R terms) when using the caret package. I understand from the linked post that if you use the "formula interface", some features can be factors and training will work fine. My question is: how can I scale the data with the preProcess() function? If I try to do it on a data frame in which some columns are factors, I get this error message:

Error in preProcess.default(etitanic, method = c("center", "scale")) : 
  all columns of x must be numeric

See here some sample code:

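library(caret)   # provides preProcess()
library(earth)   # provides the etitanic data
data(etitanic)

a <- preProcess(etitanic, method=c("center", "scale"))  # this call raises the error above
b <- predict(a, etitanic)  # the preProcess object goes first, then the data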

Thank you.

mchangun

2 Answers


It is really the same issue as the post you link to. preProcess works only on numeric data and you have:

> str(etitanic)
'data.frame':   1046 obs. of  6 variables:
 $ pclass  : Factor w/ 3 levels "1st","2nd","3rd": 1 1 1 1 1 1 1 1 1 1 ...
 $ survived: int  1 1 0 0 0 1 1 0 1 0 ...
 $ sex     : Factor w/ 2 levels "female","male": 1 2 1 2 1 2 1 2 1 2 ...
 $ age     : num  29 0.917 2 30 25 ...
 $ sibsp   : int  0 1 1 1 1 0 1 0 2 0 ...
 $ parch   : int  0 2 2 2 2 0 0 0 0 0 ...

You can't center and scale pclass or sex as-is so they need to be converted to dummy variables. You can use model.matrix or caret's dummyVars to do this:

 > new <- model.matrix(survived ~ . - 1, data = etitanic)
 > colnames(new)
 [1] "pclass1st" "pclass2nd" "pclass3rd" "sexmale"   "age"      
 [6] "sibsp"     "parch"  

The -1 gets rid of the intercept. Now you can run preProcess on this object.
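As a minimal sketch of that last step (assuming the caret and earth packages are installed; the names `new_df`, `pp` and `scaled` are only illustrative), centering and scaling the dummy-coded data could look like this. The matrix is converted to a data frame first, which preProcess also accepts:

```r
library(caret)
library(earth)   # only needed for the etitanic data
data(etitanic)

# Expand the factors into dummy columns; "- 1" drops the intercept column
new <- model.matrix(survived ~ . - 1, data = etitanic)
new_df <- as.data.frame(new)

# Estimate the centering/scaling parameters, then apply them
pp <- preProcess(new_df, method = c("center", "scale"))
scaled <- predict(pp, new_df)

# Each column should now have mean close to 0 and standard deviation 1
round(colMeans(scaled), 10)
apply(scaled, 2, sd)
```

Note that the dummy columns are scaled as well, so they no longer hold plain 0/1 values.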

btw making preProcess ignore non-numeric data is on my "to do" list but it might cause errors for people not paying attention.
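For the dummyVars route mentioned above, a rough sketch might look like the following. It assumes caret and earth are installed; `predictors`, `dv`, `dummies` and `scaled` are illustrative names. Setting `fullRank = TRUE` drops one level per factor, which avoids the redundant third pclass column:

```r
library(caret)
library(earth)   # only needed for the etitanic data
data(etitanic)

# Work on the predictors only, so the outcome is not dummy-coded or scaled
predictors <- etitanic[, names(etitanic) != "survived"]

# Build the encoder, then apply it to get a numeric matrix
dv <- dummyVars(~ ., data = predictors, fullRank = TRUE)
dummies <- as.data.frame(predict(dv, newdata = predictors))

# All columns are numeric now, so preProcess no longer complains
pp <- preProcess(dummies, method = c("center", "scale"))
scaled <- predict(pp, dummies)
str(scaled)
```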

Max

topepo
  • I think we need only two variables for pclass (either "pclass1st, pclass2nd", "pclass2nd, pclass3rd" or "pclass3rd, pclass1st"), just as for the variable sex we kept only sexmale and discarded sexfemale. Correct me if that is not sufficient. – Sandeep Jan 07 '15 at 13:28
  • @topepo, I think the answer below does the ignoring that is on your to-do list. I would suggest adding some warnings for people who don't pay attention. – toto_tico Dec 13 '16 at 14:22

Here's a quick way to exclude factors or whatever you'd like from consideration:

library(caret)   # provides preProcess()
set.seed(1)
N <- 20
dat <- data.frame( 
    x = factor(sample(LETTERS[1:5],N,replace=TRUE)),
    y = rnorm(N,5,12),
    z = rnorm(N,-5,17) + runif(N,2,12)
)

#' Wraps preProcess so that columns of the excluded classes (factors by default)
#' are left untouched; all other columns are pre-processed and written back
ppWrapper <- function( x, excludeClasses=c("factor"), ... ) {
    whichToExclude <- sapply( x, function(y) any(sapply(excludeClasses, function(excludeClass) is(y,excludeClass) )) )
    processedMat <- predict( preProcess( x[!whichToExclude], ...), newdata=x[!whichToExclude] )
    x[!whichToExclude] <- processedMat
    x
}

> ppWrapper(dat)
   x          y           z
1  C  1.6173595 -0.44054795
2  A -0.2933705 -1.98856921
3  C  1.2177384  0.65420288
4  D -0.8710374  0.62409408
5  D -0.4504202 -0.34048640
6  D -0.6943283  0.24236671
7  E  0.7778192  0.91606677
8  D  0.2184563 -0.44935163
9  C -0.3611408  0.26075970
10 B -0.7066441 -0.23046073
11 D -1.5154339 -0.75549761
12 D  0.4504825  0.38552988
13 B  1.5692675  0.04093040
14 C  0.4127541  0.13161807
15 D  0.5426321  1.09527418
16 B -2.1040322 -0.04544407
17 C  0.6928574  1.12090541
18 B  0.3580960  1.91446230
19 E  0.3619967 -0.89018040
20 A -1.2230522 -2.24567237

You can pass anything you want into ppWrapper and it will get passed along to preProcess.
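As a sketch of that pass-through on the data from the question (assuming caret and earth are installed; ppWrapper is repeated here only so the snippet runs on its own, and `predictors` and `out` are illustrative names):

```r
library(caret)
library(earth)   # only needed for the etitanic data
data(etitanic)

# Same wrapper as above, repeated so this snippet is self-contained
ppWrapper <- function(x, excludeClasses = c("factor"), ...) {
    whichToExclude <- sapply(x, function(y) {
        any(sapply(excludeClasses, function(cl) methods::is(y, cl)))
    })
    processed <- predict(preProcess(x[!whichToExclude], ...),
                         newdata = x[!whichToExclude])
    x[!whichToExclude] <- processed
    x
}

# Drop the outcome first; it is numeric and would otherwise be scaled too
predictors <- etitanic[, names(etitanic) != "survived"]

# "method" is forwarded to preProcess through "..."
out <- ppWrapper(predictors, method = c("center", "scale"))
str(out)
```

The factor columns pclass and sex come back unchanged, while age, sibsp and parch are centered and scaled.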

Ari B. Friedman
  • Nice answer! I think you should use the example given (instead of an artificial example which might be confusing). Basically, `library(earth); data(etitanic); ppWrapper(etitanic)` – toto_tico Dec 13 '16 at 14:16