
I am working with a dataset of approximately 150,000 rows and 25 columns. The data consist of numerical and factor variables. The factor variables contain both text and numbers, and I need all of them. The dependent variable is a factor with 20 levels. I am trying to fit an SVM to these data using the kernlab package in R.

library(kernlab)
n<- nrow(x)
trainInd<- sort(sample(1:nrow(x), n*.8))
xtrain<- x[trainInd,]
xtest<- x[-trainInd,]
ytrain<- y[trainInd]
ytest<- y[-trainInd]
modelclass<- ksvm(x=as.matrix(xtrain), y=as.matrix(ytrain),
              scaled = TRUE, type="C-svc", kernel = "rbfdot",
              kpar="automatic", C=1, cross=0) 

Following the code, I get this error:

Error in if (any(co)) { : missing value where TRUE/FALSE needed
In addition: Warning messages:
1: In FUN(newX[, i], ...) : NAs introduced by coercion

The xtrain data frame looks like:

Length    Gender    Age    Day    Hour     Duration    Period
  5         1       80      5      11         20          3
 0.2        2       35      2      18         10          5    
 1.1        2       55      1      15         120         4

The Gender, Day, and Period variables are categorical (factors), while the rest are numerical.

I have gone through similar questions and been through my dataset as well, but I cannot identify any NA values or other mistakes.

I assume that I am doing something wrong with the variable types, particularly the factors. I am unsure how to use them, but I can't see anything wrong. Any help on how to solve the error, and possibly on how to model factors together with numerical variables, would be appreciated.

J.Con
Alex
  • We can't help you unless you provide us with the entire `x` data frame, or provide a reproducible example that triggers the error. – CPak Jul 02 '17 at 22:24
  • @ChiPak Thanks for the comment. I have added some detail about my data. – Alex Jul 02 '17 at 23:11

2 Answers


The reason for this error message is that the SVM implementations in kernlab and e1071 cannot deal with predictors of type factor when they are passed through the x, y (matrix) interface.
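To illustrate where the "NAs introduced by coercion" warning most likely comes from, here is a minimal sketch using a made-up three-row data frame (the values are hypothetical, not the asker's data). It shows that `as.matrix()` on a data frame containing a factor returns a character matrix, and that converting text labels back to numbers produces `NA`.

```r
# Toy data frame (hypothetical values) mixing numeric and factor columns
x <- data.frame(
  Length = c(5, 0.2, 1.1),
  Gender = factor(c("1", "2", "2")),
  Day    = factor(c("Fri", "Tue", "Mon"))
)

# A single factor column is enough to make the whole matrix character
m <- as.matrix(x)
is_char <- is.character(m)

# Turning text labels back into numbers yields NA (normally with a warning)
day_num <- suppressWarnings(as.numeric(m[, "Day"]))
n_na <- sum(is.na(day_num))
```

Here `is_char` is `TRUE` and all three `Day` values become `NA`, which is consistent with the error in the question even though the original data frame contains no missing values.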

The solution is to convert the predictors which are factors by one-hot-encoding. Then there are two cases:

Case 1: formula interface

With the formula interface, train(form = formula, ...), the one-hot encoding is done implicitly.
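As a rough sketch of the formula route applied directly to kernlab, the example below calls `ksvm()` with a formula on the built-in `iris` data plus one invented factor predictor (`Size` is a hypothetical column added only for illustration). It assumes kernlab is installed and skips the fit otherwise.

```r
# Built-in iris data with one extra factor predictor (hypothetical, for illustration)
df <- iris
df$Size <- factor(ifelse(df$Sepal.Length > 5.8, "big", "small"))

fit <- NULL
if (requireNamespace("kernlab", quietly = TRUE)) {
  # With a formula, the factor column is expanded to indicator columns internally
  fit <- kernlab::ksvm(Species ~ ., data = df,
                       type = "C-svc", kernel = "rbfdot", C = 1)
}
```

No `as.matrix()` call is needed here, so the factor column never gets coerced to character.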

Case 2: x,y interface

When using the format train(x = features, y = target, data = dataset, ...), you must perform the one-hot encoding explicitly!

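A simple way to do this is: features <- model.matrix(~ . - 1, data = features). Note that model.matrix expects a formula as its first argument; ~ . - 1 uses every column and drops the intercept column.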
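A minimal sketch of the explicit encoding step, assuming predictors shaped like those in the question (the data frame below is invented for illustration). `model.matrix()` is from base R's stats package and takes a formula, so the data frame goes in the `data` argument.

```r
# Toy predictors (hypothetical values) with the column types described in the question
features <- data.frame(
  Length = c(5, 0.2, 1.1, 2.4),
  Gender = factor(c(1, 2, 2, 1)),
  Age    = c(80, 35, 55, 41),
  Period = factor(c(3, 5, 4, 3))
)

# Expand factors into 0/1 indicator columns; "- 1" drops the constant intercept column
features_num <- model.matrix(~ . - 1, data = features)

all_numeric <- is.numeric(features_num)
encoded_names <- colnames(features_num)
```

The resulting numeric matrix can be passed as `x` in place of `as.matrix(xtrain)`. Encoding the full data set before the train/test split is the safer order, since it gives both subsets the same indicator columns even when a level is missing from one of them.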

Agile Bean

I had the same problem with the e1071 package in R. I solved it by changing all variables to numeric instead of factor, except the decision variable (y), which can be either a factor (for classification tasks) or numeric (for regression).
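One pitfall with this conversion is worth a short sketch (the factor below is a made-up example): calling `as.numeric()` directly on a factor returns the internal level codes, not the numbers shown in the labels. Going through `as.character()` first recovers the labelled values.

```r
# Factor whose labels are numbers stored as text (hypothetical example)
f <- factor(c("10", "20", "10", "30"))

codes  <- as.numeric(f)                # internal level codes: 1 2 1 3
values <- as.numeric(as.character(f))  # labelled values: 10 20 10 30
```

This only works for factors with numeric labels; text labels would become `NA`. It also treats the categories as points on a numeric scale, which may not be meaningful for nominal variables such as Gender, so one-hot encoding is usually the more cautious choice for those.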

References:

CRAN Package 'e1071'

DRTorresRuiz