"Class variable needs to be a factor" error for csv-read datasets

Question

I am looking to discretise continuous features in machine-learning datasets, in particular using supervised discretisation. It turns out that R has a package/method for this, great! But since I am not proficient in R, I have run into some issues, and I would greatly appreciate it if you could help.

I get an error

class variable needs to be a factor.

I looked at an example online, and they do not seem to have this problem, but I do. Note that I do not quite understand the syntax V2 ~ ., other than that V2 should be a column name.

library(caret)
library(Rcpp)
library(arulesCBA)

filename <- "wine.data"
dataset <- read.csv(filename, header=FALSE)
dataset2 <- discretizeDF.supervised(V2 ~ ., dataset, method = "mdlp")

R reports the following error:

Error in .parseformula(formula, data) : class variable needs to be a factor!

You may find the dataset wine.data here: https://pastebin.com/hvDbEtMN

The first parameter of discretizeDF.supervised is a formula, and that seems to be the problem.

Please help! Thank you in advance.

Try dataset$V2 <- as.factor(dataset$V2); it sets V2 as a factor. "V2 ~ ." is your formula, stating that V2 is the response to everything on the right side of the tilde (the explanatory variables); "." mostly just means every other column. — user12256545, Apr 01 '20 at 07:38
Thank you for the clarification! That helped. Btw setting the column as a factor rather than integers solved the problem. — rusty_lurker, Apr 02 '20 at 19:57
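To make the formula syntax from the comment above concrete, here is a minimal base-R sketch. The toy data frame `df` is made up for illustration and only mimics the V1, V2, ... column names that read.csv produces with header=FALSE; it is not the wine data.

```r
# Toy data frame (hypothetical values) with read.csv-style column names.
df <- data.frame(V1 = c(1L, 2L, 3L),
                 V2 = c(14.2, 13.2, 13.2),
                 V3 = c(1.71, 1.78, 2.36))

f <- V2 ~ .

# terms() expands "." against the columns of `data`,
# leaving out the variable on the left-hand side.
tt <- terms(f, data = df)

response   <- all.vars(f)[1]            # left-hand side of the tilde
predictors <- attr(tt, "term.labels")   # what "." stands for here

response
predictors
```

So in `V2 ~ .` the left side names the class (response) column, and the dot is shorthand for all remaining columns of the data frame passed alongside the formula.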

score 2 · Accepted Answer · answered Apr 01 '20 at 09:26

As written in the vignette, this is meant to implement:

several supervised methods to convert continuous variables into a categorical variables (factor) suitable for association rule mining and building associative classifiers.

If you look at your V2 column, it's continuous:

test = read.csv("wine_dataset.txt",header=FALSE)
str(test)
'data.frame':   178 obs. of  14 variables:
 $ V1 : int  1 1 1 1 1 1 1 1 1 1 ...
 $ V2 : num  14.2 13.2 13.2 14.4 13.2 ...
 $ V3 : num  1.71 1.78 2.36 1.95 2.59 1.76 1.87 2.15 1.64 1.35 ...
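The same check can be done on just the columns of interest. This is a small sketch on a made-up data frame (`toy` and its values are hypothetical stand-ins, not the wine file), showing that neither an integer nor a numeric column counts as a factor:

```r
# Hypothetical stand-in for the first two columns of the CSV.
toy <- data.frame(V1 = c(1L, 1L, 2L),
                  V2 = c(14.2, 13.2, 13.2))

# Class of every column at a glance.
col_classes <- sapply(toy, class)
col_classes

# The check that matters for the class variable in the formula.
is.factor(toy$V1)   # FALSE
is.factor(toy$V2)   # FALSE
```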

What you need is a target that is categorical (a factor), so that the algorithm can discretize the other, continuous columns with respect to that class. For example:

# this cuts V2 into 4 categories according to where the values fall in the range
test$V2 = factor(cut(test$V2,4,labels=1:4))
# use the modified data frame (test), now that V2 is a factor
dataset2 <- discretizeDF.supervised(V2 ~ ., test, method = "mdlp")
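How the target is cut matters. As a rough illustration only, the sketch below compares the equal-width cut used here with a quantile-based cut on a small made-up vector `v` (hypothetical values, not taken from the wine file); which one is appropriate depends on the data.

```r
# Hypothetical values standing in for a continuous column.
v <- c(11.0, 12.1, 12.4, 13.2, 13.2, 13.7, 14.2, 14.4)

# Equal-width bins: the range of v is split into 4 intervals of equal length.
equal_width <- cut(v, 4, labels = 1:4)

# Equal-frequency bins: break points are the quartiles of v.
q_breaks   <- quantile(v, probs = seq(0, 1, 0.25))
equal_freq <- cut(v, breaks = q_breaks, labels = 1:4, include.lowest = TRUE)

# Compare how many observations land in each bin.
table(equal_width)
table(equal_freq)
```

Both calls return a factor, which is what the class variable in the formula has to be.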

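Cutting V2 is one workaround, but you need to find a sensible way to cut it. If you need to keep the target continuous, you can use discretizeDF from arules instead, which discretizes columns without needing a class variable. I also see that your first column contains only 1, 2, 3, so the example below leaves the first two columns as they are: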

test = read.csv("wine_dataset.txt",header=FALSE)
test2 = data.frame(test[,1:2],discretizeDF(test[,-c(1:2)]))
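Since the first column holds only the labels 1, 2, 3, another option is to use it as the class variable after converting it to a factor. The following is a sketch under assumptions: the data frame `wine` is randomly generated stand-in data with made-up means (not the real file), and the arulesCBA call is guarded so the snippet still runs when the package is not installed.

```r
# Randomly generated stand-in for the wine data (hypothetical values).
set.seed(1)
wine <- data.frame(
  V1 = rep(1:3, each = 20),   # integer class label, as read from the CSV
  V2 = c(rnorm(20, 13.7, 0.4), rnorm(20, 12.3, 0.4), rnorm(20, 13.1, 0.4)),
  V3 = c(rnorm(20, 2.0, 0.5), rnorm(20, 1.9, 0.5), rnorm(20, 3.3, 0.5))
)

is.factor(wine$V1)            # FALSE: an integer column is not a factor
wine$V1 <- factor(wine$V1)    # convert the class column to a factor

# Only run the supervised discretization if arulesCBA is available.
if (requireNamespace("arulesCBA", quietly = TRUE)) {
  wine_disc <- arulesCBA::discretizeDF.supervised(V1 ~ ., wine, method = "mdlp")
  str(wine_disc)
}
```

With the real file, the same idea would be to convert the first column to a factor right after read.csv and use `V1 ~ .` as the formula.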

Thank you for the answer! Indeed I needed a categorical target, but I got the same error when I used the appropriate columns. But, as you noted, the main problem was that the target column was not specified as a factor, which should/could have been done by cutting continuous data or using already integer column and converting it to a factor. I fixed that and it worked! — rusty_lurker, Apr 02 '20 at 19:58