
I want to use bnlearn for a classification task with the Naive Bayes algorithm.

I use this data set for my tests, where 3 variables are continuous (V2, V4, V10) and the others are discrete. As far as I know bnlearn cannot work with continuous variables, so they need to be converted to factors or discretized (a discretize() sketch is included at the end of this question). For now I want to convert all the features into factors. However, I ran into some problems. Here is a code sample:

library(bnlearn)

dataSet <- read.csv("creditcard_german.csv", header = FALSE)
# ... split into trainSet and testSet ...

trainSet[] <- lapply(trainSet, as.factor)
testSet[] <- lapply(testSet, as.factor)

# V25 is the class variable
bn = naive.bayes(trainSet, training = "V25")
fitted = bn.fit(bn, trainSet, method = "bayes")
pred = predict(fitted, testSet)

...

For this code I get the following error message when calling predict():

'V1' has different number of levels in the node and in the data.

And when I remove V1 from the training set, I get the same error for V2. However, the error disappears if I factorize the full data set first (dataSet[] <- lapply(dataSet, as.factor)) and only then split it into training and test sets.
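The mismatch is easy to confirm: because each split is factorized on its own, a category that occurs in only one split gives that column a different set of levels in the two data frames. A minimal check, using V1 as an example:

# Levels present in one split but not the other; a non-empty result
# here is what makes predict() complain about V1.
setdiff(levels(trainSet$V1), levels(testSet$V1))
setdiff(levels(testSet$V1), levels(trainSet$V1))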

So what is an elegant solution for this? In real-world applications the training and test sets can come from different sources. Any ideas?
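As an aside to the all-factors approach above, the three continuous columns could instead be binned with bnlearn's own discretize() before the remaining columns are turned into factors. A minimal sketch, assuming quantile binning with 3 breaks (both are arbitrary choices here):

library(bnlearn)

cont <- c("V2", "V4", "V10")
# Bin only the continuous columns into 3 quantile-based intervals,
# then convert everything else to factors as before.
dataSet[cont] <- discretize(dataSet[cont], method = "quantile", breaks = 3)
dataSet[] <- lapply(dataSet, as.factor)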

nabroyan

1 Answer


The issue was caused by the train and test data sets having different factor levels. I solved it by using rbind to combine the two data frames (train and test), applying as.factor to get the full set of levels for the complete data set, and then slicing the factorized data frame back into separate train and test sets.

library(bnlearn)

train <- read.csv("train.csv", header = FALSE)
test <- read.csv("test.csv", header = FALSE)
len_train = nrow(train)
len_test = nrow(test)

# Combine both sets so every column gets the full set of factor levels,
# then split the factorized data frame back into the original rows.
complete <- rbind(train, test)
complete[] <- lapply(complete, as.factor)
train = complete[1:len_train, ]
test = complete[(len_train + 1):(len_train + len_test), ]

bn = naive.bayes(train, training = "V25")
fitted = bn.fit(bn, train, method = "bayes")
pred = predict(fitted, test)

I hope this can be helpful.
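A sketch of an alternative that avoids stacking the two data frames, starting again from the freshly read train and test sets: factorize the training set as usual and rebuild every test column with the levels taken from the corresponding training column. The test set may then contain any subset of the training levels; values never seen in training become NA.

train[] <- lapply(train, as.factor)
# Re-create each test column as a factor with the training column's levels;
# categories absent from the training data end up as NA.
test[] <- Map(function(te, tr) factor(te, levels = levels(tr)), test, train)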

mikeck
MBe
  • But why should the test data have full representation of all levels that are in the training set? Shouldn't test data be allowed to have a subset of the factors in the training data? – Lewis Munene Feb 27 '20 at 09:45