I'm using the caret package to train a model for text classification, but I've run into a problem that bugs me and I can't find a proper solution.
I have a training data.frame like this:
training <- data.frame(x = c(0,0,1),y = c(0,1,0), z = c(1,1,1), result =c('good','good','bad'))
training
x y z result
1 0 0 1 good
2 0 1 1 good
3 1 0 1 bad
So I train my model like this:
library(caret)
svm_mod <- train(result ~ ., data = training, method = "svmLinear")
# There were 42 warnings (use warnings() to see them); they are not the point of the question
Let's skip the testing part and assume it's fine.
Now comes the real work, i.e. predicting unknown data. My problem is that the "applying" data can have different columns from the training dataset, and prediction does not always work:
# if the columns are the same, it's ok
applying <- data.frame(x = c(0,0,1),y = c(0,1,0), z = c(0,1,1))
predict(svm_mod, applying)
# if applying has more columns than train, it's ok
applying <- data.frame(x = c(0,0,1),y = c(0,1,0), z = c(0,1,1), k=c(1,1,1))
predict(svm_mod, applying)
# if applying is missing a column that is in train, it does not work:
applying <- data.frame(x = c(0,0,1),y = c(0,1,0))
predict(svm_mod, applying)
# Error in eval(predvars, data, env) : object 'z' not found
One solution would be to add every training column that is missing from the applying dataset, filled with 0s (e.g. applying$z <- 0), but I don't find that very correct/nice. Is there a proper way to do this? I've read several questions about this (my favourite is this; my question is about finding a workaround for this issue).
My data are phrases, and I'm using a document-term matrix as input. In a production environment, this means new input will arrive without some of the columns that were in train.
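For reference, here is a minimal sketch of the zero-filling workaround described above, generalised to any number of missing columns. It uses only base R; `align_to_training` and `train_cols` are hypothetical names I made up for illustration, not part of caret, and it assumes that a missing term column can safely be treated as a count of 0.

```r
# Hypothetical helper (not part of caret): make `newdata` carry exactly the
# predictor columns that were present at training time.
align_to_training <- function(newdata, train_cols) {
  # Columns the model expects but the new data lacks
  missing_cols <- setdiff(train_cols, names(newdata))
  # Add each missing column filled with 0 (assumption: term absent = count 0)
  for (col in missing_cols) {
    newdata[[col]] <- rep(0, nrow(newdata))
  }
  # Keep only the training columns, in training order; extra columns are dropped
  newdata[, train_cols, drop = FALSE]
}

training <- data.frame(x = c(0, 0, 1), y = c(0, 1, 0), z = c(1, 1, 1),
                       result = c('good', 'good', 'bad'))
# Predictor names are everything except the outcome column
train_cols <- setdiff(names(training), "result")

# New data that lacks `z` and has an unseen column `k`
applying <- data.frame(x = c(0, 0, 1), y = c(0, 1, 0), k = c(1, 1, 1))
aligned <- align_to_training(applying, train_cols)
```

With this, predict(svm_mod, aligned) should no longer stop with "object 'z' not found", but it is still the same workaround, just wrapped in a function, so the question of whether there is a cleaner way stands.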