R Caret, how to have the same features in training and applying dataset

Question

I'm working with the caret package, training a model for text classification, and I've hit a problem for which I can't find a proper solution.

I have a training data.frame like this:

training <- data.frame(x = c(0,0,1),y = c(0,1,0), z = c(1,1,1), result =c('good','good','bad'))
training
  x y z result
1 0 0 1   good
2 0 1 1   good
3 1 0 1    bad

So I train my model like this:

library(caret)
svm_mod <- train(result ~ ., data = training, method = "svmLinear")
# There were 42 warnings (use warnings() to see them)
# Some warnings, not the point of the question

Let's skip the testing part and assume it is fine.

Now comes the real work, i.e. predicting on unknown data. My problem is that the "applying" data can have different columns from the training dataset, and prediction does not always work:

# if the columns are the same, it's ok
applying <- data.frame(x = c(0,0,1),y = c(0,1,0), z = c(0,1,1))
predict(svm_mod, applying)

# if applying has more columns than training, it's ok
applying <- data.frame(x = c(0,0,1),y = c(0,1,0), z = c(0,1,1), k=c(1,1,1))
predict(svm_mod, applying)

# if applying is missing a column that is in training, it does not work:
applying <- data.frame(x = c(0,0,1),y = c(0,1,0))
predict(svm_mod, applying)
# Error in eval(predvars, data, env) : object 'z' not found

The obvious workaround is to add every training column that is missing as 0s:

applying$z <- 0

in the applying dataset, but this does not feel correct or clean. Is there a proper way to do this? I've read several questions about this (my favourite is this; my question is about finding a workaround for that issue).
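
For reference, the zero-filling workaround above can be generalised so it does not have to be repeated column by column. This is only a minimal sketch: `align_columns` is a hypothetical helper name, not part of caret, and it assumes that 0 is a sensible fill value (as it is for term counts).

```r
# Hypothetical helper (not a caret function): give `newdata` exactly the
# predictor columns seen in training, in the same order.
align_columns <- function(newdata, train_cols, fill = 0) {
  # Columns the model expects but the new data lacks
  missing_cols <- setdiff(train_cols, names(newdata))
  for (col in missing_cols) {
    newdata[[col]] <- rep(fill, nrow(newdata))
  }
  # Drop extra columns and restore the training column order
  newdata[, train_cols, drop = FALSE]
}

training <- data.frame(x = c(0, 0, 1), y = c(0, 1, 0), z = c(1, 1, 1),
                       result = c("good", "good", "bad"))
train_cols <- setdiff(names(training), "result")

# `z` is missing and `k` is extra
applying <- data.frame(x = c(0, 0, 1), y = c(0, 1, 0), k = c(1, 1, 1))
aligned <- align_columns(applying, train_cols)
```

The result could then be passed to `predict(svm_mod, aligned)`. Storing `train_cols` alongside the fitted model keeps the production code independent of the original training data.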

My data are phrases, and I'm using a document-term matrix as input. In a production environment, this means new inputs will keep arriving that lack some of the columns seen in training.
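
Since the inputs come from a document-term matrix, one option may be to fix the vocabulary at the moment the matrix is built rather than patching columns afterwards. The sketch below assumes the tm package and its `dictionary` control option. The phrases are made up for illustration, and it has not been checked against the real pipeline.

```r
library(tm)

# Toy training phrases (invented for this example)
train_corpus <- VCorpus(VectorSource(c("good product", "bad service")))
train_dtm <- DocumentTermMatrix(train_corpus)

# A new phrase with an unseen word ("today") that also lacks
# some of the training terms
new_corpus <- VCorpus(VectorSource("good service today"))

# Restrict the new matrix to the training vocabulary: terms absent from
# the new text become all-zero columns, unseen terms are dropped
new_dtm <- DocumentTermMatrix(
  new_corpus,
  control = list(dictionary = Terms(train_dtm))
)

applying <- as.data.frame(as.matrix(new_dtm))
```

Here `applying` should have the same term columns as the training matrix, so the same preprocessing can feed both `train()` and `predict()`.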

Does this answer your question? [How to recreate same DocumentTermMatrix with new (test) data](https://stackoverflow.com/questions/16630627/how-to-recreate-same-documenttermmatrix-with-new-test-data) — missuse, Feb 27 '20 at 12:26
@missuse I'm using tm package. I'm going to look at the suggested question, thanks a lot. — s__, Feb 27 '20 at 12:39
I am glad if I can help. I suggest using text2vec as per its creators suggestion (Dmitriy Selivanov - 2nd answer in the linked question). — missuse, Feb 27 '20 at 12:47
