Thanks for your interest and help.

I built a kernel SVM classifier in R on a training dataset of 30,000 rows.

I used around 2,000 word features to train the classifier. It worked very well.

However, a problem occurred when I tried to apply the classifier to a new text dataset: the new document-term matrix does not contain all 2,000 word features (columns) that the classifier was trained on.

Of course, I can build a classifier with a small number of word features. Then, it works on the new text data, but the performance is not that good.

So, how do you handle a new text dataset that does not contain all the word features used by the SVM classifier?

1 Answer

I am answering my own question for the benefit of other users.

I think I have found a solution.

The problem is that the document-term matrices (DTMs) of the training set and the unseen dataset have different columns (word features).

So, when building the DTM for the unseen dataset, use the terms of the training set's DTM as the dictionary.

For example,

library(tm)

# Terms (columns) of the training DTM, used as the dictionary for the unseen DTM
features <- Terms(trainset_dtm)

unseen_dtm <- DocumentTermMatrix(unseen_corpus, control = list(dictionary = features))

As a result, both DTMs (train and unseen) have the same columns, so the SVM works on unseen_dtm.
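
If the unseen DTM has already been built without the dictionary, the same alignment can probably be done afterwards on a plain matrix. The following is only a minimal base-R sketch of that idea, not part of the tm API: the helper name align_to_features and the toy data are hypothetical, and it assumes the DTM has first been converted with as.matrix(). Terms missing from the unseen data become all-zero columns, and terms the classifier never saw are dropped.

```r
# Hypothetical helper: align a document-term matrix to a fixed feature set.
align_to_features <- function(dtm_mat, features) {
  # Start from an all-zero matrix with one column per training feature
  aligned <- matrix(0, nrow = nrow(dtm_mat), ncol = length(features),
                    dimnames = list(rownames(dtm_mat), features))
  # Copy over the counts for terms that occur in both matrices
  shared <- intersect(colnames(dtm_mat), features)
  aligned[, shared] <- dtm_mat[, shared, drop = FALSE]
  aligned
}

# Toy data (made up): training features and an unseen matrix
# whose vocabulary only partly overlaps with them
features <- c("good", "bad", "movie")
unseen_mat <- matrix(c(1, 0, 2, 3), nrow = 2,
                     dimnames = list(c("doc1", "doc2"), c("movie", "plot")))

aligned <- align_to_features(unseen_mat, features)
```

Here aligned has exactly the columns in features, in the same order, so it should line up with the columns the classifier was trained on.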