I am trying to classify tweets into two categories (e.g., basketball and non-basketball). The dataset is dynamic, i.e., the document collection is not fixed to a set of N documents (i.e., tweets): it keeps growing as one crawls Twitter.
A natural candidate is the Naive Bayes classifier, which is widely used for text classification. An explanation is provided here. However, one doubt remains.
I could compute the model from the training set (taking the vocabulary V to be the set of terms that occur in the training set). Now suppose I collect a new, unclassified tweet that contains terms that are not in V (i.e., terms that did not appear in the training set). Is the Naive Bayes classifier still applicable?
Generalizing the question: can the Naive Bayes classifier be applied to those cases in which the vocabulary is not entirely known?
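To make the setup concrete, here is a minimal sketch of what I have in mind. It assumes a multinomial Naive Bayes model with add-one (Laplace) smoothing, and the function names (`train`, `classify`) and the toy tweets are made up for illustration. The line marked as an assumption is exactly the part I am unsure about: it simply skips any term of the new tweet that is not in V.

```python
import math
from collections import Counter


def train(docs):
    """docs: list of (tokens, label) pairs. Returns (priors, term_counts, vocab)."""
    class_docs = Counter(label for _, label in docs)
    term_counts = {label: Counter() for label in class_docs}
    for tokens, label in docs:
        term_counts[label].update(tokens)
    # V = every term seen in the training set
    vocab = {t for tokens, _ in docs for t in tokens}
    priors = {c: n / len(docs) for c, n in class_docs.items()}
    return priors, term_counts, vocab


def classify(tokens, priors, term_counts, vocab):
    """Return the class with the highest log-posterior score."""
    scores = {}
    for c in priors:
        total = sum(term_counts[c].values())
        score = math.log(priors[c])
        for t in tokens:
            if t not in vocab:
                # ASSUMPTION: terms outside V are ignored. This is the
                # step the question is about, not an established answer.
                continue
            # Add-one smoothing over the training vocabulary
            score += math.log((term_counts[c][t] + 1) / (total + len(vocab)))
        scores[c] = score
    return max(scores, key=scores.get)


# Toy training set (hypothetical tweets, already tokenized)
train_set = [
    ("lakers win the game".split(), "basketball"),
    ("great dunk in the game".split(), "basketball"),
    ("new phone released today".split(), "non-basketball"),
    ("rain expected today".split(), "non-basketball"),
]
priors, term_counts, vocab = train(train_set)

# "amazing", "by" and "lebron" never appeared in the training set
label = classify("amazing dunk by lebron".split(), priors, term_counts, vocab)
print(label)
```

With this sketch the new tweet is scored only on "dunk", the one term it shares with V. Whether dropping the unseen terms like this is legitimate is what I would like to understand.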
Thank you in advance.