
I am trying to classify tweets into two categories (e.g., basketball and non-basketball). Obviously, the dataset is dynamic, i.e., the document collection is not fixed to a set of N documents (tweets): it keeps growing as one crawls Twitter.

One obvious approach is the Naive Bayes classifier, which is widely used for text classification. An explanation is provided here. However, one doubt remains.

I could build the model from the training set (taking the vocabulary V to be the set of terms that occur in the training set). Now, a newly collected, unclassified tweet may contain terms that are not present in V (i.e., terms that did not appear in the training set). Is the Naive Bayes classifier still applicable?

Generalizing the question: can the Naive Bayes classifier be applied to those cases in which the vocabulary is not entirely known?

Thank you in advance.

Eleanore

2 Answers


The easiest thing to do for words in the test set that are not in the training set is to just ignore them.

You could do fancier things, like measuring which class tends to have more unseen/rare words, or using word shaping to map unseen words onto more general, observed word classes (for example, treating all numbers the same). A rough sketch of both ideas follows below.
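For instance, here is a minimal sketch (the function names and the toy data are made up for illustration) of a multinomial Naive Bayes classifier that simply skips out-of-vocabulary tokens and applies one word-shaping rule, collapsing all numbers into a single token:

    import math
    import re
    from collections import Counter, defaultdict

    def shape(token):
        # Word-shaping rule: collapse every number into one token, so an
        # unseen number still maps to a feature that may exist in training.
        return "<NUM>" if re.fullmatch(r"\d+", token) else token.lower()

    def train(labeled_tweets):
        """labeled_tweets: iterable of (text, label). Returns per-class
        token counts, per-class document counts, and the vocabulary."""
        token_counts = defaultdict(Counter)
        class_counts = Counter()
        vocab = set()
        for text, label in labeled_tweets:
            tokens = [shape(t) for t in text.split()]
            token_counts[label].update(tokens)
            class_counts[label] += 1
            vocab.update(tokens)
        return token_counts, class_counts, vocab

    def classify(text, token_counts, class_counts, vocab, alpha=1.0):
        """Multinomial Naive Bayes with add-one smoothing; tokens never seen
        in training (even after shaping) are ignored."""
        total_docs = sum(class_counts.values())
        tokens = [shape(t) for t in text.split()]
        best_label, best_score = None, float("-inf")
        for label, counts in token_counts.items():
            total_tokens = sum(counts.values())
            score = math.log(class_counts[label] / total_docs)  # log prior
            for tok in tokens:
                if tok not in vocab:      # out-of-vocabulary: skip
                    continue
                # Smoothed log likelihood, so in-vocabulary tokens unseen
                # for *this* class still get a non-zero probability.
                score += math.log((counts[tok] + alpha) /
                                  (total_tokens + alpha * len(vocab)))
            if score > best_score:
                best_label, best_score = label, score
        return best_label

    # Toy usage
    model = train([("lebron scores 30 points", "basketball"),
                   ("new pasta recipe tonight", "other")])
    # "dunks" is unseen and ignored; "40" is shaped to <NUM>, which was seen.
    print(classify("lebron dunks 40", *model))  # -> "basketball"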

Rob Neuhaus
  • Unfortunately I don't know these "fancy" techniques. Do you have some references so I can learn more? Thank you so much! – Eleanore Apr 25 '14 at 23:48
  • Moreover: does ignoring new words (as you first suggest) affect the result quality much? – Eleanore Apr 25 '14 at 23:51

can the Naive Bayes classifier be applied to those cases in which the vocabulary is not entirely known?

If words in the test set are not in the training set, they are given class-conditional probabilities of 0. Because the Naive Bayes classifier takes a product over all words in a test document, a single word in the test document that was never seen in training would drive the probability of the document belonging to the class under consideration to zero.

The trick that is applied (and I think you are asking for) is called (Laplace) smoothing: adding 1 to the count of every term in the vocabulary, so that no term gets a zero probability estimate. It is a default setting in many libraries, for example in Python's Scikit-Learn:

http://scikit-learn.org/stable/modules/generated/sklearn.naive_bayes.MultinomialNB.html#sklearn.naive_bayes.MultinomialNB
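As a rough sketch of how this looks in practice (the example texts and labels are invented), CountVectorizer fixes the vocabulary at fit time and silently drops unknown terms at transform time, while MultinomialNB applies add-one (Laplace) smoothing by default via alpha=1.0:

    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.naive_bayes import MultinomialNB

    train_texts = ["lebron hits a buzzer beater", "pasta recipe for tonight"]
    train_labels = ["basketball", "non-basketball"]

    # The vectorizer fixes the vocabulary V at fit time; at transform time
    # any term not in V is silently dropped.
    vectorizer = CountVectorizer()
    X_train = vectorizer.fit_transform(train_texts)

    # alpha=1.0 is Laplace (add-one) smoothing and is the default.
    clf = MultinomialNB(alpha=1.0)
    clf.fit(X_train, train_labels)

    # "dunks" and "again" were never seen in training, so they contribute
    # nothing to the score; "lebron" still pulls the tweet toward basketball.
    X_test = vectorizer.transform(["lebron dunks again"])
    print(clf.predict(X_test))  # ['basketball']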

Keyb0ardwarri0r