
I am implementing a text classification system using Mahout. I have read that stop-word removal and stemming help to improve the accuracy of text classification. In my case removing stop-words gives better accuracy, but stemming is not helping much: I see a 3-5% decrease in accuracy after applying a stemmer. I tried both the Porter stemmer and K-stem and got almost the same result in both cases.

I am using the Naive Bayes algorithm for classification.

Any help is greatly appreciated.

GS Majumder
  • What kind of text and what classes do you use? – ffriend Mar 24 '14 at 07:29
  • Texts are from different websites, mostly unstructured. Classes are like sports, business, government, etc. – GS Majumder Mar 24 '14 at 07:34
  • Which pre-processing steps to use before text classification varies from corpus to corpus. You can try using lemmatization on your corpus instead of stemming to see if it increases the accuracy. A stemming algorithm produces the stems of the given words, while lemmatization is a more detailed process that even looks at the meaning of the generated word; if you are using a bag-of-words model, then stemming is supposed to increase the accuracy (see the sketch after these comments). Stemming caused a problem in my application when I was trying word-sense disambiguation, which is obvious. – sumitb.mdi Mar 24 '14 at 07:41
  • There is no definitive answer to this question. Some preprocessing steps will help sometimes, some will help almost never, and there are no real guarantees because there are so many other variables (method, feature selection, dimensionality reduction, etc.). The only way to find out whether something works for your problem is to try it. – Ben Allison Mar 24 '14 at 15:47
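
To make the stemming-vs-lemmatization contrast from the comments concrete, here is a minimal sketch using NLTK rather than Mahout (purely illustrative): the stemmer chops suffixes blindly, while the lemmatizer consults a dictionary and is sensitive to the part of speech you pass in.

```python
from nltk.stem import PorterStemmer, WordNetLemmatizer

stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()  # requires nltk.download('wordnet') once

for word in ["players", "running", "viewed"]:
    # The stemmer always strips suffixes; the lemmatizer only maps to a
    # dictionary form that exists for the given part of speech (noun by
    # default), otherwise it returns the word unchanged.
    print(word,
          "| stem:", stemmer.stem(word),
          "| noun lemma:", lemmatizer.lemmatize(word),
          "| verb lemma:", lemmatizer.lemmatize(word, pos="v"))
```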

1 Answer


First of all, you need to understand why stemming normally improves accuracy. Imagine the following sentence in a training set:

He played below-average football in 2013, but was viewed as an ascending player before that and can play guard or center.

and the following in a test set:

We’re looking at a number of players, including Mark

The first sentence contains a number of words referring to sports, including the word "player". The second sentence, from the test set, also mentions a player, but, oh, it's in the plural - "players", not "player" - so for the classifier it is a distinct, unrelated variable.

Stemming tries to cut off details like the exact form of a word and produce word bases as features for classification. In the example above, stemming could shorten both words to "player" (or even "play") and use them as the same feature, giving the classifier a better chance of labeling the second sentence as belonging to the "sports" class.
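
Here is a minimal sketch of that idea, assuming Python/NLTK instead of Mahout: before stemming the two sentences share no sports-related token, but after stemming "players" and "player" collapse into the same feature.

```python
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()

train = ("He played below-average football in 2013, but was viewed as "
         "an ascending player before that and can play guard or center")
test = "We're looking at a number of players, including Mark"

def stem_set(sentence):
    # Crude whitespace tokenization just for the demo; a real pipeline
    # would use a proper tokenizer.
    return {stemmer.stem(tok.strip(",.").lower()) for tok in sentence.split()}

# Shared features after stemming: "players" and "player" both map to
# the stem "player".
print(stem_set(train) & stem_set(test))  # -> {'player'}
```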

Sometimes, however, these details play an important role by themselves. For example, the phrase "runs today" may refer to a runner, while "long running" may be about a phone's battery lifetime. In this case stemming makes classification worse, not better.

What you can do here is use additional features that can help to distinguish between different meanings of the same words/stems. Two popular approaches are n-grams (e.g. bigrams - features made of word pairs instead of individual words) and part-of-speech (POS) tags. You can try any combination of them, e.g. stems + bigrams of stems, words + bigrams of words, stems + POS tags, or stems, bigrams and POS tags together.
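
For instance, "stems + bigrams of stems" could look like the following sketch, using scikit-learn and NLTK as illustrative stand-ins for Mahout (the documents here are toy data); POS tags could be appended to tokens in a similar preprocessing step, e.g. via nltk.pos_tag.

```python
from nltk.stem import PorterStemmer
from sklearn.feature_extraction.text import CountVectorizer

stemmer = PorterStemmer()

def stem_text(text):
    # Stem each whitespace token, then hand the result to the vectorizer.
    return " ".join(stemmer.stem(tok) for tok in text.lower().split())

docs = ["an ascending player can play guard",
        "a number of players are playing today"]

# ngram_range=(1, 2) emits both unigrams (stems) and bigrams of stems.
vectorizer = CountVectorizer(preprocessor=stem_text, ngram_range=(1, 2))
X = vectorizer.fit_transform(docs)
print(vectorizer.get_feature_names_out())
```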

Also, try out other algorithms. For example, SVM uses a very different approach than Naive Bayes, so it can catch things in the data that NB ignores.
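
Since both share the same feature pipeline, swapping classifiers is cheap to try. A minimal comparison sketch with scikit-learn (again, only illustrative; the corpus and labels below are made-up placeholders for your own data):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Toy corpus standing in for your website texts.
train_docs = ["he played football for the team",
              "stocks fell sharply in early trading",
              "the minister announced a new policy",
              "the team won the championship game"]
train_labels = ["sports", "business", "government", "sports"]

# Same features, two very different classifiers.
for clf in (MultinomialNB(), LinearSVC()):
    pipe = make_pipeline(TfidfVectorizer(), clf)
    pipe.fit(train_docs, train_labels)
    print(type(clf).__name__, pipe.predict(["the team played a great game"]))
```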

ffriend