Text Classification - using stemmer degrades results?

Question

There's this article about sentiment analysis of Arabic.

In the beginning of page 5 it says that:

"Experiments also show that stemming words before feature extraction and classification nearly always degrades the results".

Later on in the same page, they state that:

"...and an Arabic light stemmer is used for stemming the words"

Um, I thought that a stemmer/lemmatizer was always used before text classification. Why does the article say that it degrades the results?

Thanks :)

score 5 · Accepted Answer · answered Jan 22 '14 at 22:13

I do not know Arabic, and it may be specific in many respects; my answer concerns English.

Um, I thought that a stemmer/lemmatizer was always used before text classification. Why does the article say that it degrades the results?

No, it is not; it depends entirely on the task. If you want to extract some general concept of the text, then stemming/lemmatization is a good step. But in the analysis of short chunks, where each word is valuable, stemming simply destroys its meaning. In particular, in sentiment analysis stemming may destroy the sentiment of a word.
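To make the information loss concrete, here is a minimal sketch in plain Python. The `toy_stem` function and its suffix list are hypothetical and written only for illustration; they are not the Porter, Lancaster, or any Arabic stemmer, which use far richer rules. The word list is the one used as an example in the comments on this answer.

```python
from collections import Counter

# Hypothetical toy suffix list, for illustration only.
# Real stemmers apply many more rules and conditions.
SUFFIXES = ("ening", "ing", "est", "er", "ed", "s")


def toy_stem(word: str) -> str:
    """Strip the first matching suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def bag_of_words(text: str, stem: bool = False) -> Counter:
    """Count whitespace-separated tokens, optionally stemming them first."""
    tokens = text.lower().split()
    if stem:
        tokens = [toy_stem(token) for token in tokens]
    return Counter(tokens)


words = ["hard", "harder", "hardening", "hards"]
stems = [toy_stem(word) for word in words]

raw_features = bag_of_words(" ".join(words))
stemmed_features = bag_of_words(" ".join(words), stem=True)

print(stems)
print(len(raw_features), "distinct features without stemming")
print(len(stemmed_features), "distinct features with stemming")
```

Without stemming, a bag-of-words classifier sees each surface form as its own feature; after stemming they collapse into a single feature, so any difference in meaning or sentiment between them is no longer visible to the model.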


lejlot


Hi @lejlot :) First thing: thanks. A question: "In analysis of short chunks, stemming simply destroys the meaning"?! O_O Can you please provide an explanation/example/source? And I gave the second quote to show that in the end they _did_ use the stemmer... even though they shouldn't have? (BTW: I don't know Arabic either, but I guess the major difference is in its being a _very_ inflected language). – Cheshie Jan 22 '14 at 22:25
Stemming is just a set of rules for shortening a word, and the word can lose its meaning in the process. What is so surprising about that? Any manipulation that reduces the amount of data reduces the amount of information. Consider the Lancaster stemmer and the words hard, harder, hardening, hards: they have completely different meanings in English, yet all have **the same stem** "hard", so the process loses a lot of information. – lejlot Jan 22 '14 at 22:31
Regarding the quotation: I did not read the paper, as it is far from my interests. But first, they state that stemming *nearly always* degrades results, not *always*; maybe in their case it didn't, so they could stem. Second, they state that they use a **light** stemmer, which might be "light" in the sense of how much meaning is lost. For example, the WordNet lemmatizer is much lighter than the Lancaster stemmer. – lejlot Jan 22 '14 at 22:33
...And yet it's still being used? I guess it doesn't mean much, but I've read quite a lot and I can't remember seeing any text classification done without stemming. Have you...? Thanks again @lejlot – Cheshie Jan 22 '14 at 22:34
Yes, my comment states that this may be **the case** there; again, **light** may be the crucial point here. And yes, I've seen dozens of such cases (classification without stemming), especially modern approaches based on more advanced models than a simple bag-of-words representation, but SO is not the place for such discussions. – lejlot Jan 22 '14 at 22:36
OK... if you have a link of such an article or something like that (of a bag of words model, preferably, without stemming), I'd really appreciate it. Thanks @lejlot :) – Cheshie Jan 22 '14 at 22:41
