SnowballC in R stems "many" and "only"

Question

I am using SnowballC to process a text document, but realize it stems words such as "many" and "only" even though they are not supposed to be stemmed.

> library(SnowballC)
> 
> str <- c("many", "only", "things")
> str.stemmed <- stemDocument(str)
> str.stemmed
[1] "mani"  "onli"  "thing"
> 
> dic <- c("many", "only", "online", "things")
> str.complete <- stemCompletion(str.stemmed, dic)
> str.complete
    mani     onli    thing 
      "" "online" "things"

You can see that after stemming, "many" and "only" became "mani" and "onli", which cannot be completed back with stemCompletion later on, since letters in "many" is not inclusive of "mani". Notice how "onli" gets completed to "online" instead of the original "only".

Why is that? Is that a way to fix this?

It's worth noting that the text processing functions you're using come from the `tm` package. `SnowballC` is irrelevant in this case, unless you get different results when the library is not attached to the namespace. — Alex A., Apr 28 '15 at 17:43

score 3 · Answer 1 · answered Apr 29 '15 at 17:09

Stemming is often executed as a set of rules from stripping all affixes--both derivational and inflectional--from a word, leaving its root. Lemmatization typically only removes inflectional affixes. Stemming is a much more aggressive version of lemmatization. Given what you want, it seems like you'd prefer lemmatization.

To compare the two, most lemmatizers are limited to a few rules for dealing with affixes to nouns and verbs in English---ed, -s, -ing, for example. There are a few irregular cases they have to handle, but with some training data, many are probably covered.

Stemmers are expected to dig deeper. As a result, the space of possible transformations they can make is bigger, so you're a lot more likely to end up with errors.

To see what's happening in your data, let's look at the specifics.

online -> onli: why on earth would this happen? Not totally sure on this one; there's probably some rule that tries to cater to words like medic-ine and medic-al, sub-mari-ne and mari-ne, imagi-ne and imagi-na-tion.

only -> onli, many -> mani: These seem particularly strange, but are probably more reasonable than the previous rule--especially in the context of dealing with verbs that end in -ed. If you're stemming the words denied, studied, modified, specified, you'll want them to be equivalent to their uninflected forms deny, study, modify, specify. You could have a rule to transform each verb into the uninflected form, but the authors here chose to make the roots the forms ending in -i. To ensure that these match, -y endings had to be transformed to -i as well.

With a lemmatizer, you might get more predictable results. Since they only remove inflectional affixes, you'd get only, many, online, and thing, as you wanted. Both a good stemmer and lemmatizer can work well, but the stemmer does more stuff and therefore has more room for error.

score 2 · Answer 2 · edited May 23 '17 at 12:14

That is how stemmers work. You've got a (smallish) set of rules that reduce most words to something resembling a canonical form (a stem), but not quite. There are many other corner cases you will find, so many in fact that I hesitate to call them corner cases, e.g.

many -> mani
other -> other
corner -> corner
cases -> case
in -> in
sentences -> sentenc

What you want is a lemmatiser. Have a look at this question for a more detailed explanation:

Stemmers vs Lemmatizers

SnowballC in R stems "many" and "only"

2 Answers2