
Below is how I stem my corpus and my documents. However, for example, "work" and "worked" show up a large amount of the time, and these are obviously the same word for all intents and purposes in my analysis. Is there a package or some code snippet to remove the "-ed" ending? Thanks!

library(tm)

# Build a corpus from the directory of documents
docs <- Corpus(DirSource(cname))

summary(docs)

library(SnowballC)
docs <- tm_map(docs, stemDocument)  # Porter-stem every document in the corpus
agunner
  • Do you want to convert worked to work or just want to remove ed in the word worked? – Saurabh Chauhan Mar 01 '17 at 02:34
  • That's a good question. I think I want them to be the same word. Ideally, I want all "ed"s from words to be removed so if the issue pops up again I get no double counting of a root word – agunner Mar 01 '17 at 02:50

1 Answer


That is a more complex question than you might think.

If you use stemming, the "-ed" endings will be removed from words without regard for their meaning or context. That reduces many past-tense words to their root word, and plurals to the singular.
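A minimal sketch of that behavior, assuming the SnowballC package is installed (`wordStem` applies the Porter stemmer to a character vector):

```r
library(SnowballC)

# Past-tense and plural forms collapse to the same stem
wordStem(c("worked", "working", "works"), language = "english")
# returns "work" "work" "work"
```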

However, you can also lose context this way. The true root of a word, the lexeme, has a meaning of its own, and that meaning is sometimes lost in stemming because different words evolve from the same root.

Imagine you stemmed and removed the s's from plurals:

So in this sentence... "She walks slowly."

and this sentence... "They came from all walks of life."

...you get the word walk.

Although they evolved from the same root word, they have different lexical meanings, and stemming the second version creates a contextual mismatch.
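To see the collapse concretely (again assuming SnowballC), both occurrences of "walks" stem to the same token even though one is a verb and one is a noun:

```r
library(SnowballC)

# The verb in "She walks slowly." and the noun in
# "They came from all walks of life." stem identically:
wordStem("walks", language = "english")
# returns "walk"
```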

In this case lemmatization would be a better choice (if the algorithm was solid and appropriate to your corpus), because it would preserve the underlying meaning of the lexeme behind the apparent sameness of the two different words.

Lemmatization is different from stemming in that it uses context to try to decide what the meaning of the root is, its lexeme, whereas stemming just trims back to the assumed root.

For really sensitive uses, it may be necessary. But in a large corpus it is oftentimes no more accurate than stemming unless handled masterfully.

If context matters, try the Wordnet lemmatization package:

Wordnet for R

If all you need is stemming, try using snowball in its simplest form to see if it gets you what you want:

docsStemmed <- wordStem(docs, language = "english")

from the "SnowballC" package. Be aware that your document must be a character vector to stem this way; wordStem returns another vector of stemmed words. It should remove the past-tense endings, and you can use it with tm as you have shown above.
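For example, a sketch of that character-vector requirement (the sentence here is made up for illustration): split the text into words first, then stem the resulting vector:

```r
library(SnowballC)

text <- "She worked and walked through the houses"

# Lowercase and split on whitespace to get a character vector of words
tokens <- unlist(strsplit(tolower(text), "\\s+"))

wordStem(tokens, language = "english")
```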

If you are not getting the results you want with that method, you likely need to groom the corpus more before stemming.

  • Reduce it to lowercase.
  • Remove punctuation.
  • Convert to plain text.
  • Purge emojis and any odd non-conforming symbols.

Once you get the document structured right, stemming is much more reliable. If you need help with tm and SnowballC, try sifting through the methods here and searching Stack Overflow for examples of their use:

tm & SnowballC docs

sconfluentus