That is a more complex question than you might think.
If you use stemming, the "-ed" endings are stripped from a word without regard for its meaning or context. That lets you reduce many past-tense words to their root, or plurals to the singular.
However, you can also lose context this way. The true root of the word, the lexeme, has a meaning of its own, and that meaning is sometimes lost in stemming because different words evolve from the same root.
Imagine you stemmed and removed the s's in plurals:
So in this sentence...
"She walks slowly."
and this sentence...
"They came from all walks of life."
...you get the word walk.
Although both evolved from the same root word, the verb and the noun have different lexical meanings, and stemming the second one creates a contextual mismatch.
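To make the collision concrete, here is a minimal sketch using wordStem from SnowballC. The `tokenize` helper and the variable names are illustrative assumptions, not part of any package:

```r
library(SnowballC)

# The two example sentences from above.
verb_use <- "She walks slowly."
noun_use <- "They came from all walks of life."

# Illustrative helper: lowercase, strip punctuation, split on whitespace.
tokenize <- function(x) {
  x <- tolower(gsub("[[:punct:]]", "", x))
  strsplit(x, "\\s+")[[1]]
}

verb_stems <- wordStem(tokenize(verb_use), language = "english")
noun_stems <- wordStem(tokenize(noun_use), language = "english")

# "walks" is the 2nd token of the first sentence and the 5th of the second.
verb_stems[2]
noun_stems[5]
```

Both calls should return the same stem, so anything downstream (a term-document matrix, for instance) can no longer tell the verb from the noun.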
In this case lemmatization would be a better choice (provided the algorithm is solid and appropriate to your corpus), because it would preserve the underlying meaning of the lexeme behind the apparent sameness of the two different words.
Lemmatization differs from stemming in that it uses context to try to decide what the meaning of the root is, its lexeme, whereas stemming just trims back to the assumed root.
For really sensitive uses, it may be necessary. But in a large corpus it is often no more accurate than stemming unless it is handled carefully.
If context matters, try the Wordnet lemmatization package:
Wordnet for R
If all you need is stemming, try using snowball in its simplest form to see if it gets you what you want:
library(SnowballC)
docsStemmed <- wordStem(docs, language = "english")
wordStem comes from the "SnowballC" package. Be aware that your document must be a character vector to stem this way; the function returns another character vector of stemmed words. It should remove the past-tense endings. You can use it with tm as you have shown above.
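If you would rather stay inside tm, a rough sketch of the same step looks like this. The `docs` contents and the `corpus` name are made up for illustration; stemDocument is tm's wrapper around the Snowball stemmer:

```r
library(tm)
library(SnowballC)

# Illustrative input: one element per document.
docs <- c("She walked slowly", "They walked home")

# Build a corpus from the character vector and stem every document.
corpus <- VCorpus(VectorSource(docs))
corpus <- tm_map(corpus, stemDocument)

# Inspect the stemmed text of the first document.
content(corpus[[1]])
```

Note that wordStem itself treats each element of its input as a single word, so when calling it directly you need to split your text into words first.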
If you are not getting the results you want with that method, you likely need to groom the corpus more before stemming:
- Reduce it to lowercase.
- Remove punctuation.
- Convert to plain text.
- Purge emojis and any odd non-conforming symbols.
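The grooming steps above can be sketched in base R as follows. `clean_text` is a hypothetical helper and `raw` is made-up input. The steps run in a slightly different order than the list, because tags have to be stripped before punctuation is removed, and the tag pattern only handles simple HTML-style markup:

```r
# Hypothetical helper, not part of tm or SnowballC.
clean_text <- function(x) {
  # Purge emojis and other non-ASCII symbols.
  x <- iconv(x, from = "UTF-8", to = "ASCII", sub = "")
  # Convert to plain text by stripping simple HTML-style tags.
  x <- gsub("<[^>]+>", " ", x)
  # Reduce to lowercase.
  x <- tolower(x)
  # Remove punctuation, then collapse the leftover whitespace.
  x <- gsub("[[:punct:]]", " ", x)
  x <- gsub("\\s+", " ", x)
  trimws(x)
}

# Made-up example input containing markup, mixed case, punctuation and an emoji.
raw <- "<p>She WALKED home, slowly!</p> \U0001F600"
clean_text(raw)
```

The cleaned string can then be split into words and passed to wordStem.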
Once you get the document structured right, stemming is much more reliable. If you need help with tm and SnowballC, try sifting through the methods here and searching the Stack Exchange sites for clarification on them:
tm & SnowballC docs