stemDocument in R reduces some words too much. How to adjust for that?

Question

I encountered a problem with the stemDocument function in R. As shown in the code below, I use the function correctly and there are no special symbols in my docs. The code runs with no warnings. However, some words in my text are cut too short.

For instance, failure, variable, application, change, and popular are transformed to failur, variabl, applic, chang, and popul. I understand that this happens because the function reduces words to their roots, but can we do something to make the results more readable when we want to present them to others (for example, in a word cloud figure)?

I know that the roots can be completed with the stemCompletion function, but that still requires specifying a relevant dictionary manually, which is tedious when many words are involved.

I was wondering whether there is a way to map words with the same root to a single form, as stemDocument does, but where the result is an actual word rather than a bare root (for example, the most frequently occurring word with that root in the document). I would really appreciate it if anyone could share some ideas.

library(tm)         # Corpus, tm_map, removeWords, stemDocument
library(SnowballC)  # stemmer backend used by stemDocument

# docs is assumed to be a character vector of raw texts
docs <- Corpus(VectorSource(docs))
# Convert the text to lower case
docs <- tm_map(docs, content_transformer(tolower))
# Remove numbers
docs <- tm_map(docs, removeNumbers)
# Remove common English stopwords
docs <- tm_map(docs, removeWords, stopwords("english"))
# Text stemming
docs <- tm_map(docs, PlainTextDocument) # not necessary
docs <- tm_map(docs, stemDocument)
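
One possible way to get what the question describes (group words by root, but display a real word) is to stem only for grouping and then replace each word with the most frequent original form that shares its stem. The following is a minimal sketch under that assumption, not a tested recipe for the corpus above: most_frequent_form is a hypothetical helper name, it works on a plain character vector of already tokenized, lower-cased words rather than on a tm corpus, and ties between equally frequent forms are broken alphabetically.

```r
library(SnowballC)  # wordStem() is the Porter stemmer from the SnowballC package

# Hypothetical helper: replace every word by the most frequent
# original word that shares its stem.
most_frequent_form <- function(words, language = "english") {
  stems <- wordStem(words, language = language)
  # For each stem, find the original form that occurs most often.
  # table() sorts forms alphabetically, so ties go to the first one.
  representative <- tapply(words, stems, function(w) {
    names(which.max(table(w)))
  })
  # Look up the representative for each word's stem and drop the names
  as.character(representative[stems])
}

words <- c("failure", "failures", "failure",
           "changes", "change", "changes", "changed")
most_frequent_form(words)
# "failure" "failure" "failure" "changes" "changes" "changes" "changes"
```

The result has the same length and order as the input, so it can be counted with table() and passed to a word cloud in place of the stemmed tokens. Note that this keeps the grouping errors of the stemmer itself (unrelated words that happen to share a stem are still merged); only the displayed label changes.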

Comment (Jun 11 '18 at 18:30): Use the udpipe R package (https://cran.r-project.org/web/packages/udpipe/index.html). It allows you to do lemmatisation. Lemmatisation is what you are looking for, not stemming.
