Why isn't stemDocument stemming?

Question

I am using the 'tm' package in R to create a term document matrix using stemmed terms. The process is completing, but the resulting matrix includes terms that don't appear to have been stemmed, and I'm trying to understand why that is and how to fix it.

Here is the script for the process, which uses a couple of online news stories as the sandbox:

library(boilerpipeR)
library(RCurl)
library(tm)

# Pull the relevant parts of the news stories using 'boilerpipeR' and 'RCurl'
url <- "http://blogs.wsj.com/digits/2015/07/14/google-mozilla-disable-flash-over-security-concerns/"
extract <- LargestContentExtractor(getURL(url))
url2 <- "http://www.cnet.com/news/startup-lands-100-million-to-challenge-smartphone-superpowers-apple-and-google/"
extract2 <- LargestContentExtractor(getURL(url2))

# Now put those text vectors in a corpus and create a tdm
news.corpus <- VCorpus(VectorSource(c(extract, extract2)))
news.tdm <- TermDocumentMatrix(news.corpus,
  control = list(removePunctuation = TRUE,
                 stopwords = TRUE,
                 stripWhitespace = TRUE,
                 stemDocument = TRUE))

# Now inspect the result
findFreqTerms(news.tdm, 4)

Here is the output that last line produces:

[1] "acadine"       "adobe"         "android"       "browser"       "challenge"     "companies"     "company"       "devices"       "firefox"       "flash"        
[11] "funding"       "gong"          "hackers"       "international" "ios"           "like"          "million"       "mobile"        "mozilla"       "mozillas"     
[21] "new"           "online"        "operating"     "said"          "security"      "smartphones"   "software"      "startup"       "system"        "systems"      
[31] "tsinghua"      "unigroup"      "used"          "users"         "videos"        "web"           "will"

In line 1, for example, we see "companies" and "company", and we see "devices". I thought stemming would reduce "companies" and "company" to the same stem ("compani"?), and I thought it would trim the "s" off plurals like "devices". Am I wrong about that? If not, why isn't this code producing the desired result here?

Possibly useful http://stackoverflow.com/questions/7263478/snowball-stemmer-only-stems-last-word and a function: stemDocumentfix <- function(x){ PlainTextDocument(paste(stemDocument(unlist(strsplit(as.character(x), " "))),collapse=' ')) } — lawyeR, Jul 15 '15 at 19:02

lukeA · Accepted Answer · 2015-07-16T04:39:45.533

Use stemming = TRUE or stemming = stemDocument instead of stemDocument = TRUE. (?termFreq shows that stemDocument is not a valid control parameter.)
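A minimal sketch of that fix, assuming the tm and SnowballC packages are installed. The two short texts are made-up stand-ins for the scraped articles, and the object names (texts, corpus, tdm) are only illustrative. stripWhitespace is left out because, as far as I can tell, it is not among the control options documented in ?termFreq either.

```r
library(tm)  # stemming = TRUE also needs the SnowballC package installed

# Hypothetical stand-in texts, used here instead of the scraped news stories
texts <- c("The company sells devices. Other companies sell devices too.",
           "Each company ships a device; many companies ship many devices.")

corpus <- VCorpus(VectorSource(texts))

# Same call as in the question, but with the documented 'stemming' option
tdm <- TermDocumentMatrix(corpus,
  control = list(removePunctuation = TRUE,
                 stopwords = TRUE,
                 stemming = TRUE))

# List the terms that ended up in the matrix
Terms(tdm)
```

With stemming switched on this way, "companies" and "company" should collapse into the single stem "compani", and "devices" and "device" into "devic".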

edited Jul 16 '15 at 04:39

answered Jul 15 '15 at 19:22
