I am using the 'tm' package in R to create a term-document matrix of stemmed terms. The code runs without errors, but the resulting matrix includes terms that don't appear to have been stemmed. I'm trying to understand why that is and how to fix it.
Here is the script for the process, which uses a couple of online news stories as the sandbox:
library(boilerpipeR)
library(RCurl)
library(tm)
# Pull the relevant parts of the news stories using 'boilerpipeR' and 'RCurl'
url <- "http://blogs.wsj.com/digits/2015/07/14/google-mozilla-disable-flash-over-security-concerns/"
extract <- LargestContentExtractor(getURL(url))
url2 <- "http://www.cnet.com/news/startup-lands-100-million-to-challenge-smartphone-superpowers-apple-and-google/"
extract2 <- LargestContentExtractor(getURL(url2))
# Now put those text vectors in a corpus and create a tdm
news.corpus <- VCorpus(VectorSource(c(extract, extract2)))
news.tdm <- TermDocumentMatrix(news.corpus,
                               control = list(removePunctuation = TRUE,
                                              stopwords = TRUE,
                                              stripWhitespace = TRUE,
                                              stemDocument = TRUE))
# Now inspect the result
findFreqTerms(news.tdm, 4)
Here is the output that last line produces:
[1] "acadine" "adobe" "android" "browser" "challenge" "companies" "company" "devices" "firefox" "flash"
[11] "funding" "gong" "hackers" "international" "ios" "like" "million" "mobile" "mozilla" "mozillas"
[21] "new" "online" "operating" "said" "security" "smartphones" "software" "startup" "system" "systems"
[31] "tsinghua" "unigroup" "used" "users" "videos" "web" "will"
In the first line of output, for example, we see "companies" and "company", and we see "devices". I thought stemming would reduce "companies" and "company" to the same stem ("compani"?), and I thought it would trim the "s" off plurals like "devices". Am I wrong about that? If not, why isn't this code producing the desired result here?
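For reference, here is a minimal sketch of how that expectation about the stems could be checked in isolation, outside the term-document matrix. It assumes the SnowballC package is installed (an assumption on my part; it is not loaded in the script above), and the variable names `words` and `stems` are just illustrative.

```r
# Assumption: the SnowballC package is installed; it is not part of the script above.
library(SnowballC)

# Three of the words that appear unstemmed in the findFreqTerms() output
words <- c("companies", "company", "devices")

# Apply the Snowball English stemmer to each word directly
stems <- wordStem(words, language = "english")
print(stems)
```

If the stems printed here differ from the terms in the matrix, that would suggest the stemmer itself behaves as I expect and that the stemming step is simply not being applied when the matrix is built.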