I am trying to do some very simple word stemming in R and am getting an unexpected result. In the code below, the 'complete' variable is 'NA'. Why can't I complete the stem of the word 'easy'?

library(tm) 
library(SnowballC)
dict <- c("easy")
stem <- stemDocument(dict, language = "english")
complete <- stemCompletion(stem, dictionary=dict)

Thank You!

user2630162

2 Answers


You can see the internals of the stemCompletion() function with tm:::stemCompletion.

function (x, dictionary, type = c("prevalent", "first", "longest", "none", "random", "shortest")){
if(inherits(dictionary, "Corpus")) 
  dictionary <- unique(unlist(lapply(dictionary, words)))
type <- match.arg(type)
possibleCompletions <- lapply(x, function(w) grep(sprintf("^%s",w), dictionary, value = TRUE))
switch(type, first = {
  setNames(sapply(possibleCompletions, "[", 1), x)
}, longest = {
  ordering <- lapply(possibleCompletions, function(x) order(nchar(x), 
      decreasing = TRUE))
  possibleCompletions <- mapply(function(x, id) x[id], 
      possibleCompletions, ordering, SIMPLIFY = FALSE)
  setNames(sapply(possibleCompletions, "[", 1), x)
}, none = {
  setNames(x, x)
}, prevalent = {
  possibleCompletions <- lapply(possibleCompletions, function(x) sort(table(x), 
      decreasing = TRUE))
  n <- names(sapply(possibleCompletions, "[", 1))
  setNames(if (length(n)) n else rep(NA, length(x)), x)
}, random = {
  setNames(sapply(possibleCompletions, function(x) {
      if (length(x)) sample(x, 1) else NA
  }), x)
}, shortest = {
  ordering <- lapply(possibleCompletions, function(x) order(nchar(x)))
  possibleCompletions <- mapply(function(x, id) x[id], 
      possibleCompletions, ordering, SIMPLIFY = FALSE)
  setNames(sapply(possibleCompletions, "[", 1), x)
})

}

The x argument is your stemmed terms and dictionary is the unstemmed terms. The only line that matters is the fifth: it does an anchored regex match, keeping for each stem the dictionary terms that begin with that stem.

possibleCompletions <- lapply(x, function(w) grep(sprintf("^%s",w), dictionary, value = TRUE))

Therefore it fails: the pattern "^easi" does not match "easy", so no completion is found. If you also have the word "easiest" in your dictionary, then both stems get a completion, since there is now a dictionary word that begins with the same four letters.

library(tm) 
library(SnowballC)
dict <- c("easy","easiest")
stem <- stemDocument(dict, language = "english")
complete <- stemCompletion(stem, dictionary=dict)
complete
     easi   easiest 
"easiest" "easiest" 
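
As a minimal sketch, the prefix match can be reproduced in base R without loading tm. The stems are hard-coded from the output above (an assumption made so the snippet stands alone), and the variable names candidates and no_match are only illustrative.

```r
# Reproduce the prefix match from stemCompletion() using base R only.
# The stems are copied from the output shown above; nothing here calls tm.
dict  <- c("easy", "easiest")
stems <- c("easi", "easiest")

# For each stem, keep the dictionary words that start with it.
candidates <- lapply(stems, function(w) grep(sprintf("^%s", w), dict, value = TRUE))
names(candidates) <- stems
print(candidates)

# With only "easy" in the dictionary, no word starts with "easi",
# so grep() returns an empty character vector.
no_match <- grep("^easi", "easy", value = TRUE)
print(length(no_match))
```

An empty candidate set is what ends up as NA in the original example.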
christopherlovell
  • Thank you for the explanation! I guess I should now look into the stem function to see why it actually stems the word 'easy' into 'easi'. – user2630162 Apr 21 '15 at 20:22

wordStem() seems to do it.

library(tm) 
library(SnowballC)
dict <- c("easy")
wordStem(dict)
[1] "easi"
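
One possible workaround, sketched here under assumptions, is to complete a stem by exact match against the stems of the dictionary words rather than by prefix match. The helper complete_by_stem() is hypothetical and not part of tm or SnowballC. The dictionary stem is hard-coded from the output above so the snippet runs in base R.

```r
# Hypothetical helper (not part of tm or SnowballC): complete each stem by
# exact match against the stems of the dictionary words.
complete_by_stem <- function(stems, dictionary, dict_stems) {
  # match() gives the position of the first dictionary stem equal to each
  # input stem, or NA when there is none.
  setNames(dictionary[match(stems, dict_stems)], stems)
}

dict       <- c("easy")
dict_stems <- c("easi")  # the wordStem(dict) result shown above, hard-coded
completed  <- complete_by_stem("easi", dict, dict_stems)
print(completed)
```

In practice dict_stems would presumably come from calling wordStem() on the dictionary. If several dictionary words share a stem, this sketch returns only the first one.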
cory
  • The stemming works. My point was that the stemCompletion does not. I would expect that the stemCompletion function would replace easi back to easy. Am I wrong thinking that it should? – user2630162 Apr 08 '15 at 19:21
  • Yeah, looks like it just fails for "easy". Try `dict <- c("easy", "easiest", "easier")` and rerun. Seems like it just can't figure out "easy" – cory Apr 08 '15 at 19:49
  • @cory this is exactly the case. See my answer for details – christopherlovell Apr 21 '15 at 18:34