I have already read this question and this one, but I still don't understand how to use stemDocument in tm_map. Consider this example:

q17 <- VCorpus(VectorSource(x = c("poder", "pode")),
               readerControl = list(language = "pt",
                                    load = TRUE))
lapply(q17, content)
$`character(0)`
[1] "poder"

$`character(0)`
[1] "pode"

If I use:

> stemDocument("poder", language = "portuguese")
[1] "pod"
> stemDocument("pode", language = "portuguese")
[1] "pod"

it does work! But if I use:

> q17 <- tm_map(q17, FUN = stemDocument, language = "portuguese")
> lapply(q17, content)
$`character(0)`
[1] "poder"

$`character(0)`
[1] "pode"

it doesn't work. Why so?

1 Answer

Unfortunately, you have stumbled on a bug. stemDocument works when you call it directly on a character vector and pass the language:

stemDocument(x = c("poder", "pode"), language = "pt")
[1] "pod" "pod"

But when you use it in tm_map, dispatch starts with stemDocument.PlainTextDocument. That function checks the language of the corpus against the language you supply, and this part works correctly. At the end, however, it passes everything on to stemDocument.character without the language argument, and stemDocument.character defaults to English. So within the tm_map call (or a DocumentTermMatrix call), the language you supply is silently replaced by English and the stemming doesn't work correctly.
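To make the mechanism concrete, here is a minimal base R sketch of the same pattern. All names in it (stem_demo, demo_doc, stem_demo_fixed) are hypothetical stand-ins for illustration, not tm's actual functions, and no real stemming is done; the "stemmer" only records which language it received.

```r
# Hypothetical stand-ins for illustration only; these are NOT tm's functions.
stem_demo <- function(x, ...) UseMethod("stem_demo")

# Character method: the language defaults to English.
stem_demo.character <- function(x, language = "english") {
  paste0(x, " [stemmed as ", language, "]")
}

# Document method: accepts a language but does not forward it.
stem_demo.demo_doc <- function(x, language = "english") {
  x$content <- stem_demo.character(x$content)  # language is dropped here
  x
}

# The same logic with the language forwarded to the character method.
stem_demo_fixed <- function(x, language = "english") {
  x$content <- stem_demo.character(x$content, language = language)
  x
}

doc <- structure(list(content = "poder"), class = "demo_doc")

dropped   <- stem_demo(doc, language = "portuguese")$content
forwarded <- stem_demo_fixed(doc, language = "portuguese")$content

print(dropped)    # "poder [stemmed as english]"
print(forwarded)  # "poder [stemmed as portuguese]"
```

The call on the document object accepts language = "portuguese" without complaint, yet the character method never sees it. That is why the problem is easy to miss.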

A workaround could be using the package quanteda:

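library(quanteda)
# tokenize first: newer quanteda versions expect a tokens object in dfm()
my_dfm <- dfm(tokens(c("poder", "pode")))
# stem the features with the Portuguese stemmer
my_dfm <- dfm_wordstem(my_dfm, language = "pt")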

my_dfm

Document-feature matrix of: 2 documents, 1 feature (0.0% sparse).
2 x 1 sparse Matrix of class "dfm"
       features
docs    pod
  text1   1
  text2   1

Since you are working with Portuguese, I suggest using the packages quanteda, udpipe, or both. Both packages handle non-English languages a lot better than tm.
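If switching packages is not an option, one possible way to stay within tm is to stem the character content yourself, so that the language goes straight to the character method. This is only a sketch. It assumes that tm's content_transformer() wrapper is available in your version and that the SnowballC package is installed. The helper name stem_pt is made up for this example.

```r
library(tm)  # stemDocument relies on the SnowballC package being installed

q17 <- VCorpus(VectorSource(c("poder", "pode")))

# Wrap stemDocument so it runs on the character content of each document;
# the language is then passed directly to the character method.
# stem_pt is a hypothetical helper name.
stem_pt <- content_transformer(function(x) stemDocument(x, language = "portuguese"))
q17_stemmed <- tm_map(q17, stem_pt)

stems <- unlist(lapply(q17_stemmed, content))
print(stems)
```

Since calling stemDocument directly on "poder" and "pode" gives "pod", both documents should come out as "pod" here as well. Check this against your own tm version.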

phiver
  • Thanks @phiver! How can I get the word "pod" back to "pode" or "poder" in my dfm with `quanteda`? – Guilherme Parreira Jan 15 '19 at 16:40
  • @GuilhermeParreira, you can't. You would need to use the `stemCompletion` function from tm. I have never seen the point of restemming, but then I never stem outside of English, and even for English I find lemmas more useful. Lemmas and POS tagging are available in udpipe for Portuguese. You could use `stemCompletion` if you supply your own Portuguese dictionary. But read the documentation very carefully and check SO for some examples. It doesn't always behave as you expect, which is one of the reasons I tend to use packages other than tm. – phiver Jan 15 '19 at 17:28
  • My reason for restemming is to make the analysis easier for the researcher to understand. Do you have any code that shows how to do it (starting from the `quanteda` package)? – Guilherme Parreira Jan 15 '19 at 18:33
  • If dict is your dictionary with the words (`dict <- c("poder", "pode")`), then you could use something like this: `stemCompletion(tokens_wordstem(tokens(x = c("poder", "pode")), language = "pt"), dict, type = "first")`. This would result in pode turning into poder, because of the type selection. Read the stemCompletion help to see which options are available. But for readability and understanding I would choose lemmatization. – phiver Jan 15 '19 at 18:47
  • Thanks!! I managed to do what I wanted: `dicionario <- featnames(my_dfm) # Original words` `my_dfm <- dfm_wordstem(my_dfm, language = "pt")` `radicais <- featnames(my_dfm) # Stemmed words` `new.words <- as.character(stemCompletion(radicais, dictionary = dicionario)) # Complete the stems` `my_dfm <- dfm_replace(my_dfm, radicais, new.words) # Get back the words after stemming` – Guilherme Parreira Jan 26 '19 at 13:33