Problems with stemming in text analysis (Swedish data)

Question

In the following code, my aim is to reduce words that share a stem to a single form. For example, the Swedish word kompis means "friend" in English, and kompisar and kompiserna are built on the same root.

rm(list=ls())
Sys.setlocale("LC_ALL","sv_SE.UTF-8")
library(tm)
library(SnowballC)
kompis <- c("kompisar", "kompis", "kompiserna")
stem_doc <- stemDocument(kompis, language="swedish")
stem_doc
[1] "kompis" "kompis" "kompis"

I created a sample text containing the words kompis, kompisar, and kompiserna. Then I preprocessed the corpus with the following code:

text <- c("TV och vara med kompisar.",
          "Jobba på kompis huset",
          "Ta det lugnt, umgås med kompisar.",
          "Umgås med kompisar, vänner ",
          "kolla anime med kompiserna")
corpus.prep <- Corpus(VectorSource(text), readerControl = list(reader = readPlain, language = "swe"))
corpus.prep <- tm_map(corpus.prep, PlainTextDocument)
corpus.prep <- tm_map(corpus.prep, stemDocument,language = "swedish")
head(content(corpus.prep[[1]]))

The results are as follows. However, the output still contains the original words rather than the common stem, kompis.

[1] "TV och vara med kompisar."
[2] "Jobba på kompi huset"
[3] "Ta det lugnt, umgå med kompisar."
[4] "Umgås med kompisar, vänner"
[5] "kolla anim med kompiserna"

Do you know how to fix it?

Not sure if this makes a difference, but try corpus.prep <- tm_map(corpus.prep, function(f) stemDocument(f, language = "swedish")). — TinglTanglBob, Oct 17 '18 at 14:26
At the moment I am only using tm and stm, and would like to try others in the future. Also, I think wrapping the call in a function does not change much. — Annika Magnusson, Oct 18 '18 at 18:22

score 1 · Answer 1 · answered Oct 17 '18 at 15:04

Using tidytext, see issue #17

library(dplyr)
library(tidytext)
library(SnowballC)

txt <- c("TV och vara med kompisar.",
         "Jobba på kompis huset",
         "Ta det lugnt, umgås med kompisar.",
         "Umgås med kompisar, vänner ",
         "kolla anime med kompiserna")

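# tibble() replaces the deprecated data_frame()
tibble(txt = txt) %>%
  unnest_tokens(word, txt) %>%               # one lowercased token per row
  mutate(word = wordStem(word, "swedish"))   # stem each token with the Swedish stemmer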

The wordStem function is from the SnowballC package, which supports multiple languages; see getStemLanguages() for the list.
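
To check what the stemmer does on its own, outside any corpus pipeline, it may help to call it directly. This is a minimal sketch using the three word forms from the question; the variable names are only illustrative.

```r
library(SnowballC)

# List the languages supported by the installed SnowballC version.
langs <- getStemLanguages()
print(langs)

# Stem the three forms from the question with the Swedish stemmer.
forms <- c("kompisar", "kompis", "kompiserna")
stems <- wordStem(forms, language = "swedish")
print(stems)
```

If all three forms collapse to the same stem here, any difference seen in a corpus pipeline most likely comes from the steps around the stemmer rather than from the stemmer itself.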

Thanks, this works well. It is interesting to see how sensitive the results are to the order of the processing steps. I saw the same with stopwords. — Annika Magnusson, Oct 18 '18 at 18:23

score 1 · Accepted Answer · answered Oct 17 '18 at 16:42

You are almost there, but using PlainTextDocument is interfering with your goal.

The following code returns your expected result. I'm using removePunctuation because otherwise stemming will not work on the words at the end of a sentence, where the trailing punctuation stays attached to the token. You will also see warning messages after both tm_map calls; you can ignore these.

corpus.prep <- Corpus(VectorSource(text), readerControl = list(reader = readPlain, language = "swe"))
corpus.prep <- tm_map(corpus.prep, removePunctuation)
corpus.prep <- tm_map(corpus.prep, stemDocument, language = "swedish")

head(content(corpus.prep))

[1] "TV och var med kompis"         "Jobb på kompis huset"          "Ta det lugnt umgås med kompis" "Umgås med kompis vänn"        
[5] "koll anim med kompis"
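
The effect of punctuation on stemming can be seen by calling the stemmer directly on single tokens. This is a rough sketch: the gsub call is a base-R stand-in for tm's removePunctuation, used here only to keep the example small, and the variable names are illustrative.

```r
library(SnowballC)

# Tokens as they appear after splitting on whitespace only:
# the trailing punctuation is still attached.
with_punct <- c("kompisar.", "kompisar,")

# Assumption: a base-R stand-in for tm::removePunctuation.
without_punct <- gsub("[[:punct:]]", "", with_punct)

stemmed_raw   <- wordStem(with_punct, language = "swedish")
stemmed_clean <- wordStem(without_punct, language = "swedish")

print(stemmed_raw)    # punctuation still attached, so the suffix is not at the end of the token
print(stemmed_clean)  # suffix is at the end of the token and can be stripped
```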

For this kind of work I tend to use quanteda. In my experience it is better supported and works a lot better than tm.

library(quanteda)

# remove_punct not really needed as quanteda treats the "." as a separate token.
my_dfm <- dfm(text, remove_punct = TRUE) 
dfm_wordstem(my_dfm, language = "swedish")

Document-feature matrix of: 5 documents, 15 features (69.3% sparse).
5 x 15 sparse Matrix of class "dfm"
       features
docs    tv och var med kompis jobb på huset ta det lugnt umgås vänn koll anim
  text1  1   1   1   1      1    0  0     0  0   0     0     0    0    0    0
  text2  0   0   0   0      1    1  1     1  0   0     0     0    0    0    0
  text3  0   0   0   1      1    0  0     0  1   1     1     1    0    0    0
  text4  0   0   0   1      1    0  0     0  0   0     0     1    1    0    0
  text5  0   0   0   1      1    0  0     0  0   0     0     0    0    1    1
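
In quanteda version 3 and later, calling dfm() directly on a character vector is deprecated, so the same result would probably be built by tokenizing first. The following is a sketch under that assumption, not a tested replacement for the code above; it lowercases before stemming to mirror the order used there.

```r
library(quanteda)

text <- c("TV och vara med kompisar.",
          "Jobba på kompis huset",
          "Ta det lugnt, umgås med kompisar.",
          "Umgås med kompisar, vänner ",
          "kolla anime med kompiserna")

# Tokenize, dropping punctuation tokens.
toks <- tokens(text, remove_punct = TRUE)

# Lowercase, then stem each token with the Swedish Snowball stemmer.
toks <- tokens_tolower(toks)
toks <- tokens_wordstem(toks, language = "swedish")

# Build the document-feature matrix from the stemmed tokens.
my_dfm <- dfm(toks)
print(my_dfm)
```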

This is what I needed. Many thanks for suggesting quanteda as well. — Annika Magnusson, Oct 18 '18 at 18:24
