
I am having trouble getting the right text after stemming in R. For example, 'papper' should stay 'papper' but instead shows up as 'papp', and 'projekt' becomes 'projek'.

The frequency cloud generated from the stemmed text shows these shortened forms, which lose the original meaning or become incomprehensible.

What can I do to fix this? I am using the latest version of SnowballC (0.6.0).

R Code:

library(tm)
library(SnowballC)
text_example <- c("projekt", "papper", "arbete")
stem_doc <- stemDocument(text_example, language="sv")
stem_doc

Expected:
stem_doc
[1] "projekt" "papper"  "arbete"

Actual:
stem_doc
[1] "projek" "papp"   "arbet"
Dejie
  • That's what stemmers do - they're a quick and dirty way of getting a single token out of a set of related terms, though this single token might not match anything in natural language. If you don't like it, you need to use morphological analysers and/or dictionaries (depending on the language), which are much more precise but more difficult to make. – Amadan May 13 '19 at 08:32
  • Also, check out the [difference between lemmatization and stemming](https://blog.bitext.com/what-is-the-difference-between-stemming-and-lemmatization/). It might help you get a clearer picture. – Newl May 13 '19 at 08:35
  • Check whether the package [udpipe](https://bnosac.github.io/udpipe/en/index.html) can help you out. With it you can do tokenization and lemmatization in Swedish. – phiver May 13 '19 at 08:36

1 Answer


What you want here is not stemming but lemmatization (see @Newl's link for the difference).

To get the correct lemmas, you can use the R package udpipe, which is a wrapper around the UDPipe C++ library.

Here is a quick example of how you would do what you want:

# install.packages("udpipe")    
library(udpipe)
dl <- udpipe_download_model(language = "swedish-lines")
#> Downloading udpipe model from https://raw.githubusercontent.com/jwijffels/udpipe.models.ud.2.3/master/inst/udpipe-ud-2.3-181115/swedish-lines-ud-2.3-181115.udpipe to C:/Users/Johannes Gruber/AppData/Local/Temp/RtmpMhaF8L/reprex8e40d80ef3/swedish-lines-ud-2.3-181115.udpipe

udmodel_swed <- udpipe_load_model(file = dl$file_model)

text_example <- c("projekt", "papper", "arbete")

x <- udpipe_annotate(udmodel_swed, x = text_example)
x <- as.data.frame(x)
x$lemma
#> [1] "projekt" "papper"  "arbete"
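
Since the goal is a frequency cloud, here is a rough sketch of how the lemmas could be counted once you have the annotation table. It assumes the data frame has `lemma` and `upos` columns, as the one returned by `as.data.frame(udpipe_annotate(...))` does. The helper `lemma_frequencies` and its `keep_upos` argument are names made up for this illustration, and the small `annotated` data frame is hand-written stand-in data rather than real model output:

```r
# Count how often each lemma occurs in an annotation table.
# Assumption: `annotated` has the columns `lemma` and `upos`.
# `lemma_frequencies` and `keep_upos` are hypothetical names for this sketch.
lemma_frequencies <- function(annotated, keep_upos = c("NOUN", "VERB", "ADJ")) {
  # Keep only content words so function words do not dominate the cloud
  kept <- annotated[annotated$upos %in% keep_upos, , drop = FALSE]
  counts <- table(kept$lemma)
  freq <- data.frame(
    lemma = as.character(names(counts)),
    freq = as.integer(counts),
    stringsAsFactors = FALSE
  )
  # Most frequent first, ties broken alphabetically
  freq[order(-freq$freq, freq$lemma), , drop = FALSE]
}

# Hand-written stand-in for real udpipe output (illustrative values only)
annotated <- data.frame(
  token = c("projekten", "papper", "arbetet", "projekt", "och"),
  lemma = c("projekt", "papper", "arbete", "projekt", "och"),
  upos  = c("NOUN", "NOUN", "NOUN", "NOUN", "CCONJ"),
  stringsAsFactors = FALSE
)

freq <- lemma_frequencies(annotated)
freq
```

The resulting `lemma` and `freq` columns can then be passed to whatever plotting function you use for the cloud, for example as the `words` and `freq` arguments of `wordcloud::wordcloud()`. Filtering on `upos` is optional; drop it if you want every token counted.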
JBGruber
  • Thanks to Amadan, Newl, phiver and JBGruber for helping me understand the difference between stemming and lemmatization, and how to use udpipe for this. – Dejie May 13 '19 at 21:10