I'm trying to stem ~4000 documents in R, by using the stri_replace_all_fixed function. However, it is VERY slow, since my dictionary of stemmed words consists of approx. 300k words. I am doing this because the documents are in danish and therefore the Porter Stemmer Algortihm is not useful (it is too aggressive).
I have posted the code below. Does anyone know an alternative for doing this?
Logic: Look at each word in each document -> If word = word from voc-table, then replace with tran-word.
##Read in the dictionary
voc <- read.table("danish.csv", header = TRUE, sep=";")
#Using the library 'stringi' to make the stemming
library(stringi)
#Split the voc corpus and put the word and stem column into different corpus
word <- Corpus(VectorSource(voc))[1]
tran <- Corpus(VectorSource(voc))[2]
#Using stri_replace_all_fixed to stem words
## !! NOTE THAT THE FOLLOWING STEP MIGHT TAKE A FEW MINUTES DEPENDING ON THE SIZE !! ##
docs <- tm_map(docs, function(x) stri_replace_all_fixed(x, word, tran, vectorize_all = FALSE))
Structure of "voc" data frame:
Word Stem
1 abandonnere abandonner
2 abandonnerede abandonner
3 abandonnerende abandonner
...
313273 åsyns åsyn