
I'm trying to stem ~4000 documents in R using the stri_replace_all_fixed function. However, it is VERY slow, since my dictionary of stemmed words contains approx. 300k words. I am doing this because the documents are in Danish, so the Porter stemming algorithm is not useful (it is too aggressive).

I have posted the code below. Does anyone know an alternative for doing this?

Logic: look at each word in each document -> if the word matches a word in the voc table, replace it with the corresponding tran word.

## Read in the dictionary of word/stem pairs
voc <- read.table("danish.csv", header = TRUE, sep = ";", stringsAsFactors = FALSE)
## 'stringi' does the replacement, 'tm' provides the corpus and tm_map
library(stringi)
library(tm)
## Pull the word and stem columns out of the dictionary as character vectors
word <- voc$Word
tran <- voc$Stem
## Use stri_replace_all_fixed to replace every word with its stem
## !! NOTE THAT THE FOLLOWING STEP MIGHT TAKE A FEW MINUTES DEPENDING ON THE SIZE !! ##
docs <- tm_map(docs, content_transformer(function(x)
  stri_replace_all_fixed(x, word, tran, vectorize_all = FALSE)))

Structure of "voc" data frame:

       Word           Stem
1      abandonnere    abandonner
2      abandonnerede  abandonner
3      abandonnerende abandonner
...
313273 åsyns          åsyn

1 Answer


To make dictionary matching fast, you need a clever data structure such as a prefix tree. Running 300,000 separate search-and-replace passes simply does not scale.
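As a rough illustration of the "look the token up in the dictionary instead of scanning for 300k patterns" idea, here is a minimal sketch in plain R. It uses a hash-based lookup via match() rather than a prefix tree, and it assumes voc is the dictionary data frame from the question and docs is a plain character vector of documents (not a tm corpus); the variable names are illustrative only.

library(stringi)

word <- as.character(voc$Word)
stem <- as.character(voc$Stem)

## Tokenize all documents at once; ICU word boundaries keep whitespace and
## punctuation as their own tokens, so the text can be reassembled later.
token_lists <- stri_split_boundaries(docs, type = "word")
all_tokens  <- unlist(token_lists)

## One hashed lookup over the whole corpus instead of 300k replace passes.
idx <- match(all_tokens, word)
all_tokens[!is.na(idx)] <- stem[idx[!is.na(idx)]]

## Put each document back together from its (possibly replaced) tokens.
doc_id  <- rep(seq_along(token_lists), lengths(token_lists))
stemmed <- vapply(split(all_tokens, doc_id), stri_c, character(1), collapse = "")

If the documents live in a tm corpus, you could extract them first with something like sapply(docs, as.character) and wrap the stemmed strings back into a corpus afterwards.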

I don't think this will be efficient in pure R; you will probably need to write a C or C++ extension. You have many tiny operations, and the overhead of the R interpreter will kill you if you try to do this in pure R.
