I am trying to prepare a dataset for machine learning. As part of the process I would like to remove (stop) words that have few occurrences (often caused by bad OCR readings). Currently I have a list of approximately 1 million words that I want to remove.
However, processing my dataset with this setup takes a long time.
library(stringi)

# generate the stopword list
b <- stri_rand_strings(1000000, 4, pattern = "[A-Za-z0-9]")

# remove the stopwords from the dataset
system.time({
  a <- stri_rand_strings(10, 4, pattern = "[A-Za-z0-9]")
  c <- a[!(a %in% b)]
  c
})
user system elapsed
0.14 0.00 0.14
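For reference, the toy example above only looks up 10 tokens. A sketch that scales the input vector up toward the real dataset size (1 million tokens, random data rather than my actual corpus) makes the scaling behaviour easier to measure:

```r
library(stringi)

# stopword list, as above
b <- stri_rand_strings(1000000, 4, pattern = "[A-Za-z0-9]")

# input vector sized like the real dataset (random placeholder data)
a <- stri_rand_strings(1000000, 4, pattern = "[A-Za-z0-9]")

# time the same filtering step at realistic scale
system.time({
  c <- a[!(a %in% b)]
})
```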
It seems like 'a %in% b' is far from O(N). It is not feasible to run this on the entire dataset, as the process does not complete within several hours.
Is there a more efficient way to compare two vectors in R?
I suspect it should be very fast, since it is essentially a hash lookup. A test with a Dictionary in C# completed within a few minutes.
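One alternative I have looked at, but not yet verified on the full dataset, is the `fastmatch` package, whose `fmatch()` builds a hash table for the lookup vector once and caches it (a sketch, assuming the package is installed; `a` and `b` stand in for my real data):

```r
library(stringi)
library(fastmatch)  # assumption: fastmatch is installed

# stopword list and a random placeholder input vector
b <- stri_rand_strings(1000000, 4, pattern = "[A-Za-z0-9]")
a <- stri_rand_strings(1000000, 4, pattern = "[A-Za-z0-9]")

system.time({
  # fmatch() returns NA for tokens not found in b,
  # so keeping the NA positions drops the stopwords
  c <- a[is.na(fmatch(a, b))]
})
```

Since `fmatch()` caches the hash table on `b`, repeated filtering passes against the same stopword list should avoid rebuilding it each time.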