I am trying to prepare a dataset for machine learning. In the process I would like to remove (stop) words that have few occurrences (often the result of bad OCR readings). Currently I have a list of approximately 1 million words that I want to remove.

However, it takes a long time to process my dataset with this setup.

library(stringi)
#generate the stopword list
b <- stri_rand_strings(1000000, 4, pattern = "[A-Za-z0-9]")

#remove stopwords from the dataset
system.time({
  a <- stri_rand_strings(10, 4, pattern = "[A-Za-z0-9]") 
  c <- a[!(a %in% b)]
  c
})

user  system elapsed 
0.14    0.00    0.14 

It seems like `a %in% b` is far from O(N). It is not feasible to run this on the entire dataset, as the process does not complete within several hours.

Are there more efficient ways to compare two vectors in R?

I suspect it should be very fast, as it is a lookup. I did a test with a Dictionary in C#, which completes within a few minutes.

bartektartanus
henrikwh
  • Try `%chin%` from `data.table`; it should be faster, e.g. `system.time({ c <- a[!(a %chin% b)]}) # user system elapsed 0.01 0.00 0.02`, compared to `0.13` with `%in%` – akrun Aug 15 '16 at 12:11

1 Answer

stringi search functions such as stri_detect_fixed are much faster than the %in% operator. Maybe this will help you:

  1. paste all your stopwords together, using a separator that none of the words contain; this creates one long string
  2. use stri_detect_fixed on this long string

This solution turns out to be about twice as fast, or even about twenty times faster if the stopword vector is pasted once and reused.

A code example with benchmarks:

library(stringi)
library(microbenchmark)
#generate the stopword list
b <- stri_rand_strings(1000000, 4, pattern = "[A-Za-z0-9]")
a <- stri_rand_strings(10, 4, pattern = "[A-Za-z0-9]") 

#base R solution
f1 <- function(a,b){
  a[!(a %in% b)]
}

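# paste inside function
f2 <- function(a,b){
  c <- stri_paste(b, collapse = ";")
  # keep only the words NOT found in the pasted stopword string
  a[!stri_detect_fixed(c, a)]
}

# paste before and use it later
c <- stri_paste(b, collapse = ";")
f3 <- function(a, c){
  # negate the match so that stopwords are removed, as in f1
  a[!stri_detect_fixed(c, a)]
}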

microbenchmark(f1(a,b), f2(a,b), f3(a,c))
# Unit: milliseconds
#      expr      min        lq       mean     median         uq       max neval
#  f1(a, b) 63.36563 67.931506 102.725257 116.128525 129.665107 208.46003   100
#  f2(a, b) 52.95146 53.983946  58.490224  55.860070  59.863900  89.41197   100
#  f3(a, c)  3.70709  3.831064   4.364609   4.023057   4.310221  10.77031   100
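
One caveat: stri_detect_fixed looks for substrings. In the benchmark every word has exactly 4 characters, so a match always lines up with a whole stopword. Real words vary in length, though, and a short word could match inside a longer stopword. A minimal sketch of one way around this, assuming ";" never occurs in any word, is to wrap both the pasted string and each candidate word in the separator. The names `sep`, `stopwords`, `words` and `haystack` are made up for illustration:

```r
library(stringi)

sep <- ";"  # assumption: this separator occurs in no word

stopwords <- c("cat", "dog", "x9Qz")                 # illustrative stopword list
words     <- c("cat", "cats", "at", "x9Qz", "bird")  # illustrative input words

# Paste once, adding the separator at both ends so that every
# stopword is delimited on both sides: ";cat;dog;x9Qz;"
haystack <- stri_paste(sep, stri_paste(stopwords, collapse = sep), sep)

# Wrap each candidate in separators so that only whole-word matches count,
# e.g. ";at;" is not found even though "at" is a substring of "cat".
is_stop <- stri_detect_fixed(haystack, stri_paste(sep, words, sep))

kept <- words[!is_stop]
kept
# [1] "cats" "at"   "bird"
```

As with f3 above, `haystack` only needs to be built once and can then be reused for every batch of words.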
bartektartanus