The agrep function in R is based on C code and is executed as such. However, I notice a significant (order-of-magnitude) performance gap between calling agrep from within R and making a direct system call to the command-line agrep executable (tested only on Linux so far).

The essence of my code is this (x is a vector of 250K strings; xNoisy is a vector of 1000 strings randomly sampled from x, each modified by a few random characters):

system.time(sapply(xNoisy, FUN = agrep, x = x, max.distance = 2))
system.time(for (p in xNoisy) tmp <- system(paste0("agrep -2 ", p, " strings.txt"), intern = TRUE))

(Here strings.txt is the file containing the strings in x.) The first line takes 700 seconds, the second only 10 (!). Why is this, and is there any way to get closer to the performance of the Linux agrep from within R?
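
For reference, here is a minimal sketch of how x and xNoisy could be constructed; the 20-character strings, the seed, and the noisify helper are illustrative assumptions, not my actual data:

set.seed(1)  # illustrative seed, only for reproducibility of the sketch
x <- replicate(250000, paste(sample(letters, 20, replace = TRUE), collapse = ""))

noisify <- function(s, nEdits = 2) {  # hypothetical helper: corrupt a few characters
  for (i in sample(nchar(s), nEdits)) {
    substr(s, i, i) <- sample(letters, 1)
  }
  s
}
xNoisy <- vapply(sample(x, 1000), noisify, character(1), USE.NAMES = FALSE)

writeLines(x, "strings.txt")  # the file read by the command-line agrep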

Markus Loecher
  • This should probably be at stackoverflow – bdeonovic May 08 '14 at 14:06
  • If you benchmark, you should only time the functions of interest, i.e., don't use `sapply` in the first case and `for` in the second. Also, in your `for` loop you are overwriting `tmp` in each iteration, resulting in a much smaller object than what `sapply` has to handle. In summary, from your benchmarks I can't judge whether you have a valid question. – Roland May 08 '14 at 14:56
  • Thank you for the comments. I had started with a for loop in the first line and then tried to speed it up with sapply; the original code was `system.time(for (p in xNoisy) tmp <- agrep(p, x, max.distance = 2))`. The performance was almost the same, i.e., close to 700 secs. My conclusion is that either the overhead of calling the C function is large or the agrep function in R is not optimized (see the sketch after these comments). – Markus Loecher May 08 '14 at 15:20
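
As a possible workaround (an untested assumption on my part, not something confirmed above): R's agrep has a useBytes argument that skips per-character encoding handling, and since the 1000 lookups are independent they can be spread over cores with parallel::mclapply on Linux:

library(parallel)

# byte-wise matching; only safe if the strings are plain ASCII
res <- lapply(xNoisy, agrep, x = x, max.distance = 2, useBytes = TRUE)

# run the independent lookups on 4 cores (forking; Linux/macOS only)
res <- mclapply(xNoisy, agrep, x = x, max.distance = 2,
                useBytes = TRUE, mc.cores = 4)

Neither call changes which strings are matched (for ASCII data); they only change how the work is executed, so the timings stay directly comparable to the sapply version above.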

0 Answers