The agrep function in R is implemented in C, so it runs as compiled code. However, I see a significant (order-of-magnitude) performance gap between calling agrep from within R and making a direct system call to the agrep command-line executable (tested only on Linux so far).
The essence of my code is this (x is a vector of 250K strings; xNoisy is a vector of 1000 strings randomly sampled from x, each modified by a few random characters):
system.time(sapply(xNoisy, FUN = agrep, x = x, max.distance = 2))
system.time(for (p in xNoisy) tmp <- system(paste0("agrep -2 ", p, " strings.txt"), intern = TRUE))
(Here strings.txt is the file containing the strings in x.) The first line takes 700 seconds, the second only 10 (!) seconds. Why is this, and is there any way to get closer to the performance of the Linux agrep from within R?
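For anyone who wants to reproduce the timings, here is a minimal sketch for generating test data of the shape described above. The string length (20), the lowercase alphabet, the seed, and the helper names randomStrings and addNoise are all assumptions made for illustration, not the original data.

```r
set.seed(1)  # assumed seed, only so the sketch is repeatable

# Hypothetical helper: n random lowercase strings of length len
randomStrings <- function(n, len = 20) {
  replicate(n, paste(sample(letters, len, replace = TRUE), collapse = ""))
}

# Hypothetical helper: overwrite k randomly chosen positions with random
# letters (so the result differs from s in at most k characters)
addNoise <- function(s, k = 2) {
  chars <- strsplit(s, "")[[1]]
  pos <- sample(length(chars), k)
  chars[pos] <- sample(letters, k, replace = TRUE)
  paste(chars, collapse = "")
}

x <- randomStrings(250000)
xNoisy <- vapply(sample(x, 1000), addNoise, character(1), USE.NAMES = FALSE)

# File read by the command-line agrep in the second timing
writeLines(x, "strings.txt")
```

Because the noisy strings here contain only lowercase letters, they can be pasted into the shell command as-is; with arbitrary strings, wrapping the pattern in shQuote() would be safer.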