I recently developed a fuzzy string-matching routine in R on a Windows box and was really pleased with its speed. Now I am trying to run the same procedure on a virtual Red Hat server, and it is much slower, by a factor of roughly 100. The whole procedure takes 1 hour on the Windows machine (6 cores, Intel, 3.4 GHz).
What I basically do is this:
location <- if (RB$ORT[x] == "n/a") {
  rep(NA, length(TAC$ORT))
} else {
  stringdist(RB$ORT[x], TAC$ORT, useBytes = TRUE)
}
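To make the step easier to reproduce on both machines, here is a minimal, self-contained sketch of it with per-call timing. The contents of RB and TAC are made up, and the names dist_fun and get_location are hypothetical helpers, not part of my actual code. If stringdist is not installed, the sketch falls back to base R's adist(), which is a different (generalized Levenshtein) distance and is only there so the example runs.

```r
# Made-up stand-ins for the real RB and TAC data frames (assumption).
RB  <- data.frame(ORT = c("Berlin", "n/a", "Hamburg"), stringsAsFactors = FALSE)
TAC <- data.frame(ORT = c("Berlin", "Bremen", "Hamburg", "Homburg"),
                  stringsAsFactors = FALSE)

# Use stringdist when it is installed; otherwise fall back to base R's adist(),
# which computes a generalized Levenshtein distance (not identical to stringdist's default).
dist_fun <- if (requireNamespace("stringdist", quietly = TRUE)) {
  function(a, b) stringdist::stringdist(a, b, useBytes = TRUE)
} else {
  function(a, b) as.vector(utils::adist(a, b))
}

# Hypothetical wrapper around the step shown above.
get_location <- function(x) {
  if (RB$ORT[x] == "n/a") {
    rep(NA, length(TAC$ORT))
  } else {
    dist_fun(RB$ORT[x], TAC$ORT)
  }
}

# Time each call separately, similar to the "get location right" log lines.
for (x in seq_along(RB$ORT)) {
  elapsed <- system.time(location <- get_location(x))[["elapsed"]]
  cat("get location right:", round(elapsed, 2), "secs\n")
}
```

Running the same script on Windows and on Linux with identical input should show whether the single call itself is slower or whether the slowdown only builds up over many iterations.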
On the Red Hat machine (14 cores, AMD, 2.6 GHz) I run R with OpenBLAS enabled. The R package stringdist is at version 0.9.4.1 on both machines. The above command is run a few million times. Oddly enough, it even seems to slow down over time. When the process starts, my log tells me:
get location right: 0.04 secs
engine used: tclget location right: 0.05 secs
engine used: tclget location right: 0.02 secs
engine used: tclget location right: 0.01 secs
engine used: tclget location right: 0.02 secs
engine used: tclget location right: 0.03 secs
engine used: tclget location right: 0.02 secs
After some hours it tells me:
get location right: 0.27 secs
get location right: 0.27 secs
get location right: 0.26 secs
engine used: tclget location right: 0.14 secs
get location right: 0.27 secs
engine used: tclget location right: 0.26 secs
engine used: tclget location right: 0.23 secs
engine used: tclget location right: 0.14 secs
get location right: 0.28 secs
get location right: 0.29 secs
On the Windows machine the log looks like this (6 processes are writing to it):
get location right: 0 secs
get location right: 0 secs
engine used: tclget location right: 0 secs
get location right: 0 secs
engine used: tclget location right: 0 secs
engine used: tclengine used: tclget location right: 0 secs
get location right: 0 secs
get location right: 0 secs
On the Windows machine we don't use Revolution R (or its Microsoft R Open variant). I don't know whether it uses the MKL, but that should not matter when working with the character class in R. Could some sort of encoding problem be the cause? When profiling with Rprof, the absolute times are not reported identically on Windows and Linux. In relative terms, only enc2utf8 seems to figure more prominently on Linux.
Any other ideas? Thanks, Martin
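To narrow down the encoding suspicion, something like the following could be run on both machines and the output compared. It is only a sketch using base R; the vector sample_strings is made up and merely stands in for TAC$ORT.

```r
# Hypothetical stand-in for TAC$ORT; replace with the real column.
sample_strings <- rep(c("M\u00fcnchen", "K\u00f6ln", "Berlin", "n/a"), times = 25000)

# Locale and native encoding of the R session (these often differ between
# Windows and Linux).
print(Sys.getlocale("LC_CTYPE"))
print(l10n_info())

# Declared encodings of the strings: "unknown", "UTF-8", "latin1" or "bytes".
enc_counts <- table(Encoding(sample_strings))
print(enc_counts)

# Time the conversion on its own, since enc2utf8 stood out in the Linux profile.
t_conv <- system.time(converted <- enc2utf8(sample_strings))
print(t_conv)

# The conversion should not change the number of elements.
same_length <- length(converted) == length(sample_strings)
```

If the declared encodings or the session locale differ between the two machines, that would at least be consistent with enc2utf8 taking a larger share of the time on Linux.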