I recently developed a fuzzy string-matching routine in R on a Windows box and was really pleased by its speed. Now I am trying to run the same procedure on a virtual Red Hat server, where it is much slower, by a factor of roughly 100. The whole procedure takes 1 hour on the Windows machine (6 cores, Intel, 3.4 GHz).

What I basically do is this:

location <- if (RB$ORT[x] == "n/a") {
  rep(NA, length(TAC$ORT))
} else {
  stringdist(RB$ORT[x], TAC$ORT, useBytes = TRUE)
}

On the Red Hat machine (14 cores, AMD, 2.6 GHz) I run R with OpenBLAS enabled. The R package stringdist is at version 0.9.4.1 on both machines. The above command is run a few million times. Oddly enough, it even seems to slow down over time. When the process starts, my log tells me:

get location right: 0.04 secs
engine used: tclget location right: 0.05 secs
engine used: tclget location right: 0.02 secs
engine used: tclget location right: 0.01 secs
engine used: tclget location right: 0.02 secs
engine used: tclget location right: 0.03 secs
engine used: tclget location right: 0.02 secs

After some hours it tells me:

get location right: 0.27 secs
get location right: 0.27 secs
get location right: 0.26 secs
engine used: tclget location right: 0.14 secs
get location right: 0.27 secs
engine used: tclget location right: 0.26 secs
engine used: tclget location right: 0.23 secs
engine used: tclget location right: 0.14 secs
get location right: 0.28 secs
get location right: 0.29 secs

On the Windows machine the log looks like this (6 processes are writing to it):

get location right: 0 secs
get location right: 0 secs
engine used: tclget location right: 0 secs
get location right: 0 secs
engine used: tclget location right: 0 secs
engine used: tclengine used: tclget location right: 0 secs
get location right: 0 secs
get location right: 0 secs

On the Windows machine we don't use Revolution R (or the Microsoft R Open variant thereof). I don't know whether it uses the MKL, but that should not matter when working with character vectors in R. Could some sort of encoding problem be the cause? When profiling with Rprof, the absolute times are not reported in the same way on Windows and Linux. In terms of relative time, only enc2utf8 figures more prominently on Linux.
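To make the encoding question concrete, here is a minimal sketch of how one might compare the locale and the declared string encodings on the two machines, and convert the lookup vector to UTF-8 once rather than on every call. The vector `tac_ort` is a hypothetical stand-in for TAC$ORT, and whether a one-off conversion actually removes the enc2utf8 cost seen in the profile is an assumption that would need to be checked with Rprof.

```r
# Hypothetical stand-in for TAC$ORT; the real data is not shown in the question.
tac_ort <- c("M\xfcnchen", "K\xf6ln", "Berlin")
Encoding(tac_ort) <- "latin1"   # declare the non-ASCII entries as Latin-1

# Run these on both machines and compare the results.
print(Sys.getlocale("LC_CTYPE"))   # active character-type locale
print(table(Encoding(tac_ort)))    # declared encodings of the strings

# Convert once, outside the loop that is executed millions of times.
tac_ort_utf8 <- enc2utf8(tac_ort)
print(table(Encoding(tac_ort_utf8)))
```

If the two machines report different locales or different declared encodings, that difference would be a reasonable first thing to rule out.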

Any other ideas? Thanks, Martin

exilsaxo
  • I forgot to mention that the R versions are 3.2.0 on Red Hat and 3.2.1 on Windows – exilsaxo Apr 06 '16 at 09:14
  • Try specifying the number of threads explicitly. `stringdist` gets the number of available threads using `parallel::detectCores()`. stringdist does not depend on BLAS or MKL: there is no linear algebra in it and all algorithms are implemented from scratch. –  Oct 17 '16 at 15:20
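
The suggestion to set the thread count explicitly could look roughly like the sketch below. It assumes the installed stringdist version exposes the `nthread` argument and the `sd_num_thread` option; the strings are made-up placeholders for RB$ORT[x] and TAC$ORT. Whether thread oversubscription on the virtual machine is really the cause of the slowdown is untested here.

```r
library(stringdist)

# Placeholder data standing in for RB$ORT[x] and TAC$ORT.
query      <- "Muenchen"
candidates <- c("Munchen", "Koeln", "Berlin")

# Per call: restrict stringdist to a single thread.
d <- stringdist(query, candidates, useBytes = TRUE, nthread = 1)
print(d)

# Or once per session, so that every later call picks it up by default.
options(sd_num_thread = 1)
d_default <- stringdist(query, candidates, useBytes = TRUE)
print(d_default)
```

When several R processes already run side by side, one thread per process is a sensible starting point; timing a few thousand calls with different `nthread` values would show whether it matters on the server.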

0 Answers