
Is there a more efficient way to achieve the following?

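library(dplyr)  # provides the %>% pipe
# Random test strings of 1 to 20 lowercase letters each
filers <- sapply(1:100, function(z) sample(letters, sample(1:20, 1), replace = TRUE) %>% paste(collapse = ''))
filers <- unique(filers)  # de-duplicate so that indices 1..n address distinct strings
n <- length(filers)
similarityMatrix <- matrix(0, nrow = n, ncol = n)
for (i in seq_len(n)) {
    for (j in seq_len(n)) {
        similarityMatrix[i, j] <- compare_strings(filers[i], filers[j])
    }
}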

Note: `compare_strings` is a placeholder for the kind of pairwise operation I'm trying to perform. An earlier version of this question used the stringdist package, which caused confusion because stringdist already provides `stringdistmatrix`. My actual function has no matrix equivalent, so I have edited the question to reflect the comments below.

Steven Beaupré
tblznbits

1 Answer


This is probably not more efficient, but it is more readable. It also allocates the matrix for you:

similarityMatrix <- outer(filers, filers, FUN=compare_strings)
Matthew Lundberg
  • I haven't got it to work yet, but I'm almost positive this is what I wanted. Thanks! – tblznbits Sep 13 '15 at 23:28
  • The answer you provided is definitely more readable, and I like the fact that it allocates the matrix for you. However, it's still very inefficient and I'd prefer something a bit faster. – tblznbits Sep 14 '15 at 13:27
  • The only way to get something faster is to vectorize the string comparison. – Matthew Lundberg Sep 14 '15 at 13:30
  • Yeah, I tried that as well with `similarityMatrix <- outer(filers, filers, FUN=Vectorize(compare_strings))`. Is there a different way to vectorize? For reference, `filers` contains 1633 elements, so the calculation is pretty intensive. Can `outer` be parallelized? – tblznbits Sep 14 '15 at 13:38
  • "Vectorize" runs a loop, so that's no help at all. I mean, write the `compare_strings` function so that it is inherently vectorized. – Matthew Lundberg Sep 14 '15 at 13:44
  • Not sure if you're still following this, but I ended up doing it in Python. It has a library that implements string comparison in C, so it's extremely fast. I just used `ratios = [compare_strings(s1, s2) for s1 in series for s2 in series]` and it finishes in less than two minutes. Thanks again for your insights into this. – tblznbits Sep 15 '15 at 23:34
  • That's a good solution. Since Python works for you, I recommend doing two things: adding your solution as an answer, and adding the Python tag to the question. It's up to you whether to accept your own solution. There are no points in it, but it shows future readers what worked best for your problem. – Matthew Lundberg Sep 16 '15 at 00:21
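
To illustrate two ideas that follow from the comments (skipping redundant pairs, and writing the comparison so that it is inherently vectorized), here is a minimal R sketch. The question never defines `compare_strings`, so the Jaccard similarity on letter sets used below is a hypothetical stand-in, and `P`, `inter`, and `sizes` are names made up for this example. It assumes the metric is symmetric and returns 1 for identical strings, which may not hold for the real function.

```r
set.seed(42)  # reproducible toy data

# Toy data in the spirit of the question: unique random lowercase strings.
filers <- unique(vapply(
  1:100,
  function(z) paste(sample(letters, sample(1:20, 1), replace = TRUE), collapse = ""),
  character(1)
))
n <- length(filers)

# Hypothetical scalar metric (an assumption, not the asker's function):
# Jaccard similarity between the sets of letters used by each string.
compare_strings <- function(a, b) {
  ca <- unique(strsplit(a, "")[[1]])
  cb <- unique(strsplit(b, "")[[1]])
  length(intersect(ca, cb)) / length(union(ca, cb))
}

# Idea 1: for a symmetric metric, evaluate only the pairs with i < j
# and mirror the results into the lower triangle.
sim <- diag(1, n)  # assumes compare_strings(x, x) == 1
idx <- which(upper.tri(sim), arr.ind = TRUE)  # two columns: row, col
vals <- mapply(compare_strings, filers[idx[, 1]], filers[idx[, 2]],
               USE.NAMES = FALSE)
sim[idx] <- vals
sim[idx[, 2:1]] <- vals  # swapped indices address the lower triangle

# Idea 2: express the same metric with matrix operations, so that no
# R-level loop over pairs is needed.
# P[i, k] is 1 if string i contains letters[k], and 0 otherwise.
P <- t(vapply(strsplit(filers, ""),
              function(ch) as.numeric(letters %in% ch),
              numeric(26)))
inter <- tcrossprod(P)  # inter[i, j] = number of shared letters
sizes <- rowSums(P)     # number of distinct letters per string
sim_vec <- inter / (outer(sizes, sizes, "+") - inter)

# Both routes should agree on this toy metric.
stopifnot(isTRUE(all.equal(sim, sim_vec)))
```

For a metric that cannot be written as matrix algebra, only the first idea applies. It roughly halves the number of calls but is still an R-level loop through `mapply`, which is consistent with the point above that `Vectorize` by itself does not make the computation faster.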