
I have a loop in which I have to calculate the distance between one string and a vector of many strings. I use the package "stringdist" and the function of the same name, which works well.

However, calculating the distances takes quite some time on each iteration. For example, getting the distances between one word and 3.5 million other words takes roughly 0.5 seconds. That does not seem like much, but doing it 3.5 million times takes far too long.
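For reference, a minimal sketch of the kind of call I run on each iteration (the variable names are just illustrative, not my actual code):

```r
library(stringdist)

# one query string against a large character vector of words
query <- "GLhhLGbGchmYF"
words <- c("GLhhLGbGchmYF", "GLhhLGmLYcmYF", "GLhmcbLFF")  # ~3.5 million strings in the real data

d <- stringdist(query, words, method = "lv")  # Levenshtein distances
```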

I cannot compute a full distance matrix, as it would be 3.5 million by 3.5 million and far too large, which is why I split the calculation up in the first place.

Is there possibly a way to compute Levenshtein and/or Hamming distances using Rcpp to speed this up (a lot)?

I have tried the compiler package and `cmpfun`, but this does not change the speed. I assume I would have to rewrite the stringdist function in C++? Unfortunately, I have no idea how.

The stringdist call is the part of the loop that takes >95% of the time of each loop step, so a reduction there would help immensely.

Any help would be appreciated.

Edit 1:

Here is a small vector of strings as an example:

bioID
  [1] "F"                 "FPhGLGcbLFF"       "FhGWhSLYcmYF"      "FhGcGbGYhF"        "GGLcSFhGmhLYF"     "GGbhhcLFF"        
  [7] "GLWGcGLmhcLFF"     "GLYmcmFF"          "GLbcmYF"           "GLhhFGmGccmFF"     "GLhhLGYLbGmFF"     "GLhhLGbGchmYF"    
 [13] "GLhhLGmLYcmYF"     "GLhhLLLGmcmFF"     "GLhhLhGGGcmYF"     "GLhhPPmmchmYF"     "GLhhmGbGLcmYF"     "GLhmYbGmmPmbF"    
 [19] "GLhmcbLFF"         "GPhbhYmhPLbF"      "GbhmLFF"           "GhhYcmYF"          "GmGhGYhcLFF"       "GmbmbmhcLFF"      
 [25] "LGGYmcmFF"         "LGLGmPmbF"         "LGbF"              "LGhbLchmYF"        "LLGLYhGcLFF"       "LLPGhhbPLmcmFF"   
 [31] "LLcmmPPmhcLFF"     "LLhhLLGLhhYmcmFF"  "LPPhcbLFF"         "LYcmYF"            "LbGmmPmbF"         "LbLLGGccmYF"      
 [37] "LhPbGchmYF"        "LhbGbmYGYhF"       "LmhGLmLLhF"        "PGYLhGcGYhF"       "PLhhLLGLhhYmcmFF"  "PLhhchhGhGLccmFF" 
 [43] "PLhmGLhhPmGGcmFF"  "PbLhhbLmhGcLFF"    "PbbcbGbcLGYhF"     "PbhLcLGmhLYF"      "PcLFF"             "PcPcLFF"          
 [49] "PhbcLSmcmFF"       "PmYcmYF"           "PmbF"              "SFFbmbhLYcmYF"     "SGGGbhchmYF"       "SGGPhLGcbLFF"     
 [55] "SGGmGcmhGcLFF"     "SGLGcFGhcLFF"      "SGLGmGLGcmYF"      "SGLLGhGmhLYF"      "SGLPbPmYmcmFF"     "SGLWhGcGbLFF"     
 [61] "SGLmmLmhcLFF"      "SGPLbbmmPmbF"      "SGPmhLcbcchmYF"    "SGSGGbLhchmYF"     "SGWGYLGcmYF"       "SGWhLbPLbcmYF"    
 [67] "SGbGGmhLYF"        "SGcbLFGmmPmbF"     "SGcmWGGGLLbF"      "SGhLLGGLmcmYF"     "SGhbhGPcYcmYF"     "SGmGGLLFLYmcmFF"  
 [73] "SGmLGLLmPmbF"      "SLFGGhGbbLcLFF"    "SLFGbGFhcLFF"      "SLFGmGGhGLmLLhF"   "SLFPFbcLLLYcmYF"   "SLFPLLGGhchmYF"   
 [79] "SLFSFbcLFF"        "SLFbGhcmYGYhF"     "SLFbGmLYGYhF"      "SLFcGGLccbLFF"     "SLFhGLLmhcLFF"     "SLFmGLbcmGmcmFF"  
 [85] "SLFmPchmYcmYF"     "SLFmbPLGLmLLhF"    "SLGGGLLFYmcmFF"    "SLGGGLLGmhcLFF"    "SLGGGLmcbLFF"      "SLGGGYmcmFF"      
 [91] "SLGGGhGLmLLhF"     "SLGGGhLcYmcmFF"    "SLGGGhhcLFF"       "SLGGLGYhmcmFF"     "SLGGLLGcYmcmFF"    "SLGGLLhGcLFF"     
 [97] "SLGGLhFhcLFF"      "SLGGSGLhGhhYmcmFF" "SLGGbLYcmYF"       "SLGGbbcLYGYhF"    

Edit 2: I used @Andrew's comment to rethink the problem, and here is how I am working with it now:

As I only need distances of exactly 1 right now anyway, I only calculate the distance between the string at hand and strings that are the same length, one character shorter, or one character longer. That reduces the time significantly. It still takes a very long time for large datasets, but it already helps.
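A sketch of that length-filtering step (assuming the strings are stored in a character vector like the `bioID` example above; the helper names are just illustrative):

```r
library(stringdist)

lens  <- nchar(bioID)
query <- bioID[1]

# a Levenshtein distance of exactly 1 is only possible if the
# lengths differ by at most one character
candidates <- bioID[abs(lens - nchar(query)) <= 1]

d    <- stringdist(query, candidates, method = "lv")
hits <- candidates[d == 1]
```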

  • The `stringdist` function is written in C (from the source code: `.Call("R_stringdist"..`), so I don't think you will get an improvement. It sounds like an embarrassingly parallel problem, so parallelisation could give you some time improvements. – LyzandeR Jun 12 '19 at 13:28
  • Have you profiled the code? Where are the bottlenecks? You will need a better algorithm if they are in the compiled part of the stringdist package. – Ralf Stubner Jun 12 '19 at 13:29
  • @LyzandeR thank you for the quick feedback, do you have an idea how to approach this differently? Or do I have to live with the fact that it just takes time to calculate 3.5 million distances... ? – Nicholas Jun 12 '19 at 13:29
  • @RalfStubner do you know of any other/better algorithm to calculate string distances? I would be happy to try them. – Nicholas Jun 12 '19 at 13:31
  • It sounds like you can parallelise it. It will reduce the time significantly. If you can access cloud infrastructure you would be able to do this fairly quickly (based on the explanation in your question), but if you rely on a single core, I think you would have to live with the fact that distance calculations take time. – LyzandeR Jun 12 '19 at 13:32
  • @LyzandeR I had tried before to do the loop with foreach and doParallel, but to be honest the simple "for" loop was the quickest (at least from what I've tried so far). – Nicholas Jun 12 '19 at 13:32
  • @LyzandeR I feared that this might be the case. I will try to look into a cloud solution for now... – Nicholas Jun 12 '19 at 13:35
  • Unfortunately I don't know any faster algorithms, but that does not mean much. – Ralf Stubner Jun 12 '19 at 13:35
  • `foreach` has a high overhead w.r.t. parallelization. Consider using R's `parallel` package or the newer entrant `future`. This will likely be the only way to speed up the comparisons as the algorithm is already written in C. – coatless Jun 12 '19 at 13:36
  • Definitely look into parallel / getting more computer power. Also, if you are able, pre-cleaning the data may reduce the number of comparisons (I do not know the nature of your data, and if it makes sense for your case). I just ran benchmarks using `sapply` with two packages and it looks like `stringdist` is noticeably faster than the package I used in the past: `RecordLinkage`. Good luck!! – Andrew Jun 12 '19 at 13:58
  • @coatless I just tried future, but the gain is rather minimal (<5%); I assume that the bottleneck here is really the computing power of my machine. As a side note, I get different results using future... I'll have to validate which one is the true result. – Nicholas Jun 12 '19 at 14:04
  • @Andrew unfortunately I cannot do pre-cleaning, as the only rows I would not need for the loop are rows where there is no distance ==1 and to find that out - they have to go into the loop. It's a catch 22. For the smaller data sets I could do a matrix before and then see which ones have to go into the loop, but the solution has to be scalable and I'm looking into >>50 million rows in the future of this project. – Nicholas Jun 12 '19 at 14:09
  • Distance for >50 million unique text strings is daunting. I would be hard-pressed to find some way to reduce it. You could filter out redundant strings by removing duplicates or using `unique` and then merge with your larger dataset after computing distance (that is if you are choosing the smallest distance or something of that nature). If you post a small example of your data I would be happy to at least try to help reduce it. – Andrew Jun 12 '19 at 15:06
  • @Andrew there are no duplicate values in there, everything is unique. I post a part of the vector in the original question for you to look at! – Nicholas Jun 12 '19 at 16:09
  • Very helpful! Do you need to match the closest or do you need to keep all the distances in a list? What do you need to do if there are multiple strings with the same value? The only way I can think of to reduce it is to work with a window for the number of characters in each string (i.e., 5 characters +/- 3 characters; instead of comparing a string with one character to one with 15 characters). But I don't know if that suits your needs. As a reference, I recently had to match the closest distance for a vector of ~1000 to one of ~90000. Not as many as you, but still not pleasant. – Andrew Jun 12 '19 at 19:01
  • @Andrew I need to know which strings have a distance of ==1 (right now I am working with 1, might change at a later time). Every string has approx. 800 values attached to it and I gather up all those values where the distance is 1 and add those to the values of my current string. Then I move on. I could do a subset of strings... your idea would be to group according to length, because if the length differs a lot, there will be no distances of 1 anyway! Great idea. However, how would I make sure that I have every string exactly once, as I would have to get overlapping groups? – Nicholas Jun 12 '19 at 19:08
  • Ok, could you possibly post your expected output either in a new question or edit this question. I think I am unclear on whether each string needs a match (and there be two records of each match) or if just one will do. I would also want to know whether or not `sum(duplicated(bioID))` is zero. – Andrew Jun 13 '19 at 11:45
  • stringdist is already parallelized. See ?stringdist for details. It is also a good idea to think about what distance measure you need. Again, see ?stringdist for details. – Jul 26 '19 at 07:22
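Following up on the parallelisation suggestions in the comments above, these are the two directions I am experimenting with: the `nthread` argument of `stringdist` itself (the package is OpenMP-parallelised, see `?stringdist`) and chunking the outer loop with the `parallel` package. This is only a rough sketch with illustrative core counts, not a benchmarked solution:

```r
library(stringdist)
library(parallel)

# Option 1: let stringdist use several threads internally
d <- stringdist(bioID[1], bioID, method = "lv", nthread = 4)

# Option 2: distribute the outer loop over query strings across cores
# (mclapply forks processes and does not work on Windows; use parLapply there)
hits_per_string <- mclapply(seq_along(bioID), function(i) {
  which(stringdist(bioID[i], bioID, method = "lv") == 1)
}, mc.cores = 4)
```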

0 Answers