I have a loop in which I have to calculate the distance between one string and a vector of many strings. I use the package "stringdist" and its function of the same name, which works well.
However, calculating the distances each time takes quite a while. For example, getting the distances between one word and 3.5 million other words takes roughly 0.5 seconds. That does not seem like much, but doing it 3.5 million times takes far too long.
I cannot compute a full distance matrix, as it would be 3.5 million by 3.5 million and therefore too large; that is why I split the calculation up in the first place.
Is there possibly a way to compute Levenshtein and/or Hamming distances using Rcpp to speed this up (a lot)?
I have tried the compiler package and "cmpfun", but this does not change the speed. I assume I would have to rewrite the stringdist function in C++? Unfortunately I have no idea how.
The stringdist call takes >95% of the time of each loop step, so a reduction there would help immensely.
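For what it's worth, any Rcpp solution would boil down to a C++ Levenshtein core that a wrapper then exposes to R. A minimal sketch of the standard two-row dynamic-programming version (my own illustration, not stringdist's actual implementation) might look like:

```cpp
#include <algorithm>
#include <string>
#include <vector>

// Two-row dynamic-programming Levenshtein distance.
// An Rcpp wrapper would export a function like this to R.
int levenshtein(const std::string& a, const std::string& b) {
    const std::size_t n = b.size();
    std::vector<int> prev(n + 1), cur(n + 1);
    // Distance from the empty prefix of a to each prefix of b.
    for (std::size_t j = 0; j <= n; ++j) prev[j] = static_cast<int>(j);
    for (std::size_t i = 0; i < a.size(); ++i) {
        cur[0] = static_cast<int>(i) + 1;  // delete all of a[0..i]
        for (std::size_t j = 0; j < n; ++j) {
            int subst = prev[j] + (a[i] != b[j] ? 1 : 0);
            cur[j + 1] = std::min({prev[j + 1] + 1,  // deletion
                                   cur[j] + 1,       // insertion
                                   subst});          // substitution/match
        }
        std::swap(prev, cur);
    }
    return prev[n];
}
```

Looping this over a character vector inside C++ avoids the per-call overhead of repeated R-level calls, which is where a speedup would presumably come from.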
Any help would be appreciated.
Edit 1:
This is a small vector of strings as example:
bioID
[1] "F" "FPhGLGcbLFF" "FhGWhSLYcmYF" "FhGcGbGYhF" "GGLcSFhGmhLYF" "GGbhhcLFF"
[7] "GLWGcGLmhcLFF" "GLYmcmFF" "GLbcmYF" "GLhhFGmGccmFF" "GLhhLGYLbGmFF" "GLhhLGbGchmYF"
[13] "GLhhLGmLYcmYF" "GLhhLLLGmcmFF" "GLhhLhGGGcmYF" "GLhhPPmmchmYF" "GLhhmGbGLcmYF" "GLhmYbGmmPmbF"
[19] "GLhmcbLFF" "GPhbhYmhPLbF" "GbhmLFF" "GhhYcmYF" "GmGhGYhcLFF" "GmbmbmhcLFF"
[25] "LGGYmcmFF" "LGLGmPmbF" "LGbF" "LGhbLchmYF" "LLGLYhGcLFF" "LLPGhhbPLmcmFF"
[31] "LLcmmPPmhcLFF" "LLhhLLGLhhYmcmFF" "LPPhcbLFF" "LYcmYF" "LbGmmPmbF" "LbLLGGccmYF"
[37] "LhPbGchmYF" "LhbGbmYGYhF" "LmhGLmLLhF" "PGYLhGcGYhF" "PLhhLLGLhhYmcmFF" "PLhhchhGhGLccmFF"
[43] "PLhmGLhhPmGGcmFF" "PbLhhbLmhGcLFF" "PbbcbGbcLGYhF" "PbhLcLGmhLYF" "PcLFF" "PcPcLFF"
[49] "PhbcLSmcmFF" "PmYcmYF" "PmbF" "SFFbmbhLYcmYF" "SGGGbhchmYF" "SGGPhLGcbLFF"
[55] "SGGmGcmhGcLFF" "SGLGcFGhcLFF" "SGLGmGLGcmYF" "SGLLGhGmhLYF" "SGLPbPmYmcmFF" "SGLWhGcGbLFF"
[61] "SGLmmLmhcLFF" "SGPLbbmmPmbF" "SGPmhLcbcchmYF" "SGSGGbLhchmYF" "SGWGYLGcmYF" "SGWhLbPLbcmYF"
[67] "SGbGGmhLYF" "SGcbLFGmmPmbF" "SGcmWGGGLLbF" "SGhLLGGLmcmYF" "SGhbhGPcYcmYF" "SGmGGLLFLYmcmFF"
[73] "SGmLGLLmPmbF" "SLFGGhGbbLcLFF" "SLFGbGFhcLFF" "SLFGmGGhGLmLLhF" "SLFPFbcLLLYcmYF" "SLFPLLGGhchmYF"
[79] "SLFSFbcLFF" "SLFbGhcmYGYhF" "SLFbGmLYGYhF" "SLFcGGLccbLFF" "SLFhGLLmhcLFF" "SLFmGLbcmGmcmFF"
[85] "SLFmPchmYcmYF" "SLFmbPLGLmLLhF" "SLGGGLLFYmcmFF" "SLGGGLLGmhcLFF" "SLGGGLmcbLFF" "SLGGGYmcmFF"
[91] "SLGGGhGLmLLhF" "SLGGGhLcYmcmFF" "SLGGGhhcLFF" "SLGGLGYhmcmFF" "SLGGLLGcYmcmFF" "SLGGLLhGcLFF"
[97] "SLGGLhFhcLFF" "SLGGSGLhGhhYmcmFF" "SLGGbLYcmYF" "SLGGbbcLYGYhF"
Edit 2: I used @Andrew's comment to rethink the problem, and here is how I am working with it now:
As I currently only need distances of 1 anyway, I only compute the distance between the string at hand and the strings that are either the same length, one letter shorter, or one letter longer. That reduces the time significantly. It still takes a long time for large datasets, but it already helps.
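The length filter described above could be sketched as a pre-pass like the following (the function name is hypothetical, not from stringdist); it relies on the fact that a Levenshtein distance of 1 requires the two lengths to differ by at most 1:

```cpp
#include <string>
#include <vector>

// Return the indices of pool entries whose length differs from the
// query's length by at most one; only these can have distance <= 1.
std::vector<std::size_t> lengthFilteredCandidates(
        const std::string& query, const std::vector<std::string>& pool) {
    std::vector<std::size_t> keep;
    const long qlen = static_cast<long>(query.size());
    for (std::size_t i = 0; i < pool.size(); ++i) {
        long diff = static_cast<long>(pool[i].size()) - qlen;
        if (diff >= -1 && diff <= 1) keep.push_back(i);  // same length or +/- 1
    }
    return keep;
}
```

Full distances then only need to be computed for the surviving candidates, which is where the time saving comes from.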