
I have a vector A with 100k+ elements, and I want to calculate the distance between every element of this vector and every other element. I am trying to solve this in R, using the built-in adist function and also the stringdist package. The problem is that the computation is very heavy: it runs for days without finishing.

The end goal is to find duplicates or near-duplicates using a distance measure and then build some sort of classification model around them.

The code I am currently using is:

library(stringdist)

# declare an empty data frame and append data to it
matchedStr_vecA <- data.frame(row_index = integer(),
                              col_index = integer(),
                              vecA_i = character(),
                              vecA_j = character(),
                              dist_diff_vecA = double(),
                              stringsAsFactors = FALSE)

k <- 1 # index of the next free row in the data frame
# Visit each unordered pair once (i < j). The diagonal elements are zero and
# the matrix is symmetric, so the other half is a mirror image.
for (i in seq_len(length(vecA) - 1)) {
  for (j in (i + 1):length(vecA)) {
    dist_diff_vecA <- stringdist(vecA[i], vecA[j], method = "lv")
    # list() keeps the column types; c() would coerce everything to character
    matchedStr_vecA[k, ] <- list(i, j, vecA[i], vecA[j], dist_diff_vecA)
    k <- k + 1
  }
}

Please help me bring this computation down from O(n^2) to O(n). I am fine with using Python as well. I was told that this can be solved using dynamic programming, but I am not sure how to implement it.

Thanks all

Chandra
  • First, do you know the algorithm? – user202729 Aug 21 '18 at 10:19
  • You want to do `choose(100e3, 2)` comparisons. This is necessarily time-consuming, but you should do it with a compiled language and/or massive parallelization. Of course, it would be much better to switch from a brute-force to a smart approach for whatever you are actually trying to achieve. – Roland Aug 21 '18 at 10:20
  • @Roland and @user202729: I am new to the world of programming/coding and not aware of the approaches/algorithms to use. Can someone point me in the right direction? – Chandra Aug 21 '18 at 10:30
  • I have not used the `stringdist` package, but if it is similar to `adist` then it is vectorized, so `stringdistmatrix(vecA, method = "lv")` should return the matrix of results. This is significantly faster (100-1000 times faster) than your double loop. Then parse the matrix for your desired results. Of course, the question then becomes whether you have the memory for a 100k x 100k matrix. – Dave2e Aug 21 '18 at 13:17
  • @Dave2e: There is a memory issue with solving this, even with 8 GB of RAM. That is the reason I was looking for other options. Any help on alternative methods, like dynamic programming or otherwise? – Chandra Aug 24 '18 at 05:21
  • As mentioned above, finding the distances between all possible pairs means billions of combinations. If you do some preprocessing, such as sorting the list, you could then use a divide-and-conquer method to split the problem into manageable pieces. – Dave2e Aug 24 '18 at 11:47
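To put the numbers from the comments in perspective, here is a quick back-of-the-envelope calculation in Python. It assumes exactly 100,000 strings and 8 bytes per stored distance (a double); both figures are assumptions made for illustration, not measurements.

```python
import math

n = 100_000                      # vector size from the question (assumed exact)
pairs = math.comb(n, 2)          # unordered pairs (i, j) with i < j
full_matrix_bytes = n * n * 8    # dense n x n matrix of 8-byte doubles
triangle_bytes = pairs * 8       # one triangle only, diagonal excluded

print(f"{pairs:,} pairs")
print(f"{full_matrix_bytes / 1e9:.0f} GB for the full matrix")
print(f"{triangle_bytes / 1e9:.0f} GB for one triangle")
```

Even the triangle alone is several times larger than 8 GB, which suggests keeping only the pairs of interest rather than every distance.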

1 Answer


I had the very same problem of calculating a distance matrix, and I solved it successfully in Python. The crucial element of the solution, splitting the calculations equally between threads, is discussed in this question: How to split diagonal matrix into equal number of items each along one of axis?

There are three things to point out:

  1. Distance is typically symmetric, so you can calculate the distance between elements i and j once and either store it or reuse it for the distance between j and i.

  2. The algorithm cannot be optimized below O(n^2) unless you are OK with imprecise results. And since you are new to programming I would not even consider going that way.

  3. You should be able to parallelize the calculations using index splitting as I suggested in the question above for a near-optimal solution.
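The three points above can be sketched roughly as follows. This is a minimal illustration and not the code from the linked question: `levenshtein`, `split_rows`, `close_pairs`, `near_duplicates`, the `max_dist` threshold, and the greedy way of balancing rows are all assumptions introduced here. It uses processes rather than threads, since pure-Python loops do not speed up with threads. The dynamic programming mentioned in the question is what computes a single edit distance; it does not reduce the number of pairs.

```python
from concurrent.futures import ProcessPoolExecutor
from functools import partial


def levenshtein(a, b):
    """Edit distance via the classic row-by-row dynamic programme."""
    if len(a) < len(b):
        a, b = b, a                      # keep the DP row as short as possible
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,                # deletion
                            curr[j - 1] + 1,            # insertion
                            prev[j - 1] + (ca != cb)))  # substitution
        prev = curr
    return prev[-1]


def split_rows(n, parts):
    """Greedily cut rows 0..n-1 into ranges with roughly equal pair counts.

    Row i contributes n - 1 - i pairs (i, j) with j > i, so early rows
    are heavier and the early ranges are shorter.
    """
    target = n * (n - 1) / 2 / parts
    bounds, acc, start = [], 0, 0
    for i in range(n):
        acc += n - 1 - i
        if len(bounds) < parts - 1 and acc >= target * (len(bounds) + 1):
            bounds.append((start, i + 1))
            start = i + 1
    bounds.append((start, n))
    return bounds


def close_pairs(rows, strings, max_dist):
    """Return (i, j, distance) for i in rows, j > i, distance <= max_dist."""
    start, stop = rows
    out = []
    for i in range(start, stop):
        for j in range(i + 1, len(strings)):   # symmetry: only j > i
            d = levenshtein(strings[i], strings[j])
            if d <= max_dist:                  # store near-duplicates only
                out.append((i, j, d))
    return out


def near_duplicates(strings, max_dist=2, workers=4):
    """Run close_pairs over balanced row ranges in separate processes."""
    work = partial(close_pairs, strings=strings, max_dist=max_dist)
    with ProcessPoolExecutor(max_workers=workers) as pool:
        parts = pool.map(work, split_rows(len(strings), workers))
    return sorted(p for part in parts for p in part)
```

Call `near_duplicates(vec, max_dist=1)` from under an `if __name__ == "__main__":` guard so that worker processes can import the module safely. The work is still O(n^2) comparisons; the sketch only halves it through symmetry, spreads it across cores, and avoids storing the full matrix. For 100k strings a compiled edit-distance implementation would likely be needed in place of the pure-Python one.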

sophros