I have a vector A of size 100k+ and I want to calculate the distance between every element of this vector and every other element. I am trying to solve this problem in R, using the built-in adist function and also the stringdist package. The problem is that the computation is very heavy and keeps running for days without finishing.
The end problem I am trying to solve is finding duplicates or near-duplicates using a distance measure and then building some sort of classification model around them.
The code I am currently using is:
library(stringdist)

# Declare an empty data frame and append data to it
matchedStr_vecA <- data.frame(row_index = integer(),
                              col_index = integer(),
                              vecA_i = character(),
                              vecA_j = character(),
                              dist_diff_vecA = double(),
                              stringsAsFactors = FALSE)
k <- 1  # index of the next row to write in the data frame
n <- length(vecA)
# Loop over the pairs with i < j only (above the diagonal), since the
# diagonal elements are zero and the other half is its mirror image
for (i in seq_len(n - 1)) {
  for (j in (i + 1):n) {
    dist_diff_vecA <- stringdist(vecA[i], vecA[j], method = "lv")
    # list() keeps the column types; c() would coerce everything to character
    matchedStr_vecA[k, ] <- list(i, j, vecA[i], vecA[j], dist_diff_vecA)
    k <- k + 1
  }
}
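For a small input, the same pair table can be built without the explicit loops. The following is only a sketch on a toy vector (the values of `vecA` here are made up for illustration), using base R's `adist`, which computes the whole Levenshtein distance matrix in one call. It still does O(n^2) work and needs O(n^2) memory, so I do not expect it to scale to 100k+ elements.

```r
# Toy input, made up for illustration only
vecA <- c("apple", "appel", "banana", "bananna")

# adist() (utils package, attached by default) returns the full n x n
# matrix of generalized Levenshtein distances
d <- adist(vecA)

# Keep only the pairs with i < j: the diagonal is zero and the matrix is symmetric
idx <- which(upper.tri(d), arr.ind = TRUE)

matchedStr_vecA <- data.frame(row_index = idx[, "row"],
                              col_index = idx[, "col"],
                              vecA_i = vecA[idx[, "row"]],
                              vecA_j = vecA[idx[, "col"]],
                              dist_diff_vecA = d[idx],
                              stringsAsFactors = FALSE)
```

With n elements this produces n * (n - 1) / 2 rows, which is the part that becomes unmanageable for my real data.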
Please help me bring this computation from O(n^2) down to O(n). I am fine with using Python as well. I was told that this can be solved using dynamic programming, but I am not sure how to implement it.
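My current understanding, which may be incomplete, is that the dynamic programming people refer to is the standard table-filling computation of the Levenshtein distance for a single pair of strings, roughly as sketched below. `lev_dp` is a name I made up for illustration and is not part of any package. If this is right, it only concerns the cost of each comparison and does not reduce the number of pairs, so I do not see how it would get me from O(n^2) to O(n).

```r
# Dynamic programming (table-filling) Levenshtein distance for ONE pair of strings.
# lev_dp is a hypothetical helper name used only for this sketch.
lev_dp <- function(a, b) {
  x <- strsplit(a, "")[[1]]
  y <- strsplit(b, "")[[1]]
  m <- length(x)
  n <- length(y)
  # D[i + 1, j + 1] holds the distance between the first i characters of a
  # and the first j characters of b
  D <- matrix(0L, nrow = m + 1, ncol = n + 1)
  D[, 1] <- 0:m
  D[1, ] <- 0:n
  for (i in seq_len(m)) {
    for (j in seq_len(n)) {
      cost <- if (x[i] == y[j]) 0L else 1L
      D[i + 1, j + 1] <- min(D[i, j + 1] + 1L,  # deletion
                             D[i + 1, j] + 1L,  # insertion
                             D[i, j] + cost)    # substitution or match
    }
  }
  D[m + 1, n + 1]
}

lev_dp("kitten", "sitting")
```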
Thanks all