TL;DR: As described here, the Levenshtein distance may not be the best way to measure the distance between all the mails. It is probably better to use alternative distances, or even to change the approach entirely. Moreover, heuristics can be used to speed up the execution.
Since you have `k` strings and you want to compare all pairs, the overall complexity is `O(k^2 L)` where `L` is the cost of one call to `levenshteinDistanceDP` (itself quadratic in the mail length). Thus, mainly due to the `k^2` factor, the algorithm will take at least several hours/days to complete in your case.
To significantly reduce the cost of each comparison, the Hamming distance and the Jaccard similarity are a good start, since both can be computed in linear time (see the sketch below).
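A minimal sketch of these two alternatives (the function names are illustrative, not from your code):

```python
def hammingDistance(s, t):
    # Number of differing positions; only defined for equal-length strings
    assert len(s) == len(t)
    return sum(a != b for a, b in zip(s, t))

def jaccardSimilarity(s, t):
    # Overlap of the character sets: 1.0 means identical sets, 0.0 disjoint
    a, b = set(s), set(t)
    return len(a & b) / len(a | b) if (a | b) else 1.0
```

Note that the Hamming distance only applies to strings of equal length, and the Jaccard similarity ignores character order and multiplicity, so both are rough proxies for the Levenshtein distance.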
If using an approximation is fine, you can alternatively:
- design a function `f` that transforms a mail into a representative numerical descriptor (i.e. a feature vector) while preserving locality (how close mails are);
- apply `f(s)` to each mail string `s`;
- compare all the resulting pairs efficiently (e.g. using binary space partitioning such as k-d trees, or statistical/classification methods).
However, the hard part is to find a good candidate for `f` (machine learning methods can help to find one).
Please note that you can use the above method to significantly filter the pairs before applying your current exact method. The resulting method (a heuristic) is still exact as long as the approximation never overestimates the actual distance (i.e. an admissible heuristic): any pair discarded by such a lower bound is guaranteed to be above the threshold.
UPDATE:
A simple admissible heuristic is the (corrected) Manhattan distance between vectors of character frequencies (i.e. the number of occurrences of each character in a given set). It is admissible because an insertion or deletion changes one character count while a substitution changes at most two, so the Levenshtein distance is at least half the Manhattan distance (rounded up).
Here is an example of code using this heuristic:
```python
import numpy as np

# Count the number of 'a', 'b', ..., 'z' and '0', '1', ..., '9' characters
# in the string s (case-insensitive), producing a 36-dimensional feature vector
def freq(s):
    res = np.zeros(36, dtype=int)
    for c in map(ord, s.upper()):
        if 65 <= c <= 90:    # A-Z
            res[c - 65] += 1
        elif 48 <= c <= 57:  # 0-9
            res[c - 48 + 26] += 1
    return res

# Compare the two frequency vectors fs and ft.
# One edit operation changes at most 2 character counts (substitution),
# so ceil(manDist / 2) is a lower bound of the Levenshtein distance.
def freqDist(fs, ft):
    manDist = np.abs(fs - ft).sum()
    return (manDist + 1) // 2

# Faster heuristic, but not admissible (i.e. an approximation)
def freqDistApprox(fs, ft):
    return np.abs(fs - ft).sum()

l = []

def fasterCombos(data):
    freqs = {s: freq(s) for s in data}  # Precompute frequencies (feature vectors)
    for x in makeCombos(data):
        s, t = x[0], x[1]
        if freqDist(freqs[s], freqs[t]) < 4:     # Cheap lower bound on the Levenshtein distance
            if levenshteinDistanceDP(s, t) < 4:  # Exact (expensive) check on candidates only
                l.append(x)
```
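As a hypothetical usage example (assuming `makeCombos` and `levenshteinDistanceDP` are the functions from your question, and using a made-up sample):

```python
data = ["john.doe@example.com", "john.doa@example.com", "alice@example.org"]
fasterCombos(data)
print(l)  # should only contain the near-duplicate pair of addresses
```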
This simple heuristic should significantly reduce the number of computed Levenshtein distances, although it tends to clearly underestimate the distance. `freqDistApprox` speeds the execution up further, at the cost of an approximate result.
Once a good heuristic has been found, binary space partitioning can be used to compare only feature vectors that are near each other (i.e. with an estimated Levenshtein distance close enough). This can be done quite efficiently by iterating over all feature vectors and checking their neighborhood. The complexity of this algorithm is `O(k n (L + D log(k)))` where `n` is the average number of close neighbors (with `0 < n <= k`) and `D` is the dimension of each feature vector.
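Here is a minimal sketch of that idea, assuming SciPy is available and reusing `freq` and `levenshteinDistanceDP` from above (the name `kdTreeCombos` is illustrative; `query_pairs` with `p=1` matches the Manhattan metric used by the heuristic):

```python
import numpy as np
from scipy.spatial import cKDTree

def kdTreeCombos(data, threshold=4):
    strings = list(data)
    vectors = np.array([freq(s) for s in strings])
    tree = cKDTree(vectors)
    # freqDist(fs, ft) < threshold  <=>  Manhattan distance <= 2 * (threshold - 1),
    # so the k-d tree enumerates only the candidate pairs within this radius
    candidates = tree.query_pairs(r=2 * (threshold - 1), p=1)
    return [(strings[i], strings[j]) for i, j in candidates
            if levenshteinDistanceDP(strings[i], strings[j]) < threshold]
```

This avoids enumerating all `k^2` pairs explicitly when the neighborhoods are small.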
Finally, note that the worst-case complexity is still `O(k^2)`, since `l` can contain `O(k^2)` pairs if all mails are equal or nearly equal (with a very small Levenshtein distance, this is the case where `n ~= k`). However, when mails are very different from each other (or the distance threshold is small enough) and a good heuristic is used, the resulting approach should be drastically faster (since `n << k`).