
I have a set of about 600,000 email addresses that I have to analyze for a project. The goal is to find the similarity between the name of each email (the part before the @) and all the other emails using the Levenshtein distance. I was looking into creating all of the combinations and writing them into an HDF file or some other out-of-memory store, but it is going to take way too long to generate all of those pairs. Is there any way to speed up the loop with parallel processing or pooling so that it doesn't take days to run?

My first piece of code is a generator, so that I'm not holding everything in memory, and the second applies the distance metric. Instead of all the code for the HDF file, I just have it append to a list to speed things up.

from itertools import combinations

# Lazily generate all pairs of email names so they are not all held in memory
def makeCombos(data, i=2):
    for combo in map(list, combinations(data, i)):
        yield combo

l = []

# Keep only the pairs whose Levenshtein distance is below 4
def combos(data):
    for x in makeCombos(data):
        if levenshteinDistanceDP(x[0], x[1]) < 4:
            l.append(x)

I also looked into using some sort of nearest neighbor algorithm like annoy, as they seem to be much more computationally efficient. But I am having a lot of trouble figuring out how to vectorize the email addresses or even set up a model like that.

Any suggestions would help.

  • How is `levenshteinDistanceDP` computed? The classical way to implement it runs in `O(n m)` time, where n and m are the lengths of the two strings. However, such an implementation is not the best one (especially if it is implemented in pure Python). – Jérôme Richard Jun 17 '20 at 11:33

2 Answers


You can use the multiprocessing module to parallelize the processing as follows, and the itertools module to generate the combinations.

import itertools, multiprocessing

def combos(data):
    with multiprocessing.Pool() as pool:
        pairs = list(itertools.combinations(data, 2))
        # Compute all distances in parallel, then keep the pairs under the threshold
        distances = pool.starmap(levenshteinDistanceDP, pairs)
        return [pair for pair, d in zip(pairs, distances) if d < 4]
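For very small inputs (such as a 500-entry sample), spawning the worker processes and pickling every pair can cost more than the distance computations themselves, so the parallel version can look slower; the benefit only shows up on larger batches. If needed, the chunksize argument of starmap can also be tuned so each worker receives bigger batches, for example (the value 10000 is only an illustrative guess):

        distances = pool.starmap(levenshteinDistanceDP, pairs, chunksize=10000)  # larger chunks amortize IPC overhead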
  • I tried this on sample data with only 500 entries and it is taking exponentially longer than running it without multiprocessing. – Noah Schwarzkopf Jun 16 '20 at 21:48
  • With a dummy distance function `ld`, `pool.starmap(ld, combos)` expression leads to 3-4x speed up relative to `map(lambda y: ld(*y), combos)` expression on my machine. So, I am surprised by what you are observing. – Venkatesh-Prasad Ranganath Jun 16 '20 at 22:56

TL;DR: As described here, the Levenshtein distance may not be the best way to measure the distance between all the emails. It is probably better to use alternative distances or even to change the approach entirely. Moreover, heuristics can be used to speed the execution up.

Since you have k strings and you want to compare all pairs, the overall complexity is O(k^2 L), where L is the cost of one call to levenshteinDistanceDP. With k = 600,000 that is roughly k^2/2 ≈ 1.8×10^11 pairs, so mainly due to the k^2 factor, the algorithm will take at least several hours or days to complete in your case.

To significantly reduce the complexity of the computation, Hamming distance and Jaccard similarity are a good start.
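For instance, here is a minimal sketch of a Jaccard similarity on character bigrams (the bigram choice is just one reasonable option); it runs in roughly linear time per pair, which makes it a cheap pre-filter:

# Jaccard similarity on character bigrams: 1.0 means identical bigram sets
def bigrams(s):
    return {s[i:i+2] for i in range(len(s) - 1)}

def jaccardSimilarity(s, t):
    bs, bt = bigrams(s), bigrams(t)
    if not bs and not bt:  # both strings shorter than 2 characters
        return 1.0
    return len(bs & bt) / len(bs | bt)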

If using an approximation is fine, you can alternatively:

  • design a function f that transforms a mail into a representative numerical descriptor (ie. feature vector) while preserving locality (how close mails are);
  • apply function f(s) on each mail string s;
  • compare all the resulting pairs efficiently (eg. using binary space partitioning such as k-d trees, or statistical/classification methods).

However, the hard part is to find a good f candidate (machine learning methods can help to find it).

Please note that you can use the above method to significantly filter the results before applying your current exact method. The resulting method (heuristic) is exact if the approximation never over-estimates the actual distance (ie. an admissible heuristic).


UPDATE:

A simple admissible heuristic is the (corrected) Manhattan distance of vectors containing character frequencies (ie. the number of occurrences of each character from a given set). Here is an example of code using this heuristic:

import numpy as np

# Count the number of 'a', 'b', 'c' ..., 'y', 'z' and '0', '1', ..., '9' in the string s
def freq(s):
    res = np.zeros(36, dtype=int)
    for c in map(ord, s.upper()):
        if c >= 65 and c <= 90: # A-Z
            res[c-65] += 1
        elif c >= 48 and c <= 57: # 0-9
            res[c-48+26] += 1
    return res

# Compare the two frequency vectors fs and ft
def freqDist(fs, ft):
    manDist = np.abs(fs-ft).sum()
    return (manDist + 1) // 2

# Faster heuristic but not admissible (ie. approximation)
def freqDistApprox(fs, ft):
    return np.abs(fs-ft).sum()

l = []

def fasterCombos(data):
    freqs = {s: freq(s) for s in data}  # Precompute frequencies (feature vectors)
    for x in makeCombos(data):
        s, t = x[0], x[1]
        if freqDist(freqs[s], freqs[t]) < 4:     # Cheap admissible estimate of the Levenshtein distance
            if levenshteinDistanceDP(s, t) < 4:  # Exact check only for the surviving pairs
                l.append(x)

This simple heuristic should significantly reduce the number of Levenshtein distances actually computed. However, it tends to clearly underestimate the distance. freqDistApprox speeds the execution up further at the cost of an approximate result.

Once a good heuristic has been found, binary space partitioning can be used to compare only feature vectors that are near each other (ie. with an estimated Levenshtein distance small enough). This can be done quite efficiently by iterating over all feature vectors and checking their neighborhoods. The complexity of this algorithm is O(k n (L + D log(k))), where n is the average number of close neighbors (with 0 < n <= k) and D the dimension of each feature vector.
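Here is a rough sketch of this idea, assuming SciPy is available; it reuses the freq vectors above, and the Manhattan radius 2*(threshold-1) is simply the largest value for which freqDist < threshold, since freqDist = (manDist + 1) // 2:

import numpy as np
from scipy.spatial import cKDTree

def fasterCombosKDTree(data, threshold=4):
    emails = list(data)
    vectors = np.array([freq(s) for s in emails])  # Feature vectors from the freq() helper above
    tree = cKDTree(vectors)                        # Space-partitioning structure
    radius = 2 * (threshold - 1)                   # Manhattan radius equivalent to freqDist < threshold
    result = []
    # Only compare each email with the ones whose feature vectors are close enough
    for i, neighbors in enumerate(tree.query_ball_point(vectors, r=radius, p=1)):
        for j in neighbors:
            if j > i and levenshteinDistanceDP(emails[i], emails[j]) < threshold:
                result.append((emails[i], emails[j]))
    return result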

Finally, note that the worst-case complexity is still O(k^2), since l can contain O(k^2) pairs if all mails are equal or nearly equal (ie. with a very small Levenshtein distance; this is the case where n ~= k). However, when mails are very different from each other (or the distance threshold is small enough) and a good heuristic is used, the resulting approach should be drastically faster (since n << k).

Jérôme Richard
  • Would you have any resources or examples of what kind of function I could use for f or how to implement binary space partitioning? – Noah Schwarzkopf Jun 17 '20 at 16:03
  • @NoahSchwarzkopf I updated the answer with a detailed example for `f` and more information about the binary space partitioning method (low-level implementation details are available in the provided links). – Jérôme Richard Jun 17 '20 at 21:09
  • This is super helpful, I appreciate it! The only thing I'm a little confused about is using binary space partitioning on the vectors. The vectors themselves don't seem as detailed as they should be to find the most similar words. Also, the biggest problem here is the time it will take, and running this heuristic over 600k names is still going to take forever. – Noah Schwarzkopf Jun 19 '20 at 15:15