
I have a collection of 40,000 strings and want to compute their pairwise similarity using fuzz.token_set_ratio(), but I cannot figure out how to do this efficiently, even after looking into vectorization.

Here is an example:

from fuzzywuzzy import fuzz

s = ["fuzzy was a strong bear", 
 "fuzzy was a large bear", 
 "fuzzy was the strongest bear you could ever imagine"]

# Build the full l x l score matrix, one scorer call per (i, j) pair
similarities = []
l = len(s)

for i in range(l):
    similarities.append([])
    for j in range(l):
        similarities[i].append(fuzz.token_set_ratio(s[i], s[j]))
similarities

Now obviously, this code has at least two shortcomings. First, it uses inefficient for-loops. Second, the resulting similarities matrix is symmetric (this is not always true, but ignore that for now), so only the upper or lower triangle needs to be computed, yet the code computes all elements. The latter is something I could probably code my way around (see the sketch below), but I am looking for the quickest way to arrive at similarities in Python.
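For reference, a minimal sketch of the triangle-only variant could look like this; it scores each unordered pair once and mirrors the result, but it still makes one Python-level scorer call per pair, so at best it halves the runtime rather than vectorizing anything:

from itertools import combinations

from fuzzywuzzy import fuzz

s = ["fuzzy was a strong bear",
     "fuzzy was a large bear",
     "fuzzy was the strongest bear you could ever imagine"]

l = len(s)
# The diagonal is 100, since any non-empty string fully matches itself
similarities = [[100] * l for _ in range(l)]

# Score each unordered pair once, then mirror into both triangles
for i, j in combinations(range(l), 2):
    score = fuzz.token_set_ratio(s[i], s[j])
    similarities[i][j] = score
    similarities[j][i] = score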

Edit: Here is another piece of perhaps useful information. I tried to speed up the process using scipy's pdist, which seems to perform well on some similar tasks. However, in this case it turns out to be slower than my inefficient for-loops, and I am not sure why.

Here is the code:

from fuzzywuzzy import fuzz
from scipy.spatial.distance import pdist, squareform
import numpy as np

def pwd(string1, string2):
    # wrapper so pdist can call token_set_ratio as a custom metric
    return fuzz.token_set_ratio(string1, string2)

# Build a 300-string benchmark list by repeating the three examples
s = []
for i in range(100):
    s.append("fuzzy was a strong bear")
    s.append("fuzzy was a large bear")
    s.append("fuzzy was the strongest bear you could ever imagine")

def pwd_loops():
    # baseline: full matrix via nested Python loops
    similarities = []
    l = len(s)
    for i in range(l):
        similarities.append([])
        for j in range(l):
            similarities[i].append(fuzz.token_set_ratio(s[i], s[j]))

# pdist expects a 2-D array, hence the reshape to a single column
a = np.array(s).reshape(-1, 1)

def pwd_pdist():
    dm = squareform(pdist(a, pwd))

%time pwd_loops()
#Wall time: 2.39 s

%time pwd_pdist()
#Wall time: 3.73 s
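
For comparison, here is a sketch using the rapidfuzz library rather than fuzzywuzzy (an assumption on my part that swapping libraries is acceptable): rapidfuzz reimplements the same scorers in C++, and its process.cdist builds the whole score matrix in native code, optionally across all CPU cores:

from rapidfuzz import fuzz, process

s = ["fuzzy was a strong bear",
     "fuzzy was a large bear",
     "fuzzy was the strongest bear you could ever imagine"]

# cdist scores every query/choice pair in native code;
# workers=-1 spreads the work over all CPU cores
similarities = process.cdist(s, s, scorer=fuzz.token_set_ratio, workers=-1)
print(similarities)  # numpy array of shape (len(s), len(s))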
  • Unless you know something deep about the metric and/or the data set, how can you do better than the *n*(*n*-1)/2 comparisons made by the (symmetry-aware) “inefficient for-loops”? – Davis Herring Dec 23 '18 at 21:55
  • I assumed this task could be performed more efficiently after vectorization, though I do not yet have a deep understanding of the process – JBN Dec 23 '18 at 22:11
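
To spell out the arithmetic in the first comment: even exploiting symmetry, 40,000 strings leave on the order of 8×10^8 scorer calls, which suggests the per-call cost, not the loop construct, is the bottleneck.

n = 40_000
print(n * (n - 1) // 2)  # 799,980,000 pairwise comparisons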
