I have a collection of 40,000 strings and want to compare their similarity pairwise using fuzz.token_set_ratio(), but I can't work out how to do this efficiently, even after looking into vectorization.
Here is an example:
from fuzzywuzzy import fuzz

s = ["fuzzy was a strong bear",
     "fuzzy was a large bear",
     "fuzzy was the strongest bear you could ever imagine"]

similarities = []
l = len(s)
for i in range(l):
    similarities.append([])
    for j in range(l):
        similarities[i].append(fuzz.token_set_ratio(s[i], s[j]))

similarities
Now, this code obviously has at least two shortcomings. First, it uses inefficient for-loops. Second, the resulting similarities matrix is symmetric (this is not always true, but ignore that for now), so I only need to compute its upper or lower triangle, yet the code computes every element. The latter is probably something I could code my way around (see the sketch below), but I am looking for the quickest way to arrive at similarities in Python.
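To make the second point concrete, exploiting the symmetry would look roughly like the sketch below (reusing s from the example above; untimed, and it still relies on Python-level loops, just with half as many token_set_ratio calls):

from itertools import combinations
import numpy as np
from fuzzywuzzy import fuzz

n = len(s)
similarities = np.full((n, n), 100, dtype=int)  # diagonal: a string always matches itself with score 100
for i, j in combinations(range(n), 2):          # visit each unordered pair only once
    score = fuzz.token_set_ratio(s[i], s[j])
    similarities[i, j] = score
    similarities[j, i] = score                  # mirror into the other triangle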
Edit: Here is another piece of perhaps useful information. I tried to speed up the process using pdist, which seems to perform well for some similar tasks. However, in this case, it seems to be slower than my inefficient for-loops for some reason. Here is the code:
from fuzzywuzzy import fuzz
from scipy.spatial.distance import pdist, squareform
import numpy as np

def pwd(string1, string2):
    return fuzz.token_set_ratio(string1, string2)

# Test data: 300 strings (100 copies of each of the three examples above)
s = []
for i in range(100):
    s.append("fuzzy was a strong bear")
    s.append("fuzzy was a large bear")
    s.append("fuzzy was the strongest bear you could ever imagine")

def pwd_loops():
    similarities = []
    l = len(s)
    for i in range(l):
        similarities.append([])
        for j in range(l):
            similarities[i].append(fuzz.token_set_ratio(s[i], s[j]))

# pdist expects a 2-D array, hence the reshape to one string per row
a = np.array(s).reshape(-1, 1)

def pwd_pdist():
    dm = squareform(pdist(a, pwd))

%time pwd_loops()
# Wall time: 2.39 s
%time pwd_pdist()
# Wall time: 3.73 s
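For reference, pdist by itself already evaluates each unordered pair only once and returns the scores in condensed form (n*(n-1)/2 values for n strings); squareform just mirrors that vector into the full square matrix. A hypothetical variant that skips the squareform step (not timed above) would be:

def pwd_pdist_condensed():
    # condensed output: one score per unordered pair, no redundant mirror half
    return pdist(a, pwd)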