I have a list of strings, and I want to build a dataframe containing the Jaro-Winkler normalized similarity for each pair of strings. The textdistance package provides a function to compute it. Loosely, similar strings have a score close to 1, and dissimilar strings have a score close to 0. My actual list has about 4000 strings, so there are nearly 8 million pairs to compare (4000 · 3999 / 2 = 7,998,000).
This seems like an "embarrassingly parallel" computation to me. Is there some way to do it in dask? A bonus would be a tqdm-style progress bar with an ETA. I've included a rough sketch of what I have in mind after the sample output below.

Here is my current serial version:
from itertools import combinations

import pandas as pd
import textdistance

strings = ["adsf", "apple", "apples", "banana"]

def similarity(left: str, right: str) -> float:
    """
    Computes Jaro-Winkler normalized_similarity, which is between 0 and 1.
    More similar strings have a score closer to 1.
    """
    return textdistance.jaro_winkler.normalized_similarity(left, right)

generator = (
    (left, right, similarity(left, right)) for left, right in combinations(strings, 2)
)
df = pd.DataFrame(generator, columns=["left", "right", "sim_score"])
Sample of df:
     left   right  sim_score
0    adsf   apple   0.483333
1    adsf  apples   0.472222
2    adsf  banana   0.472222
3   apple  apples   0.966667
4   apple  banana   0.455556
5  apples  banana   0.444444
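Here is a rough sketch of what I have in mind, in case it clarifies the question. It uses dask.bag, and I'm not sure it's idiomatic; the npartitions value in particular is a guess that would presumably need tuning for ~8 million pairs. dask.diagnostics.ProgressBar isn't exactly tqdm, but as far as I understand it prints percent complete and elapsed time for the local schedulers:

from itertools import combinations

import dask.bag as db
import pandas as pd
import textdistance
from dask.diagnostics import ProgressBar

strings = ["adsf", "apple", "apples", "banana"]

def similarity(left: str, right: str) -> float:
    return textdistance.jaro_winkler.normalized_similarity(left, right)

def score_pair(pair: tuple) -> tuple:
    # Unpack one (left, right) pair and return a row for the dataframe.
    left, right = pair
    return (left, right, similarity(left, right))

# Materialize the pairs and split them into partitions so each dask task
# has enough work to amortize scheduler overhead; npartitions is a guess.
pairs = list(combinations(strings, 2))
bag = db.from_sequence(pairs, npartitions=4)

# ProgressBar prints percent complete plus elapsed time
# (not quite a tqdm ETA, but close).
with ProgressBar():
    records = bag.map(score_pair).compute()

df = pd.DataFrame(records, columns=["left", "right", "sim_score"])

One thing that makes me hopeful: textdistance is pure Python, and my understanding is that dask.bag defaults to a process-based scheduler, so the work should not be serialized by the GIL the way a thread-based approach would be. For a true tqdm bar, I've also seen tqdm.dask.TqdmCallback mentioned as a way to wrap the compute() call, though I haven't tried it.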