
I have a list of strings, and I want to build a dataframe which gives the Jaro-Winkler normalized similarity between each pair of strings. There is a function in the package textdistance to compute it. Loosely, similar strings have a score close to 1, and different strings have a score close to 0. My actual list of strings has about 4000 strings, so there are nearly 8 million pairs of strings to compare.
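As a quick sanity check on that figure, the number of unordered pairs can be computed with the standard library. This is only arithmetic on the count; the variable names are illustrative.

```python
from math import comb

n_strings = 4000
# Unordered pairs of distinct strings: n * (n - 1) / 2
n_pairs = comb(n_strings, 2)
print(n_pairs)  # 7998000
```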

This seems like an "embarrassingly parallel" computation to me. Is there some way to do this in dask? A bonus would be a tqdm-style progress bar with an ETA.

from itertools import combinations

import pandas as pd
import textdistance

strings = ["adsf", "apple", "apples", "banana"]


def similarity(left: str, right: str) -> float:
    """
    Computes Jaro-Winkler normalized_similarity, which is between 0 and 1.

    More similar strings have a score closer to 1.
    """
    return textdistance.jaro_winkler.normalized_similarity(left, right)


generator = (
    (left, right, similarity(left, right)) for left, right in combinations(strings, 2)
)
df = pd.DataFrame(generator, columns=["left", "right", "sim_score"])

Sample of df:

     left   right  sim_score
0    adsf   apple   0.483333
1    adsf  apples   0.472222
2    adsf  banana   0.472222
3   apple  apples   0.966667
4   apple  banana   0.455556
5  apples  banana   0.444444
Adrian Mole
hwong557
  • Your example `similarity()` function is bad example material, because it could be replaced with [`pd.Series.eq` function/`==` operator](https://pandas.pydata.org/docs/reference/api/pandas.Series.eq.html), which vectorizes. Give us a better non-trivial example `similarity()` function. – smci Jan 07 '22 at 23:44
  • @smci I have the Jaro-Winkler (normalized) string similarity in mind, found [here](https://github.com/life4/textdistance). – hwong557 Jan 08 '22 at 03:55
  • But paste code for a better non-trivial example `similarity()` function here. You need to improve your code example. Currently this is not a good reproducible example ([mcve]). – smci Jan 08 '22 at 04:31
  • Okay you fixed `similarity()` to actually be normalized Jaro-Winkler, this should be reopened. – smci Jan 13 '22 at 04:41

1 Answer

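There are lots of ways to do this; here's one using dask.dataframe and the strings list from the question. Note that the similarity function below is a vectorized placeholder (first-character equality), not Jaro-Winkler.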

In [1]: import dask, dask.distributed, dask.dataframe as dd, pandas as pd, itertools

In [2]: client = dask.distributed.Client()

In [3]: futures = client.scatter(strings)

In [4]: def similarity(df, left_col: str, right_col: str) -> pd.Series:
   ...:     """
   ...:     Return 0 or 1 if first char of strings are different or equal.
   ...:     """
   ...:     return (
   ...:         df[left_col].str[0] == df[right_col].str[0]
   ...:     ).astype('int64')
   ...:

In [5]: def make_df(s, others):
   ...:     return pd.DataFrame({
   ...:         'left': [s]*len(others),
   ...:         'right': others,
   ...:     })
   ...:

In [6]: dfs = client.map(make_df, futures, others=strings)
In [7]: df = dd.from_delayed(dfs)

The function can then be applied to each partition using ddf.map_partitions:

In [8]: df['similarity'] = df.map_partitions(
   ...:     similarity, left_col='left', right_col='right', meta='int64',
   ...: )

The result is a lazy dask.dataframe with the left, right, and similarity columns; nothing is computed until you call compute():

In [9]: df
Out[9]:
Dask DataFrame Structure:
                 left   right similarity
npartitions=4
               object  object      int64
                  ...     ...        ...
                  ...     ...        ...
                  ...     ...        ...
                  ...     ...        ...
Dask Name: assign, 16 tasks

In [10]: df.compute()
Out[10]:
     left   right  similarity
0    adsf    adsf           1
1    adsf   apple           1
2    adsf  apples           1
3    adsf  banana           0
0   apple    adsf           1
1   apple   apple           1
2   apple  apples           1
3   apple  banana           0
0  apples    adsf           1
1  apples   apple           1
2  apples  apples           1
3  apples  banana           0
0  banana    adsf           0
1  banana   apple           0
2  banana  apples           0
3  banana  banana           1
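Note that the frame above holds every ordered pair, including each string paired with itself, whereas the question only needs the unordered pairs from itertools.combinations. A minimal sketch of that variant with a progress bar follows, using dask.bag on a local scheduler rather than the distributed client. The names pairwise_scores and stand_in_score are hypothetical, and the difflib-based scorer is only a standard-library stand-in, not Jaro-Winkler; pass textdistance.jaro_winkler.normalized_similarity from the question as the score argument to get the real thing.

```python
from difflib import SequenceMatcher
from itertools import combinations

import dask.bag as db
import pandas as pd
from dask.diagnostics import ProgressBar


def stand_in_score(left: str, right: str) -> float:
    """Stdlib stand-in scorer in [0, 1]. This is NOT Jaro-Winkler."""
    return SequenceMatcher(None, left, right).ratio()


def pairwise_scores(strings, score=stand_in_score, npartitions=8,
                    scheduler="processes"):
    """Score every unordered pair of strings and return a pandas DataFrame."""
    # Materialize the pairs so dask can split them into partitions.
    pairs = list(combinations(strings, 2))
    bag = db.from_sequence(pairs, npartitions=npartitions)
    # Each bag element is a (left, right) tuple; append its score.
    scored = bag.map(lambda pair: (pair[0], pair[1], score(pair[0], pair[1])))
    # ProgressBar hooks into dask's local (non-distributed) schedulers.
    with ProgressBar():
        rows = scored.compute(scheduler=scheduler)
    return pd.DataFrame(rows, columns=["left", "right", "sim_score"])


# Tiny demo on the question's list; threads keep the demo cheap, while
# CPU-bound pure-Python scoring would more likely want "processes".
demo = pairwise_scores(
    ["adsf", "apple", "apples", "banana"], npartitions=2, scheduler="threads"
)
print(demo)
```

As far as I know, dask.diagnostics.ProgressBar reports percent complete and elapsed time rather than an ETA, and it does not apply to the distributed scheduler; with a dask.distributed.Client, dask.distributed.progress is the closest counterpart.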
Michael Delgado