I'm trying to compare two lists of strings and produce similarity metrics between them. The lists are of unequal length: one has roughly 50,000 entries, the other about 3,000.
Here are two MWE data frames that resemble my data:
import pandas as pd

forbes = pd.DataFrame(
    {
        "company_name": [
            "Deloitte",
            "PriceWaterhouseCoopers",
            "KPMG",
            "Ernst & Young",
            "intentionall typo company XYZ",
        ],
        "revenue": [100, 200, 300, 250, 400],
    }
)
sf = pd.DataFrame(
    {"salesforce_name": ["Deloite", "PriceWaterhouseCooper"], "CEO": ["John", "Jane"]}
)
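To show what the four metrics look like on a single pair (exact scores are illustrative; they depend on the fuzzywuzzy version and whether python-Levenshtein is installed):

from fuzzywuzzy import fuzz

a, b = "Deloitte", "Deloite"
print(fuzz.ratio(a, b))             # plain Levenshtein-based ratio
print(fuzz.partial_ratio(a, b))     # best-matching substring
print(fuzz.token_sort_ratio(a, b))  # ignores word order
print(fuzz.token_set_ratio(a, b))   # ignores word order and duplicated words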
Here's how I produce the similarity metrics: two nested for loops that score every possible pair of strings, which with my actual data means 50,000 * 3,000 = 150 million comparisons. That's a lot, and I feel like there's a smarter way of doing this, but I don't know what it is.
Here's my implementation:
from fuzzywuzzy import fuzz

rows = []
for forbes_name in forbes["company_name"]:
    for sf_name in sf["salesforce_name"]:
        r = fuzz.ratio(forbes_name, sf_name)               # Levenshtein-based ratio
        pr = fuzz.partial_ratio(forbes_name, sf_name)      # best-matching substring
        tsr = fuzz.token_sort_ratio(forbes_name, sf_name)  # ignores word order
        tser = fuzz.token_set_ratio(forbes_name, sf_name)  # ignores duplicated words
        # Keep only pairs that look like plausible matches on any metric.
        if r > 80 or pr > 80 or tsr > 80 or tser > 80:
            rows.append(
                {
                    "forbes_company": forbes_name,
                    "salesforce_company": sf_name,
                    "r": r,
                    "pr": pr,
                    "tsr": tsr,
                    "tser": tser,
                }
            )
# Build the result once at the end; concatenating inside the loop is quadratic.
scores = pd.DataFrame.from_records(rows)
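For what it's worth, the direction I'm currently eyeing is rapidfuzz, whose process.cdist appears to compute the whole score matrix in native code across multiple cores. A minimal sketch of how I think it would apply to the MWE above (assuming rapidfuzz is installed; I haven't benchmarked this at full scale):

import numpy as np
from rapidfuzz import fuzz, process

queries = forbes["company_name"].tolist()
choices = sf["salesforce_name"].tolist()

# One score matrix of shape (len(queries), len(choices)) per metric,
# computed in C across all cores. Note: at 50,000 x 3,000 each matrix
# is large, so this may need chunking or a smaller dtype.
r = process.cdist(queries, choices, scorer=fuzz.ratio, workers=-1)
pr = process.cdist(queries, choices, scorer=fuzz.partial_ratio, workers=-1)
tsr = process.cdist(queries, choices, scorer=fuzz.token_sort_ratio, workers=-1)
tser = process.cdist(queries, choices, scorer=fuzz.token_set_ratio, workers=-1)

# Same filter as above: keep pairs where any metric exceeds 80.
i, j = np.where((r > 80) | (pr > 80) | (tsr > 80) | (tser > 80))
scores = pd.DataFrame(
    {
        "forbes_company": [queries[k] for k in i],
        "salesforce_company": [choices[k] for k in j],
        "r": r[i, j],
        "pr": pr[i, j],
        "tsr": tsr[i, j],
        "tser": tser[i, j],
    }
)

Is something along these lines the right approach, or is there a better way to avoid the 150 million pairwise calls in Python?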