I'm trying to compare two lists of strings and produce similarity metrics between them. The lists are of unequal length: one has roughly 50,000 entries, the other about 3,000.
Here are two MWE data frames that resemble my data:
import pandas as pd

forbes = pd.DataFrame(
    {
        "company_name": [
            "Deloitte",
            "PriceWaterhouseCoopers",
            "KPMG",
            "Ernst & Young",
            "intentionall typo company XYZ",
        ],
        "revenue": [100, 200, 300, 250, 400],
    }
)
sf = pd.DataFrame(
    {"salesforce_name": ["Deloite", "PriceWaterhouseCooper"], "CEO": ["John", "Jane"]}
)
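To show what the four metrics look like on a single pair (exact scores are illustrative; they depend on the fuzzywuzzy version and whether python-Levenshtein is installed):

from fuzzywuzzy import fuzz

a, b = "Deloitte", "Deloite"
print(fuzz.ratio(a, b))             # plain Levenshtein-based ratio
print(fuzz.partial_ratio(a, b))     # best-matching substring
print(fuzz.token_sort_ratio(a, b))  # ignores word order
print(fuzz.token_set_ratio(a, b))   # ignores word order and duplicated words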
Here's how I produce the similarity metrics: two nested for loops that score every possible pair of strings, which with my actual data means 50,000 * 3,000 = 150 million comparisons. That's a lot, and I feel like there's a smarter way of doing this, but I don't know what it is.
Here's my implementation:
from fuzzywuzzy import fuzz

rows = []
for forbes_name in forbes["company_name"]:
    for sf_name in sf["salesforce_name"]:
        r = fuzz.ratio(forbes_name, sf_name)               # Levenshtein-based ratio
        pr = fuzz.partial_ratio(forbes_name, sf_name)      # best-matching substring
        tsr = fuzz.token_sort_ratio(forbes_name, sf_name)  # ignores word order
        tser = fuzz.token_set_ratio(forbes_name, sf_name)  # ignores duplicated words
        # Keep only pairs that look like plausible matches on any metric.
        if r > 80 or pr > 80 or tsr > 80 or tser > 80:
            rows.append(
                {
                    "forbes_company": forbes_name,
                    "salesforce_company": sf_name,
                    "r": r,
                    "pr": pr,
                    "tsr": tsr,
                    "tser": tser,
                }
            )
# Build the result once at the end; concatenating inside the loop is quadratic.
scores = pd.DataFrame.from_records(rows)
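For what it's worth, the direction I'm currently eyeing is rapidfuzz, whose process.cdist appears to compute the whole score matrix in native code across multiple cores. A minimal sketch of how I think it would apply to the MWE above (assuming rapidfuzz is installed; I haven't benchmarked this at full scale):

import numpy as np
from rapidfuzz import fuzz, process

queries = forbes["company_name"].tolist()
choices = sf["salesforce_name"].tolist()

# One score matrix of shape (len(queries), len(choices)) per metric,
# computed in C across all cores. Note: at 50,000 x 3,000 each matrix
# is large, so this may need chunking or a smaller dtype.
r = process.cdist(queries, choices, scorer=fuzz.ratio, workers=-1)
pr = process.cdist(queries, choices, scorer=fuzz.partial_ratio, workers=-1)
tsr = process.cdist(queries, choices, scorer=fuzz.token_sort_ratio, workers=-1)
tser = process.cdist(queries, choices, scorer=fuzz.token_set_ratio, workers=-1)

# Same filter as above: keep pairs where any metric exceeds 80.
i, j = np.where((r > 80) | (pr > 80) | (tsr > 80) | (tser > 80))
scores = pd.DataFrame(
    {
        "forbes_company": [queries[k] for k in i],
        "salesforce_company": [choices[k] for k in j],
        "r": r[i, j],
        "pr": pr[i, j],
        "tsr": tsr[i, j],
        "tser": tser[i, j],
    }
)

Is something along these lines the right approach, or is there a better way to avoid the 150 million pairwise calls in Python?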