
I have a list of strings that are messy, and I want to find, for each one, its best match from a list of cleanly-formatted strings, which also contains metadata about each. The strings in the messy list are repeated randomly through the list (typically with alternative spellings of the string). The list of messy strings is so long that looping through fuzzywuzzy is not feasible.

I've been trying to use match_most_similar from the string_grouper library. When I apply the function using this code:

import pandas as pd
import numpy as np
from string_grouper import match_strings, match_most_similar, group_similar_strings, StringGrouper
new_strings = pd.Series(df['Cited'])
# Create all matches:
matches = match_most_similar(data['caseName'], new_strings)

# Display the results:
pd.DataFrame({'new_strings': new_strings, 'duplicates': matches})

the match_most_similar function returns the original strings, rather than their matches from the clean list. (data['caseName'] is the clean list of strings.) Here is the output. None of what appears in duplicates is from data['caseName'].

pd.DataFrame({'new_strings': new_strings, 'duplicates': matches})
                                  new_strings                                    duplicates
0  Ashwander v. Tennessee Valley Authority,.txt  Ashwander v. Tennessee Valley Authority,.txt
1                             Bell v. Hood,.txt                             Bell v. Hood,.txt
2    Charles River Bridge v. Warren Bridge,.txt    Charles River Bridge v. Warren Bridge,.txt

Does anyone know what I must be doing wrong?

For reference, new_strings looks like this (I have limited it to just 3 elements for the post):

0    Ashwander v. Tennessee Valley Authority,.txt
1                               Bell v. Hood,.txt
2      Charles River Bridge v. Warren Bridge,.txt
Name: Cited, dtype: object

and data['caseName'] looks like this:

data['caseName']
0       HALLIBURTON OIL WELL CEMENTING CO. v. WALKER e...
1                              CLEVELAND v. UNITED STATES
2           CHAMPLIN REFINING CO. v. UNITED STATES ET AL.
3        UNITED STATES v. ALCEA BAND OF TILLAMOOKS ET AL.
4              UNITED STATES v. HOWARD P. FOLEY CO., INC.
                              ...                        
9025     DEPARTMENT OF HOMELAND SECURITY v. THURAISSIGIAM
9026    SEILA LAW LLC v. CONSUMER FINANCIAL PROTECTION...
9027            LIU v. SECURITIES AND EXCHANGE COMMISSION
9028                      COLORADO DEPT. OF STATE v. BACA
9029                             TRUMP v. MAZARS USA, LLP
Name: caseName, Length: 9030, dtype: object
Tom Clark
  • Since FuzzyWuzzy is too slow you should give https://github.com/maxbachmann/RapidFuzz a try (I am the author). It uses the same metrics, but is a lot faster. – maxbachmann Feb 25 '21 at 23:40
  • Thank you! `rapidfuzz` decreased the time by about 70%. That was helpful. – Tom Clark Mar 04 '21 at 14:45

1 Answer


I’m a contributor to string_grouper and I’m sorry I didn’t notice your question sooner.

match_most_similar returns the original string when it finds no match above the similarity threshold (min_similarity), whose default value is 0.8. So you could try lowering that value to see whether you get a more suitable result. For example,

matches = match_most_similar(data['caseName'],
                             new_strings,
                             min_similarity=0.6)

Bear in mind that min_similarity ranges from 0 (lowest) to 1 (highest). Furthermore, the lower the similarity threshold, the longer match_most_similar typically takes to run.
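To see why your output repeats the messy strings, here is the same fallback logic illustrated with a standard-library analogue (difflib, not string_grouper's TF-IDF/cosine similarity, so the scores differ, but the behavior is the same: anything below the cutoff comes back unchanged). The helper name and data are my own for illustration:

```python
import difflib

def most_similar(messy, clean, cutoff=0.8):
    """For each messy string, return the best match from `clean` scoring
    at least `cutoff`, falling back to the original string otherwise --
    mimicking match_most_similar's fallback behavior."""
    out = []
    for s in messy:
        hits = difflib.get_close_matches(s, clean, n=1, cutoff=cutoff)
        out.append(hits[0] if hits else s)
    return out

clean = ["Bell v. Hood", "Ashwander v. Tennessee Valley Authority"]

# "Bel Hod" scores about 0.74 against "Bell v. Hood", so it falls back
# to itself at the 0.8 threshold but matches once the cutoff is lowered:
print(most_similar(["Bel Hod"], clean, cutoff=0.8))  # ['Bel Hod']
print(most_similar(["Bel Hod"], clean, cutoff=0.6))  # ['Bell v. Hood']
```

If lowering min_similarity in string_grouper still returns the originals, that tells you the clean list genuinely has no candidate above the threshold for those strings.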

Also see https://github.com/Bergvca/string_grouper#kwargs for a list of other options you can tweak to improve your result.