I have a list of messy strings, and for each one I want to find its best match in a list of cleanly formatted strings, which also carries metadata about each entry. The messy strings repeat at random throughout the list, typically with alternative spellings. The messy list is so long that looping over it with fuzzywuzzy is not feasible.
I've been trying to use `match_most_similar` from the `string_grouper` library. When I apply the function using this code:
```python
import pandas as pd
import numpy as np
from string_grouper import match_strings, match_most_similar, group_similar_strings, StringGrouper

new_strings = pd.Series(df['Cited'])
# Create all matches:
matches = match_most_similar(data['caseName'], new_strings)
# Display the results:
pd.DataFrame({'new_strings': new_strings, 'duplicates': matches})
```
the `match_most_similar` function returns the original strings, rather than their matches from the clean list. (`data['caseName']` is the clean list of strings.) Here is the output. None of what appears in `duplicates` is from `data['caseName']`.
```
   new_strings                                   duplicates
0  Ashwander v. Tennessee Valley Authority,.txt  Ashwander v. Tennessee Valley Authority,.txt
1  Bell v. Hood,.txt                             Bell v. Hood,.txt
2  Charles River Bridge v. Warren Bridge,.txt    Charles River Bridge v. Warren Bridge,.txt
```
Does anyone know what I must be doing wrong?
For reference, `new_strings` looks like this (I have limited it to just 3 elements for the post):
```
0    Ashwander v. Tennessee Valley Authority,.txt
1    Bell v. Hood,.txt
2    Charles River Bridge v. Warren Bridge,.txt
Name: Cited, dtype: object
```
and `data['caseName']` looks like this:
```
0       HALLIBURTON OIL WELL CEMENTING CO. v. WALKER e...
1       CLEVELAND v. UNITED STATES
2       CHAMPLIN REFINING CO. v. UNITED STATES ET AL.
3       UNITED STATES v. ALCEA BAND OF TILLAMOOKS ET AL.
4       UNITED STATES v. HOWARD P. FOLEY CO., INC.
        ...
9025    DEPARTMENT OF HOMELAND SECURITY v. THURAISSIGIAM
9026    SEILA LAW LLC v. CONSUMER FINANCIAL PROTECTION...
9027    LIU v. SECURITIES AND EXCHANGE COMMISSION
9028    COLORADO DEPT. OF STATE v. BACA
9029    TRUMP v. MAZARS USA, LLP
Name: caseName, Length: 9030, dtype: object
```
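
One thing I have not ruled out: as far as I understand the `string_grouper` documentation, `match_most_similar` returns the original string whenever no candidate in the clean list clears the similarity threshold (`min_similarity`, which I believe defaults to 0.8). The trailing `,.txt` on every messy string might be enough to push the scores below that. Below is a minimal sketch of the clean-up I could apply before matching. The helper name `normalize_case_name` is my own and purely illustrative, and I am assuming the suffix always looks like the three examples above.

```python
import re


def normalize_case_name(raw: str) -> str:
    """Remove a trailing '.txt', then trailing commas/spaces, and collapse whitespace.

    Hypothetical helper: it assumes the messy names end in ',.txt' as in the
    sample shown in this post. Other suffixes would need their own rules.
    """
    # Drop the file extension, whatever its letter case.
    name = re.sub(r"\.txt$", "", raw.strip(), flags=re.IGNORECASE)
    # Drop the comma (and any spaces) left dangling before the extension.
    name = name.rstrip(" ,")
    # Collapse runs of whitespace into single spaces.
    return re.sub(r"\s+", " ", name)


messy = [
    "Ashwander v. Tennessee Valley Authority,.txt",
    "Bell v. Hood,.txt",
    "Charles River Bridge v. Warren Bridge,.txt",
]
cleaned = [normalize_case_name(s) for s in messy]
print(cleaned)
```

With pandas this would be `new_strings.map(normalize_case_name)`, and the result could then be passed to `match_most_similar`, possibly with a lower `min_similarity` keyword argument. I have not confirmed that this changes the output, so treat it as a guess at the cause rather than a fix.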