I'm building a fuzzy search program, using FuzzyWuzzy, to find matching names in a dataset. My data is in a DataFrame of 10378 rows, and `len(df['Full name'])` is 10378, as expected. But `len(choices)` is only 1695.

I'm running Python 2.7.10 and pandas 0.17.0, in an IPython Notebook.

    import pandas as pd
    from fuzzywuzzy import process

    # df is the main DataFrame described above
    choices = df['Full name'].astype(str).to_dict()

    def fuzzy_search_to_df(term, choices=choices):
        search = process.extract(term, choices, limit=len(choices))  # does the search itself
        rslts = pd.DataFrame(data=search, index=None, columns=['name', 'rel', 'df_ind'])  # puts the results in DataFrame form
        return rslts

    results = fuzzy_search_to_df(term='Ben Franklin')  # returns the search result for the given term
    matches = results[results.rel > 85]  # subset of results, these are the best search results
    find = df.iloc[matches['df_ind']]  # matches in the main df

As you can probably tell, I'm getting the index of the result in the `choices` dict as `df_ind`, which I had assumed would be the same as the index in the main DataFrame.
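
To check that assumption, here is a minimal sketch on made-up data (the `toy` frame, its names, and its labels are hypothetical, not my real dataset). It suggests that the dict keys produced by `to_dict()` are index labels rather than row positions, so a label-based lookup with `.loc` may be the safer way to map `df_ind` back to rows:

```python
import pandas as pd

# Hypothetical toy frame whose index labels are not 0..n-1.
toy = pd.DataFrame(
    {'Full name': ['Ben Franklin', 'John Adams', 'Sam Adams']},
    index=[10, 20, 30],
)

# Same conversion as in my code: the dict is keyed by index label.
choices = toy['Full name'].astype(str).to_dict()
print(sorted(choices))

# A key taken from the dict is a label, so .loc finds the matching row.
label = 20
by_label = toy.loc[[label]]
print(by_label['Full name'].tolist())

# toy.iloc[[label]] would fail here, because there is no row at position 20.
```

With a default 0..n-1 index, labels and positions coincide, which would explain why `.iloc` can appear to work.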

I'm fairly certain that the issue is in the first line, with the `to_dict()` function, as `len(df['Full name'].astype(str))` results in 10378 and `len(df['Full name'].to_dict())` results in 1695.
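
One possible explanation, which I have not confirmed on my real data, is that the DataFrame index contains repeated labels: a dict can hold each key only once, so rows sharing a label would collapse into a single entry. The sketch below reproduces that effect on a hypothetical three-row frame (`toy` and its contents are made up for illustration) and shows that `index.is_unique` and `reset_index(drop=True)` can be used to check for and work around it:

```python
import pandas as pd

# Hypothetical toy data: three rows, but the index label 0 appears twice,
# as can happen after concatenating frames without resetting the index.
toy = pd.DataFrame(
    {'Full name': ['Ben Franklin', 'John Adams', 'Benjamin Franklin']},
    index=[0, 1, 0],
)

names = toy['Full name'].astype(str)
as_dict = names.to_dict()

print(len(names))           # number of rows in the Series
print(len(as_dict))         # number of distinct index labels
print(toy.index.is_unique)  # reports whether any label repeats

# Giving every row its own label keeps one dict entry per row.
fixed = toy.reset_index(drop=True)['Full name'].astype(str).to_dict()
print(len(fixed))
```

If `df.index.is_unique` turns out to be False on the real data, that would be consistent with 10378 rows shrinking to 1695 dict entries.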