4

I'm building a fuzzy search program, using FuzzyWuzzy, to find matching names in a dataset. My data is in a DataFrame of 10378 rows, and len(df['Full name']) is 10378, as expected. But len(choices) is only 1695.

I'm running Python 2.7.10 and pandas 0.17.0, in an IPython Notebook.

from fuzzywuzzy import process
import pandas as pd

choices = df['Full name'].astype(str).to_dict()

def fuzzy_search_to_df(term, choices=choices):
    search = process.extract(term, choices, limit=len(choices))  # does the search itself
    rslts = pd.DataFrame(data=search, columns=['name', 'rel', 'df_ind'])  # puts the results in DataFrame form
    return rslts

results = fuzzy_search_to_df(term='Ben Franklin')  # returns the search results for the given term
matches = results[results.rel > 85]  # subset of results, these are the best search results
find = df.iloc[matches['df_ind']]  # matches in the main df

As you can probably tell, I'm getting the index of the result in the choices dict as df_ind, which I had assumed would be the same as the index in the main dataframe.

I'm fairly certain that the issue is in the first line, with the to_dict() function, as len(df['Full name'].astype(str)) results in 10378 and len(df['Full name'].to_dict()) results in 1695.
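The suspected behavior can be reproduced without pandas at all. A minimal sketch with made-up data: a plain Python dict keeps only one value per key, so building one from repeated keys silently drops entries.

```python
# Hypothetical data: three entries, but two share the key 0.
pairs = [(0, 'Ben Franklin'), (1, 'John Adams'), (0, 'Betsy Ross')]
d = dict(pairs)

print(len(pairs))  # 3
print(len(d))      # 2 -- the duplicate key 0 kept only the last value
print(d[0])        # 'Betsy Ross'
```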

nocoolsoft
  • what is `len(df.index.unique())` ? – Anand S Kumar Oct 26 '15 at 05:18
  • 1695. Does that mean that there are 1695 names and the rest are duplicates? – nocoolsoft Oct 26 '15 at 05:20
  • @nocoolsoft yes dictionaries cannot have duplicate keys :) – The6thSense Oct 26 '15 at 05:35
  • @nocoolsoft yes, as vignesh explained correctly, you cannot have duplicate keys in dictionary, but you have duplicate `index`es in your dataframe, hence the duplicate indexes get overwritten. Are the indexes important? – Anand S Kumar Oct 26 '15 at 05:37
  • @AnandSKumar using `choices = dict(zip(df['n'],df['Full name'].astype(str)))`, where df['n'] is np.arange(len(df)), worked fine and got what I needed. Had some indexing issues because I was importing the data from different Excel spreadsheets. How do I give you credit for your help? – nocoolsoft Oct 26 '15 at 06:06
  • @nocoolsoft Why make it that complex and slow, if you want that you can simply do - `df.reset_index()['Full name'].astype(str).to_dict()` and get the same thing, most probably much faster as well. – Anand S Kumar Oct 26 '15 at 06:08

1 Answer

3

The issue is that your dataframe has multiple rows with the same index. A Python dictionary can hold only a single value for a given key, and the Series.to_dict() method uses the index as the key, so the values from rows with duplicate indexes get overwritten by the values that come later.

A very simple example to show this behavior -

In [36]: df = pd.DataFrame([[1],[2]],index=[1,1],columns=['A'])

In [37]: df
Out[37]:
   A
1  1
1  2

In [38]: df['A'].to_dict()
Out[38]: {1: 2}

This is what is happening in your case. As noted in the comments, the index has only 1695 unique values, which you can confirm by checking len(df.index.unique()).

If you are content with having sequential numbers as keys (the positional index of the dataframe), then you can reset the index using DataFrame.reset_index() and call .to_dict() on the result. Example -

choices = df.reset_index()['Full name'].astype(str).to_dict()

Demo from above example -

In [40]: df.reset_index()['A'].to_dict()
Out[40]: {0: 1, 1: 2}

This is the same as the solution the OP found - choices = dict(zip(df['n'], df['Full name'].astype(str))) (as can be seen from the comments) - but this method should also be faster than building the dictionary with zip and dict.
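A minimal end-to-end sketch with a tiny made-up frame (no FuzzyWuzzy needed) shows why the fix also restores the row lookup: after reset_index(), the dictionary keys are positions 0..n-1, so df.iloc maps them straight back to the original rows.

```python
import pandas as pd

# Hypothetical frame with a duplicate index, as in the question.
df = pd.DataFrame({'Full name': ['Ben Franklin', 'John Adams', 'Betsy Ross']},
                  index=[0, 0, 1])

# Building choices directly collapses the duplicate index.
choices = df['Full name'].astype(str).to_dict()
print(len(choices))  # 2

# After reset_index() every row gets a unique positional key.
fixed = df.reset_index()['Full name'].astype(str).to_dict()
print(len(fixed))    # 3

# Those keys are positional, so iloc recovers the original rows.
print(df.iloc[[2]]['Full name'].tolist())  # ['Betsy Ross']
```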

Anand S Kumar