4

I'm building a fuzzy search program, using FuzzyWuzzy, to find matching names in a dataset. My data is in a DataFrame of 10378 rows, and len(df['Full name']) is 10378, as expected. But len(choices) is only 1695.

I'm running Python 2.7.10 and pandas 0.17.0, in an IPython Notebook.

from fuzzywuzzy import process
import pandas as pd

choices = df['Full name'].astype(str).to_dict()

def fuzzy_search_to_df(term, choices=choices):
    search = process.extract(term, choices, limit=len(choices))  # does the search itself
    rslts = pd.DataFrame(data=search, columns=['name', 'rel', 'df_ind'])  # puts the results in DataFrame form
    return rslts

results = fuzzy_search_to_df(term='Ben Franklin')  # returns the search results for the given term
matches = results[results.rel > 85]  # subset of results, these are the best search results
find = df.iloc[matches['df_ind']]  # matches in the main df

As you can probably tell, I'm getting the index of the result in the choices dict as df_ind, which I had assumed would be the same as the index in the main dataframe.

I'm fairly certain that the issue is in the first line, with the to_dict() function, as len(df['Full name'].astype(str)) results in 10378 and len(df['Full name'].to_dict()) results in 1695.
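The suspected behavior can be reproduced without pandas at all. A minimal sketch with made-up data: a plain Python dict keeps only one value per key, so building one from repeated keys silently drops entries.

```python
# Hypothetical data: three entries, but two share the key 0.
pairs = [(0, 'Ben Franklin'), (1, 'John Adams'), (0, 'Betsy Ross')]
d = dict(pairs)

print(len(pairs))  # 3
print(len(d))      # 2 -- the duplicate key 0 kept only the last value
print(d[0])        # 'Betsy Ross'
```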

nocoolsoft
  • what is `len(df.index.unique())` ? – Anand S Kumar Oct 26 '15 at 05:18
  • 1695. Does that mean that there are 1695 names and the rest are duplicates? – nocoolsoft Oct 26 '15 at 05:20
  • @nocoolsoft yes dictionaries cannot have duplicate keys :) – The6thSense Oct 26 '15 at 05:35
  • @nocoolsoft yes, as vignesh explained correctly, you cannot have duplicate keys in dictionary, but you have duplicate `index`es in your dataframe, hence the duplicate indexes get overwritten. Are the indexes important? – Anand S Kumar Oct 26 '15 at 05:37
  • @AnandSKumar using `choices = dict(zip(df['n'],df['Full name'].astype(str)))`, where df['n'] is np.arange(len(df)), worked fine and got what I needed. Had some indexing issues because I was importing the data from different Excel spreadsheets. How do I give you credit for your help? – nocoolsoft Oct 26 '15 at 06:06
  • @nocoolsoft Why make it that complex and slow, if you want that you can simply do - `df.reset_index()['Full name'].astype(str).to_dict()` and get the same thing, most probably much faster as well. – Anand S Kumar Oct 26 '15 at 06:08

1 Answer

3

The issue is that your dataframe has multiple rows with the same index. A Python dictionary can hold only a single value for a given key, and the Series.to_dict() method uses the index as the key, so the values from rows with duplicate indexes get overwritten by the values that come later.

A very simple example to show this behavior -

In [36]: df = pd.DataFrame([[1],[2]],index=[1,1],columns=['A'])

In [37]: df
Out[37]:
   A
1  1
1  2

In [38]: df['A'].to_dict()
Out[38]: {1: 2}

This is what is happening in your case. As noted in the comments, the index has only 1695 unique values, which you can confirm by checking len(df.index.unique()).

If you are content with having sequential numbers as keys (the positional index of the dataframe), then you can reset the index using DataFrame.reset_index() and call .to_dict() on the result. Example -

choices = df.reset_index()['Full name'].astype(str).to_dict()

Demo from above example -

In [40]: df.reset_index()['A'].to_dict()
Out[40]: {0: 1, 1: 2}

This is the same as the solution the OP found - choices = dict(zip(df['n'], df['Full name'].astype(str))) (as can be seen from the comments) - but this method should also be faster than building the dictionary with zip and dict.
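A minimal end-to-end sketch with a tiny made-up frame (no FuzzyWuzzy needed) shows why the fix also restores the row lookup: after reset_index(), the dictionary keys are positions 0..n-1, so df.iloc maps them straight back to the original rows.

```python
import pandas as pd

# Hypothetical frame with a duplicate index, as in the question.
df = pd.DataFrame({'Full name': ['Ben Franklin', 'John Adams', 'Betsy Ross']},
                  index=[0, 0, 1])

# Building choices directly collapses the duplicate index.
choices = df['Full name'].astype(str).to_dict()
print(len(choices))  # 2

# After reset_index() every row gets a unique positional key.
fixed = df.reset_index()['Full name'].astype(str).to_dict()
print(len(fixed))    # 3

# Those keys are positional, so iloc recovers the original rows.
print(df.iloc[[2]]['Full name'].tolist())  # ['Betsy Ross']
```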

Anand S Kumar