Creating a dict with nlargest out of a Pandas dataframe grouping by two different criteria

Question

This question is a follow-up from this other question.

I've got this dataframe (slightly different from the one in the other question - I added a "state" column now):

    person_code  #CNAE   growth   size    state
0           231     32     0.54     32       FR
1           233     43     0.12    333       LK
2           432     32     0.44     21       FR
3           431     56     0.32     23       KS
4           654     89     0.12     89       FR
5           764     32     0.20    211       TI
6           434     32     0.82     90       TI

Each row is a different person. I wanted to create a df that looks like this:

    person_code  #CNAE   growth   size  state              top3growth ...
0 .         231     32     0.54     32     FR     [dict_top3_type_32]
1 .         233     43     0.12    333     LK     [dict_top3_type_43]
2 .         432     32     0.44     21     FR     [dict_top3_type_32]                     
3 .         431     56     0.32     23     KS     [dict_top3_type_56]
4 .         654     89     0.12     89     FR     [dict_top3_type_89]
5 .         764     32     0.20    211     TI     [dict_top3_type_32]
6 .         434     32     0.82     90     TI     [dict_top3_type_32]

For different reasons we've agreed that the best way was to create different dictionaries. Each dict would comprise the 3 persons with the same "#CNAE" that had the largest "growth" rate. They looked like this:

                                          top3growth  
0  {'person_code': {0: 231, 2: 432, 6: 434}, 'gro...  
1  {'person_code': {1: 233}, 'growth': {1: 0.12},...  
2  {'person_code': {0: 231, 2: 432, 6: 434}, 'gro...  
3  {'person_code': {3: 431}, 'growth': {3: 0.32},...  
4  {'person_code': {4: 654}, 'growth': {4: 0.12},...  
5  {'person_code': {0: 231, 2: 432, 6: 434}, 'gro...  
6  {'person_code': {0: 231, 2: 432, 6: 434}, 'gro...

I was able to achieve that thanks to the excellent solution suggested by @Wen:

a=df.groupby('#CNAE',group_keys=False).apply(pd.DataFrame.nlargest,n=3,columns='growth')
df['top3growth']=df['#CNAE'].map(a.groupby('#CNAE').apply(lambda x : x.to_dict()))

My problem now is that I need to create a new column of dictionaries. But now I need to find the 3 largest growth persons that have the same "#CNAE" and "state" (not only the same #CNAE).

I've made a few attempts trying to groupby '#CNAE' and 'state', like this one below, but it didn't work out:

a=df.groupby(['#CNAE','state'],group_keys=False).apply(pd.DataFrame.nlargest,n=3,columns='growth')
df['top3growthCNAEstate']=df[['#CNAE','state']].map(a.groupby(['#CNAE','state']).apply(lambda x : x.to_dict()))

I understand the solution would look much like this but I can't get it to work out. Would anyone have a suggestion?

score 1 · Accepted Answer · answered Mar 18 '18 at 13:42

I think need join with rename for set new column name:

b = a.groupby(['#CNAE','state']).apply(lambda x : x.to_dict()).rename('top3growthCNAEstate')
df1 = df.join(b, on=['#CNAE','state'])
print (df1)
   person_code  #CNAE  growth  size state  \
0          231     32    0.54    32    FR   
1          233     43    0.12   333    LK   
2          432     32    0.44    21    FR   
3          431     56    0.32    23    KS   
4          654     89    0.12    89    FR   
5          764     32    0.20   211    TI   
6          434     32    0.82    90    TI   

                                 top3growthCNAEstate  
0  {'person_code': {0: 231, 2: 432}, '#CNAE': {0:...  
1  {'person_code': {1: 233}, '#CNAE': {1: 43}, 'g...  
2  {'person_code': {0: 231, 2: 432}, '#CNAE': {0:...  
3  {'person_code': {3: 431}, '#CNAE': {3: 56}, 'g...  
4  {'person_code': {4: 654}, '#CNAE': {4: 89}, 'g...  
5  {'person_code': {6: 434, 5: 764}, '#CNAE': {6:...  
6  {'person_code': {6: 434, 5: 764}, '#CNAE': {6:...

Creating a dict with nlargest out of a Pandas dataframe grouping by two different criteria

1 Answers1