0

I've got this dataframe:

    person_code  #CNAE   growth   size 
0           231     32     0.54     32
1           233     43     0.12    333
2           432     32     0.44     21
3           431     56     0.32     23
4           654     89     0.12     89
5           764     32     0.20    211
6           434     32     0.82     90

I need to create a new column called "top3growth". For that I will need to check df's #CNAE for each row and add an extra column pointing out which are the 3 persons with highest growth for that CNAE (it will add a dataframe inside the df dataframe). To create the "top3dfs" I'm using this groupby:

a=sql2.groupby('#CNAE',group_keys=False).apply(pd.DataFrame.nlargest,n=3,columns='growth')

(This solution came out of this question.)

It should look like this:

    person_code  #CNAE   growth   size              top3growth ...
0 .         231     32     0.54     32       [df_top3_type_32]
1 .         233     43     0.12    333       [df_top3_type_43]
2 .         432     32     0.44     21       [df_top3_type_32]                     
3 .         431     56     0.32     23       [df_top3_type_56]
4 .         654     89     0.12     89       [df_top3_type_89]
5 .         764     32     0.20    211       [df_top3_type_32]
6 .         434     32     0.82     90       [df_top3_type_32]
...

df_top3_type_32 should look like this (for example):

     person_code  #CNAE  growth  size
6 .          434    32    0.82    90
0 .          231    32    0.54    32
2 .          432    32    0.44    21

I'm trying to solve my problem by using:

df['top3growth']=np.nan
for i in df.index:
    df['top3growth'].loc[i]=a[a['#CNAE'] == df['#CNAE'].loc[i]]

But I'm getting:

ValueError: Incompatible indexer with DataFrame

Does anyone know what's going on? Is there a more efficient way of doing this (not using a for loop)?

aabujamra
  • 4,494
  • 13
  • 51
  • 101
  • Possible duplicate of [Pandas: DataFrame within DataFrame](https://stackoverflow.com/questions/17954520/pandas-dataframe-within-dataframe) – jpp Mar 05 '18 at 01:37
  • (Aside: the community has discussed [home-made tags in titles](https://meta.stackoverflow.com/questions/253028/why-is-removing-tags-from-the-title-suggested-so-often), and has decided that titles are better as natural English. Use the tag system for tags. Thank you). – halfer Mar 05 '18 at 01:37
  • @jpp that has nothing to do with that question. – aabujamra Mar 05 '18 at 01:38
  • Yes, it does. It explains clearly why what you want to do is a bad idea. This is a classic XY problem. Explain what you are trying to achieve, not how to work an inadvisable method. – jpp Mar 05 '18 at 01:40
  • @jpp does it explain why I'm getting this keyerror? I definitely don't see it there – aabujamra Mar 05 '18 at 01:41
  • "Why am I getting this error?" is not a [mcve]. You need to explain what you are trying to achieve. Have a look at [What is the XY problem?](https://meta.stackexchange.com/questions/66377/what-is-the-xy-problem) – jpp Mar 05 '18 at 01:42
  • @jpp ain't I explaining what I'm trying to achieve? "I need to create a new column called "top3growth". For that I will need to check df's #CNAE for each row and add an extra column pointing out which are the 3 persons with highest growth for that CNAE (it will add a dataframe inside the df dataframe). To create the "top3dfs" I'm using this groupby:" – aabujamra Mar 05 '18 at 01:47
  • I suggest update your title, avoid all reference to dataframe inside dataframes, provide input & desired output for sample data. Until then, yes you aren't explaining. – jpp Mar 05 '18 at 01:49
  • 1
    Maybe, `df.loc[i,'top3growth'] = a[a['#CNAE'] == df.at[i,'#CNAE']]` – DJK Mar 05 '18 at 01:49
  • @DJK you mean that inside the for loop or outside? – aabujamra Mar 05 '18 at 02:01
  • Thats a replacement for the line you have in the for loop – DJK Mar 05 '18 at 02:02
  • '#CNAE' in a is duplicated . since you get the lagest 3 , how can you map it back ? – BENY Mar 05 '18 at 02:02
  • @Wen I'm not sure if I understood your question – aabujamra Mar 05 '18 at 02:04

1 Answers1

0

There is one way, convert a to dict , then map it back

#a=df.groupby('#CNAE',group_keys=False).apply(pd.DataFrame.nlargest,n=3,columns='growth')
df['top3growth']=df['#CNAE'].map(a.groupby('#CNAE').apply(lambda x : x.to_dict()))
df
Out[195]: 
   person_code  #CNAE  growth  size  \
0          231     32    0.54    32   
1          233     43    0.12   333   
2          432     32    0.44    21   
3          431     56    0.32    23   
4          654     89    0.12    89   
5          764     32    0.20   211   
6          434     32    0.82    90   
                                          top3growth  
0  {'person_code': {0: 231, 2: 432, 6: 434}, 'gro...  
1  {'person_code': {1: 233}, 'growth': {1: 0.12},...  
2  {'person_code': {0: 231, 2: 432, 6: 434}, 'gro...  
3  {'person_code': {3: 431}, 'growth': {3: 0.32},...  
4  {'person_code': {4: 654}, 'growth': {4: 0.12},...  
5  {'person_code': {0: 231, 2: 432, 6: 434}, 'gro...  
6  {'person_code': {0: 231, 2: 432, 6: 434}, 'gro...  

After create your new column , if you want to convert the single cell back to dataframe

pd.DataFrame(df.top3growth[0])
Out[197]: 
   #CNAE  growth  person_code  size
0     32    0.54          231    32
2     32    0.44          432    21
6     32    0.82          434    90
BENY
  • 317,841
  • 20
  • 164
  • 234
  • Thanks @Wen. That df['top3growth']=df['#CNAE'].map(a.groupby('#CNAE').apply(lambda x : x.to_dict())) would be inside the for loop right? – aabujamra Mar 05 '18 at 02:06
  • 1
    @abutremutante nope , you do not need for loop here, you can think about it is merge :-) – BENY Mar 05 '18 at 02:07
  • Seems like a good solution but I'm not sure if your line up there filters the top 3 persons in growth (df is a very large dataframe). I will try it out – aabujamra Mar 05 '18 at 02:10
  • @abutremutante I am using your own code to create the data a here – BENY Mar 05 '18 at 02:12
  • I was trying to fit this in the groupby line: .apply(pd.DataFrame.nlargest,n=3,columns='growth') but couldn't make it. Is there a way to include that in here? That's the only missing part – aabujamra Mar 05 '18 at 02:48
  • @abutremutante you have it already `a=sql2.groupby('#CNAE',group_keys=False).apply(pd.DataFrame.nlargest,n=3,columns='growth')`, my 'a' is create by this line – BENY Mar 05 '18 at 02:54
  • ohh, that's right. I'm already burned out sorry. outstanding solution, saved me here!! – aabujamra Mar 05 '18 at 03:02