Groupby to create a list

Question

I am using JupyterLab to print some data in a spreadsheet in a specific way.

I have two different files:

1) 2)

For every original_id == id I want to group by country and list the brands and summing and listing the holding for each brand.

The result I got with my code is this:

FundID  Domicile (brand, AUM)
0   A1      IT (BBB, 10.0), UK (BBB, 7.0),
1   B2      CH (AAA, 12.0),
2   C3      DE (CCC, 5.0),
3   D4      CH (EEE, 9.0), UK (EEE, 11.0),

While, my objective is to get something like this:

The code is

import numpy as np
import pandas as pd

pd.set_option('display.max_columns', 500)

df_fofs = pd.read_excel('SampleDF.xlsx')
df_extract = pd.read_excel('SampleID_ex.xlsx')

df_extract

original_id
0   A1
1   B2
2   C3
3   D4

df_fofs

    brand   country id  holding
0   AAA       UK    A1  2000000
1   AAA       CH    B2  4000000
2   BBB       UK    A1  7000000
3   CCC       DE    C3  5000000
4   BBB       IT    A1  10000000
5   EEE       UK    D4  11000000
6   EEE       CH    D4  3000000
7   EEE       CH    D4  6000000
8   AAA       CH    B2  8000000

fund_ids = list(df_extract['original_id'])

result = {}
for fund in fund_ids:

    temp = []

    df_funds = df_fofs[(df_fofs['id'] == fund )][['country', 'brand', 'holding']]
    domicile_fof = df_fofs[df_fofs['id'] == fund ][['country', 'holding']]

    df_funds = df_funds.groupby(['country', 'brand'])["holding"].sum()
    domicile_fof = domicile_fof.groupby('country')["holding"].sum()

    s = ''

    for i in range(len(df_funds)):
        row = df_funds.reset_index().iloc[i]
        if row['holding'] >= 5000000:
            s += row['country'] + ' (' + str(row['brand']) + ', ' + str(round(((row['holding'])/1000000), 2)) + '), '

    result[fund] = [s]

df_result = pd.DataFrame.from_dict(result, orient = 'index')
df_result.reset_index(inplace = True)
df_result.columns = ['FundID', 'Domicile (brand, AUM)']

df_result

  FundID    Domicile (brand, AUM)
0   A1      IT (BBB, 10.0), UK (BBB, 7.0),
1   B2      CH (AAA, 12.0),
2   C3      DE (CCC, 5.0),
3   D4      CH (EEE, 9.0), UK (EEE, 11.0),

@VasylKolomiets Check the actual and expected output provided by OP. — shaik moeed, Nov 14 '19 at 10:24

splash58 · Accepted Answer · 2019-11-14T14:02:43.743

1

You can combine tables on id, group by id and country to make inner items and then save grouping by id only to create the outer level

def f(x):
    n = x.apply(lambda r: '{} ({})'.format(r['brand'],int(r['holding']/1000000)), axis=1)
    return '{} [{}]'.format(x.iloc[0]['country'],', '.join(n))


df_extract.merge(df_fofs, left_on='original_id', right_on='id') 
   .groupby(['original_id','country']).apply(f) \ 
   .groupby(level=0).apply(', '.join)

original_id
A1    IT [BBB (10)], UK [AAA (2), BBB (7)]
B2                   CH [AAA (4), AAA (8)]
C3                            DE [CCC (5)]
D4    CH [EEE (3), EEE (6)], UK [EEE (11)]
dtype: object

edited Nov 14 '19 at 14:02

answered Nov 14 '19 at 10:37

splash58

26,043
3
22
34

Thank you @splash58 but in the final output I need to have the same brand name just once for each country with the sum of holding in parenthesis. For example, for B2 it should be CH [ AAA (12) ] . Moreover, in my original file I have much more data and I get the following error: ('cannot convert float NaN to integer', 'occurred at index 83') How can I solve this? Thank you! – Emanuela Masucci Nov 15 '19 at 10:41

shaik moeed · Answer 2 · 2019-11-14T10:51:23.290

0

Try this,

>>> df   
  brand country  id  holding
0   AAA      UK  A1        2
1   AAA      CH  B2        4
2   BBB      UK  A1        7
3   CCC      DE  C3        5
4   BBB      IT  A1       10
5   EEE      UK  D4       11
6   EEE      CH  D4        3
7   EEE      CH  D4        6
8   AAA      CH  B2        8

>>> final_df = df.groupby(by='id').apply(lambda x: x.groupby(by='country')
                                                    .apply(lambda y: y.groupby(by='brand')
                                                  .agg(sum))).reset_index()

>>> final_df.groupby(by='id')\
        .apply(lambda x: ", ".join([f"{row['country']} [{row['brand']}({row['holding']})]"
                                      for _, row in x.iterrows()]))

id
A1    IT [BBB(10)], UK [AAA(2)], UK [BBB(7)]
B2                              CH [AAA(12)]
C3                               DE [CCC(5)]
D4                 CH [EEE(9)], UK [EEE(11)]
dtype: object

edited Nov 14 '19 at 10:51

answered Nov 14 '19 at 10:45

shaik moeed

5,300
1
18
54

Thank you so much! – Emanuela Masucci Nov 14 '19 at 17:21
How can I put the second column of the output in inn only one column in the dataframe? so that I can export it to an excel file having the list for each id. Thank you! – Emanuela Masucci Nov 14 '19 at 18:27
@EmanuelaMasucci For final output, do `.to_frame("your_column_name").reset_index()`. – shaik moeed Nov 15 '19 at 04:50

score 0 · Answer 3 · answered Nov 14 '19 at 10:46

What you are looking for is the function pandas.DataFrame.pivot_table. Documentation here pandas/pivot_table.

This code solves your example (instead of lists I used multiIndex)

import pandas as pd
df = pd.DataFrame([
    ('AAA','UK','A1',2000000),
    ('AAA','CH','B2',4000000),
    ('BBB','UK','A1',7000000),
    ('CCC','DE','C3',5000000),
    ('BBB','IT','A1',10000000),
    ('EEE','UK','D4',11000000),
    ('EEE','CH','D4',3000000),
    ('EEE','CH','D4',6000000),
    ('AAA','CH','B2',8000000)],
    columns=['brand',   'country', 'id',  'holding'])
df.pivot_table(values='holding',index=['id','country','brand'])

The result is resulting dataframe

Groupby to create a list

3 Answers3