
Please see my example below: how can I get the groupby result back on all 3 levels of the original MultiIndex?

In this example I want to see the group totals repeated on each brand row. For now I have applied a workaround using map (see below); its output is what I hope to get straight from the groupby.

import numpy as np
import pandas as pd

brands = ['Tesla', 'Tesla', 'Tesla', 'Peugeot', 'Peugeot', 'Citroen', 'Opel', 'Opel', 'Peugeot', 'Citroen', 'Opel']
years = [2018, 2017, 2016, 2018, 2017, 2017, 2018, 2017, 2016, 2016, 2016]
owners = ['Tesla', 'Tesla', 'Tesla', 'PSA', 'PSA', 'PSA', 'PSA', 'PSA', 'PSA', 'PSA', 'PSA']
index = pd.MultiIndex.from_arrays([owners, years, brands], names=['owner', 'year', 'brand'])
data = np.random.randint(low=100, high=1000, size=len(index), dtype=int)
weight = np.random.randint(low=1, high=10, size=len(index), dtype=int)
df = pd.DataFrame({'data': data, 'weight': weight}, index=index)
df.loc[('PSA', 2017, 'Opel'), 'data'] = np.nan
df.loc[('PSA', 2016, 'Opel'), 'data'] = np.nan
df.loc[('PSA', 2016, 'Citroen'), 'data'] = np.nan
df.loc[('Tesla', 2016, 'Tesla'), 'data'] = np.nan

out:

                        data    weight
owner   year    brand       
PSA     2016    Citroen NaN     5
                Opel    NaN     5
                Peugeot 250.0   2
        2017    Citroen 469.0   4
                Opel    NaN     5
                Peugeot 768.0   5
        2018    Opel    237.0   6
                Peugeot 663.0   4
Tesla   2016    Tesla   NaN     3
        2017    Tesla   695.0   6
        2018    Tesla   371.0   5

I have tried grouping on the index with "level" as well as on columns with "by". And I have tried "as_index=False" with .sum() as well as "group_keys=False" with .apply(sum). But I am not able to get the brand level back in the groupby output:

grouped = df.groupby(level=['owner', 'year'], group_keys=False) #type: <class 'pandas.core.groupby.generic.DataFrameGroupBy'>
grouped.apply(sum)

out:

        data    weight  group_data
owner   year            
PSA     2016    250.0   12.0    750.0
        2017    1237.0  14.0    3711.0
        2018    900.0   10.0    1800.0
Tesla   2016    0.0     3.0     0.0
        2017    695.0   6.0     695.0
        2018    371.0   5.0     371.0

Similar:

grouped = df.groupby(by=['owner', 'year'], group_keys=False) #type: <class 'pandas.core.groupby.generic.DataFrameGroupBy'>
grouped.apply(sum)

or:

grouped = df.groupby(by=['owner', 'year'], as_index=False, group_keys=False) #type: <class 'pandas.core.groupby.generic.DataFrameGroupBy'>
grouped.sum()

Workaround:

grouped = df.groupby(level=['owner', 'year'], group_keys=False) #type: <class 'pandas.core.groupby.generic.DataFrameGroupBy'>
df_owner_year = grouped.apply(sum)
s_data = df_owner_year['data']
df['group_data'] = df.index.map(s_data)
df

out:

                        data    weight  group_data
owner   year    brand           
PSA     2016    Citroen NaN     5   250.0
                Opel    NaN     5   250.0
                Peugeot 250.0   2   250.0
        2017    Citroen 469.0   4   1237.0
                Opel    NaN     5   1237.0
                Peugeot 768.0   5   1237.0
        2018    Opel    237.0   6   900.0
                Peugeot 663.0   4   900.0
Tesla   2016    Tesla   NaN     3   0.0
        2017    Tesla   695.0   6   695.0
        2018    Tesla   371.0   5   371.0
Wouter
  • I am not sure I understand, but isn't the result of your workaround the same as you defined in the beginning? Isn't the purpose of an index to uniquely identify the records/fields you are looking for? If so, then grouping by the entire index shouldn't result in any changes, because you are grouping by individuals. – KenHBS May 12 '19 at 12:25
  • @KenHBS thanks. I group by level 0 and 1 of the index and try to get the group totals back in the df with level 2 preserved. – Wouter May 13 '19 at 07:10

2 Answers


You can use groupby to accomplish this.

df = df.sort_index()
print(df)

                     data  weight
owner year brand                 
PSA   2016 Citroen    NaN       4
           Opel       NaN       7
           Peugeot  880.0       1
      2017 Citroen  164.0       2
           Opel       NaN       5
           Peugeot  607.0       8
      2018 Opel     809.0       1
           Peugeot  317.0       8
Tesla 2016 Tesla      NaN       1
      2017 Tesla    384.0       9
      2018 Tesla    550.0       9

Group by owner and year, then assign the summed result to your new column; pandas aligns the per-group Series back onto the original rows via the shared index levels.

df['new'] = df.groupby(['owner', 'year'])['data'].sum()
print(df)

                    data  weight     new
owner year brand                         
PSA   2016 Citroen    NaN       4   880.0
           Opel       NaN       7   880.0
           Peugeot  880.0       1   880.0
      2017 Citroen  164.0       2   771.0
           Opel       NaN       5   771.0
           Peugeot  607.0       8   771.0
      2018 Opel     809.0       1  1126.0
           Peugeot  317.0       8  1126.0
Tesla 2016 Tesla      NaN       1     0.0
      2017 Tesla    384.0       9   384.0
      2018 Tesla    550.0       9   550.0

EDIT

A further question was asked: why does df['new'] fill with NaN when the grouping keys are columns, but with the proper values when the grouping keys are index levels? I posed this question on SO and an excellent answer by @Jezrael is here.
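The gist of that behavior, as a minimal sketch using the small frame from the comments below: .sum() collapses to one row per group, which cannot align back against a plain RangeIndex, while .transform('sum') returns a Series carrying the original row index, so the assignment fills correctly whether the keys are columns or index levels.

```python
import pandas as pd

# Column-keyed frame from the comment below (no MultiIndex)
df = pd.DataFrame({'group': ['a', 'a', 'b', 'b'],
                   'data': [5, 10, 100, 30]})

# transform('sum') broadcasts each group total back to its member rows,
# keeping the original index, so plain assignment aligns row-for-row
df['new'] = df.groupby('group')['data'].transform('sum')
# df['new'] is now 15, 15, 130, 130
```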

run-out
  • Thanks. I did not realize that the sum() loses the index from the groupby (grouped.groups -> "MultiIndex(levels=[['PSA', 'Tesla'], [2016, 2017, 2018], ['Citroen', 'Opel', 'Peugeot', 'Tesla']],.. . By immediately assigning the grouped & summed series to the original df through the df['new'] = you avoid losing the index. It is like an immediate map, the way I understand it. – Wouter May 13 '19 at 07:33
  • 1
    Sorry, but this does not seem to work for me in other situations. Real simple: df = pd.DataFrame({'group':['a','a','b','b'], 'data':[5,10,100,30]},columns=['group', 'data']) df['new'] = df.groupby('group')['data'].sum() print(df) -> the new column only contains NaNs. The groupby result without the assignment to the new series looks fine... Adding groupby_result = df.groupby('group')['data'].sum() df['new'] = df['group'].map(groupby_result) does give me the correct new column. Can you please explain? – Wouter May 17 '19 at 08:00
  • 1
    When the grouping is at the index level, the rows fill in as above, but when at the column level, they do not. I do not have a good reason why and i'm going to post a question in SO. – run-out May 17 '19 at 08:52
  • 1
    Question asked: https://stackoverflow.com/questions/56183032/groupby-sum-index-vs-column-results – run-out May 17 '19 at 09:01
  • 1
    Thanks!, I have accepted your answer and to avoid confusion it may be better if you (please) add the reference to- or the outcome of- your new question in the answer? – Wouter May 17 '19 at 10:07

I am sure there are cases in which MultiIndex is valuable, but I usually just want to get rid of it as soon as possible, so I'd start off with df = df.reset_index().

Then you can easily group by brand, like for example:

>>> df.groupby('brand').agg({'weight': sum, 'data': sum})
#          weight    data
# brand                  
# Citroen      10   784.0
# Opel         13   193.0
# Peugeot      14  1663.0
# Tesla        18   507.0

Or group by owner and year:

>>> df.groupby(['owner', 'year']).agg({'weight': sum, 'data': sum})
#             weight    data
# owner year                
# PSA   2016      17   879.0
#       2017       8  1264.0
#       2018      12   497.0
# Tesla 2016       8     0.0
#       2017       4   151.0
#       2018       6   356.0
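If you do need the 3-level MultiIndex back afterwards (as the comment below asks), the round trip is cheap, as in this sketch on a small frame shaped like the question's:

```python
import pandas as pd

# Small frame mirroring the question's structure (values are illustrative)
index = pd.MultiIndex.from_arrays(
    [['Tesla', 'PSA', 'PSA'], [2018, 2018, 2017], ['Tesla', 'Peugeot', 'Citroen']],
    names=['owner', 'year', 'brand'])
df = pd.DataFrame({'data': [371.0, 663.0, 469.0], 'weight': [5, 4, 4]}, index=index)

flat = df.reset_index()   # index levels become ordinary columns; group/merge here
restored = flat.set_index(['owner', 'year', 'brand'])  # rebuild the MultiIndex
```

set_index with the same three keys restores a frame equal to the original, so dropping the index for the groupby costs nothing but the two calls.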
KenHBS
  • Thanks. My example is not clear enough but I need the multi-index here because I want to create new unique rows with the 3 keys. – Wouter May 13 '19 at 07:34