
I have a MultiIndex pandas DataFrame after calling groupby, and need to flatten it in a fashion similar to flattening a pivot table. Most of the analysis is already built around the grouped object, so would prefer to not have to refactor into a pivot table.

Consider some dummy data,

import numpy as np
import pandas as pd

dummy_data = pd.DataFrame({'Ccy': ['EUR', 'EUR', 'CAD', 'CAD', 'EUR', 'EUR', 'CAD', 'EUR'],
                           'Venue': ['BAML']*5 + ['BARX']*3,
                           'Price': np.abs(np.random.randn(8)),
                           'volume': np.abs(10*np.random.randn(8))},
                          index=pd.date_range('7/19/2017', periods=8))

dummy_data.index.name = "datetime"
print(dummy_data)


>>>            Ccy     Price Venue     volume
datetime                                  
2017-07-19  EUR  1.338521  BAML  11.227553
2017-07-20  EUR  0.882715  BAML   0.307711
2017-07-21  CAD  0.977815  BAML  14.196170
2017-07-22  CAD  1.262272  BAML   0.055213
2017-07-23  EUR  0.752433  BAML   5.315777
2017-07-24  EUR  0.699008  BARX   2.299045
2017-07-25  CAD  1.625567  BARX   6.474822
2017-07-26  EUR  2.122562  BARX   5.026135

The goal is to group the data by Ccy and Venue, apply a filter operation to each subgroup, and then flatten the groups back into the format of the original frame. Consider a simple filter that retains rows where Price > 0.8:

dummy_data.groupby(['Ccy', 'Venue']).apply(lambda x: x[x['Price'] > 0.8])

>>>                      Ccy     Price Venue     volume
Ccy Venue datetime                                  
CAD BAML  2017-07-21  CAD  0.977815  BAML  14.196170
          2017-07-22  CAD  1.262272  BAML   0.055213
    BARX  2017-07-25  CAD  1.625567  BARX   6.474822
EUR BAML  2017-07-19  EUR  1.338521  BAML  11.227553
          2017-07-20  EUR  0.882715  BAML   0.307711
    BARX  2017-07-26  EUR  2.122562  BARX   5.026135

All I need now is to fill-in/flatten the Ccy and Venue columns such that

Ccy Venue datetime                                  
CAD BAML  2017-07-21  ...                       CAD BAML 2017-07-21  ...
          2017-07-22  ...                       CAD BAML 2017-07-22  ... 
    BARX  2017-07-25  ...    ---> BECOMES --->  CAD BARX 2017-07-25 ...
EUR BAML  2017-07-19  ...                       EUR BAML 2017-07-19 ... 
          2017-07-20  ...                       EUR BAML 2017-07-20  ...

I need to do this because we have a range of plotting utilities that aren't flexible enough to handle grouped data. Shouldn't there be an ungroup() or flatten() method that is reciprocal to the groupby() operation?

Note: the trivial filter in this example could have been applied to the ungrouped data. In reality, my filter is more complex and only makes sense on subgroups.
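As a sketch of what a genuinely group-dependent filter might look like (the group-mean threshold here is hypothetical, not the real filter), keep only rows priced above their own (Ccy, Venue) group's mean:

```python
import numpy as np
import pandas as pd

np.random.seed(0)  # reproducible dummy data
dummy_data = pd.DataFrame({'Ccy': ['EUR', 'EUR', 'CAD', 'CAD', 'EUR', 'EUR', 'CAD', 'EUR'],
                           'Venue': ['BAML']*5 + ['BARX']*3,
                           'Price': np.abs(np.random.randn(8)),
                           'volume': np.abs(10*np.random.randn(8))},
                          index=pd.date_range('7/19/2017', periods=8, name='datetime'))

# The threshold depends on a per-group statistic, so the filter only
# makes sense relative to each (Ccy, Venue) subgroup.
above_group_mean = dummy_data.groupby(['Ccy', 'Venue']).apply(
    lambda g: g[g['Price'] > g['Price'].mean()])
print(above_group_mean)
```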


Solution Attempts (Attempts 1 and 2 from pivot table solution)

Attempt 1: the melt method loses the datetime index.

    print(dummy_data.groupby(['Ccy', 'Venue']).apply(lambda x: x[x['Price'] > 0.8]).melt())

>>>   variable      value
0       Ccy        CAD
1       Ccy        CAD
2       Ccy        CAD
3       Ccy        EUR
4       Ccy        EUR
5       Ccy        EUR
6     Price   0.977815
7     Price    1.26227
8     Price    1.62557
9     Price    1.33852

Attempt 2: the accepted answer leads to a KeyError

dummy_data.groupby(['Ccy', 'Venue']).apply(lambda x: x[x['Price'] > 0.8]).unstack().reset_index().drop('level_0', axis=1)

KeyError: 'level_0'

(The index levels here are named Ccy and Venue, so reset_index() never creates a level_0 column to drop.)

Attempt 3: reset_index() leads to ValueError

dummy_data.groupby(['Ccy', 'Venue']).apply(lambda x: x[x['Price'] > 0.8]).reset_index()

ValueError: cannot insert Venue, already exists

(The group keys exist both as index levels and as regular columns, so reset_index() cannot insert them back as columns.)

Attempt 4: as_index=False (without group_keys keyword)

out = dummy_data.groupby(['Ccy', 'Venue'], as_index=False).apply(lambda x: x[x['Price'] > 0.8])
print(out)
print(out.index)

              Ccy     Price Venue     volume
  datetime                                  
0 2017-07-21  CAD  0.977815  BAML  14.196170
  2017-07-22  CAD  1.262272  BAML   0.055213
1 2017-07-25  CAD  1.625567  BARX   6.474822
2 2017-07-19  EUR  1.338521  BAML  11.227553
  2017-07-20  EUR  0.882715  BAML   0.307711
3 2017-07-26  EUR  2.122562  BARX   5.026135
MultiIndex(levels=[[0, 1, 2, 3], [2017-07-19 00:00:00, 2017-07-20 00:00:00, 2017-07-21 00:00:00, 2017-07-22 00:00:00, 2017-07-25 00:00:00, 2017-07-26 00:00:00]],
           labels=[[0, 0, 1, 2, 2, 3], [2, 3, 4, 0, 1, 5]],
           names=[None, u'datetime'])

This gets me very close, but this object is still MultiIndexed. How can we just get the datetime index back?
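One workaround (a sketch using DataFrame.droplevel, available in modern pandas) is to take the default grouped result and drop the two group levels, which restores the plain datetime index:

```python
import numpy as np
import pandas as pd

np.random.seed(0)
dummy_data = pd.DataFrame({'Ccy': ['EUR', 'EUR', 'CAD', 'CAD', 'EUR', 'EUR', 'CAD', 'EUR'],
                           'Venue': ['BAML']*5 + ['BARX']*3,
                           'Price': np.abs(np.random.randn(8)),
                           'volume': np.abs(10*np.random.randn(8))},
                          index=pd.date_range('7/19/2017', periods=8, name='datetime'))

grouped = dummy_data.groupby(['Ccy', 'Venue']).apply(lambda x: x[x['Price'] > 0.8])

# Drop the Ccy/Venue index levels that groupby prepended; rows come
# back in group order, so sort to recover chronological order.
flat = grouped.droplevel(['Ccy', 'Venue']).sort_index()
print(flat.index)
```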

Adam Hughes

1 Answer


This may be what you want:

dummy_data.groupby(['Ccy', 'Venue'], group_keys=False)\
          .apply(lambda x: x[x['Price'] > 0.8])
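
A quick sanity check (a sketch, not from the original answer) that group_keys=False leaves the original DatetimeIndex intact:

```python
import numpy as np
import pandas as pd

np.random.seed(0)
dummy_data = pd.DataFrame({'Ccy': ['EUR', 'EUR', 'CAD', 'CAD', 'EUR', 'EUR', 'CAD', 'EUR'],
                           'Venue': ['BAML']*5 + ['BARX']*3,
                           'Price': np.abs(np.random.randn(8)),
                           'volume': np.abs(10*np.random.randn(8))},
                          index=pd.date_range('7/19/2017', periods=8, name='datetime'))

flat = dummy_data.groupby(['Ccy', 'Venue'], group_keys=False)\
                 .apply(lambda x: x[x['Price'] > 0.8])

# group_keys=False suppresses the Ccy/Venue index levels entirely,
# so the result carries the original datetime index.
print(type(flat.index).__name__)  # DatetimeIndex
```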
cs95
GeoMatt22
  • Ahh slick, thank you. Wasn't aware of this magic keyword. Any reason that this solution is preferable to COLDSPEED's suggestion? Both do the same thing more or less. Can you update the answer to have the .apply() filter just for completeness – Adam Hughes Jul 20 '17 at 19:03
  • @AdamHughes Speed... this solution is a few ticks faster because the index is inherently dropped. – cs95 Jul 20 '17 at 19:06
  • Ok, I'm going to mark this as the accepted answer for (presumed) speed increase and winning code golf. Appreciate both solutions – Adam Hughes Jul 20 '17 at 19:07
  • Not really. It just seemed cleaner to do it up front, rather than after (e.g. if you are doing multiple operations on the `groupby()`, then you only do this once). I did not know the magic word until just now, but seems handy :) – GeoMatt22 Jul 20 '17 at 19:08
  • Just noticed that this works without the `as_index` keyword. Should we drop it and only retain `group_keys` in accepted answer? – Adam Hughes Jul 20 '17 at 19:19
  • 1
    @AdamHughes Edited. Also, deleted my answer, this is miles better. – cs95 Jul 20 '17 at 19:24