I have a MultiIndex pandas DataFrame after calling `groupby`, and need to flatten it in a fashion similar to flattening a pivot table. Most of the analysis is already built around the grouped object, so I would prefer not to have to refactor it into a pivot table.
Consider some dummy data:
import numpy as np
import pandas as pd

dummy_data = pd.DataFrame({'Ccy': ['EUR', 'EUR', 'CAD', 'CAD', 'EUR', 'EUR', 'CAD', 'EUR'],
                           'Venue': ['BAML']*5 + ['BARX']*3,
                           'Price': np.abs(np.random.randn(8)),
                           'volume': np.abs(10*np.random.randn(8))
                           },
                          index=pd.date_range('7/19/2017', periods=8))
dummy_data.index.name = "datetime"
print(dummy_data)
>>> Ccy Price Venue volume
datetime
2017-07-19 EUR 1.338521 BAML 11.227553
2017-07-20 EUR 0.882715 BAML 0.307711
2017-07-21 CAD 0.977815 BAML 14.196170
2017-07-22 CAD 1.262272 BAML 0.055213
2017-07-23 EUR 0.752433 BAML 5.315777
2017-07-24 EUR 0.699008 BARX 2.299045
2017-07-25 CAD 1.625567 BARX 6.474822
2017-07-26 EUR 2.122562 BARX 5.026135
The goal is to group the data by `Ccy` and `Venue`, apply a filter operation to each subgroup, and then flatten the groups back into the format of the original frame. Consider a simple filter that retains rows where `Price` > 0.8:
dummy_data.groupby(['Ccy', 'Venue']).apply(lambda x: x[x['Price'] > 0.8])
>>> Ccy Price Venue volume
Ccy Venue datetime
CAD BAML 2017-07-21 CAD 0.977815 BAML 14.196170
2017-07-22 CAD 1.262272 BAML 0.055213
BARX 2017-07-25 CAD 1.625567 BARX 6.474822
EUR BAML 2017-07-19 EUR 1.338521 BAML 11.227553
2017-07-20 EUR 0.882715 BAML 0.307711
BARX 2017-07-26 EUR 2.122562 BARX 5.026135
All I need now is to fill in/flatten the `Ccy` and `Venue` columns such that
Ccy Venue datetime
CAD BAML 2017-07-21 ... CAD BAML 2017-07-21 ...
2017-07-22 ... CAD BAML 2017-07-22 ...
BARX 2017-07-25 ... ---> BECOMES ---> CAD BARX 2017-07-25 ...
EUR BAML 2017-07-19 ... EUR BAML 2017-07-19 ...
2017-07-20 ... EUR BAML 2017-07-20 ...
I need to do this because we have a range of plotting utilities that aren't flexible enough to handle grouped data. Shouldn't there be an `ungroup()` or `flatten()` method that is reciprocal to the `groupby()` operation?
Note: the trivial filter in this example could have been applied to the ungrouped data. In reality, my filter is more complex and only makes sense on subgroups.
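To illustrate why grouping is required, here is a hedged sketch of the kind of group-relative filter meant here. The keep-rows-at-or-above-the-group-mean rule is hypothetical, chosen only because it cannot be evaluated without first splitting into subgroups:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
df = pd.DataFrame({
    'Ccy': ['EUR', 'EUR', 'CAD', 'CAD', 'EUR', 'EUR', 'CAD', 'EUR'],
    'Venue': ['BAML'] * 5 + ['BARX'] * 3,
    'Price': np.abs(rng.standard_normal(8)),
}, index=pd.date_range('2017-07-19', periods=8, name='datetime'))

# Hypothetical group-relative rule: keep rows whose Price is at or above
# that (Ccy, Venue) subgroup's own mean -- a condition that only makes
# sense once the frame has been split into subgroups.
kept = df.groupby(['Ccy', 'Venue']).apply(
    lambda g: g[g['Price'] >= g['Price'].mean()])
```

A row-wise threshold like `Price > 0.8` never needs the grouping, but a threshold defined per subgroup does.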
Solution attempts (attempts 1 and 2 are from the pivot-table solution)
Attempt 1: the `melt` method loses the `datetime` index.
print(dummy_data.groupby(['Ccy', 'Venue']).apply(lambda x: x[x['Price'] > 0.8]).melt())
>>> variable value
0 Ccy CAD
1 Ccy CAD
2 Ccy CAD
3 Ccy EUR
4 Ccy EUR
5 Ccy EUR
6 Price 0.977815
7 Price 1.26227
8 Price 1.62557
9 Price 1.33852
Attempt 2: the accepted answer there leads to a `KeyError`.
dummy_data.groupby(['Ccy', 'Venue']).apply(lambda x: x[x['Price'] > 0.8]).unstack().reset_index().drop('level_0', axis=1)
KeyError: 'level_0'
Attempt 3: `reset_index()` leads to a `ValueError`.
dummy_data.groupby(['Ccy', 'Venue']).apply(lambda x: x[x['Price'] > 0.8]).reset_index()
ValueError: cannot insert Venue, already exists
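(The error arises because `Ccy` and `Venue` now exist twice, as index levels and as ordinary columns. One workaround, sketched here on the dummy data, is my assumption rather than a canonical recipe: drop the column copies before resetting the index.)

```python
import pandas as pd

dummy_data = pd.DataFrame({
    'Ccy': ['EUR', 'EUR', 'CAD', 'CAD', 'EUR', 'EUR', 'CAD', 'EUR'],
    'Venue': ['BAML'] * 5 + ['BARX'] * 3,
    'Price': [1.34, 0.88, 0.98, 1.26, 0.75, 0.70, 1.63, 2.12],
}, index=pd.date_range('2017-07-19', periods=8, name='datetime'))

grouped = dummy_data.groupby(['Ccy', 'Venue']).apply(lambda x: x[x['Price'] > 0.8])

# Drop the duplicated Ccy/Venue columns so reset_index() has nothing to
# collide with, then restore datetime as the index.
flat = (grouped.drop(columns=['Ccy', 'Venue'])
               .reset_index()
               .set_index('datetime'))
```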
Attempt 4: `as_index=False` (without the `group_keys` keyword)
out = dummy_data.groupby(['Ccy', 'Venue'], as_index=False).apply(lambda x: x[x['Price'] > 0.8])
print(out)
print(out.index)
Ccy Price Venue volume
datetime
0 2017-07-21 CAD 0.977815 BAML 14.196170
2017-07-22 CAD 1.262272 BAML 0.055213
1 2017-07-25 CAD 1.625567 BARX 6.474822
2 2017-07-19 EUR 1.338521 BAML 11.227553
2017-07-20 EUR 0.882715 BAML 0.307711
3 2017-07-26 EUR 2.122562 BARX 5.026135
MultiIndex(levels=[[0, 1, 2, 3], [2017-07-19 00:00:00, 2017-07-20 00:00:00, 2017-07-21 00:00:00, 2017-07-22 00:00:00, 2017-07-25 00:00:00, 2017-07-26 00:00:00]],
labels=[[0, 0, 1, 2, 2, 3], [2, 3, 4, 0, 1, 5]],
names=[None, u'datetime'])
This gets me very close, but the object is still MultiIndexed. How can I get just the `datetime` index back?
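For reference, here is a sketch of two routes that appear to recover the flat, datetime-indexed frame on the dummy data above (the fixed `Price` values are stand-ins for the random ones):

```python
import pandas as pd

dummy_data = pd.DataFrame({
    'Ccy': ['EUR', 'EUR', 'CAD', 'CAD', 'EUR', 'EUR', 'CAD', 'EUR'],
    'Venue': ['BAML'] * 5 + ['BARX'] * 3,
    'Price': [1.34, 0.88, 0.98, 1.26, 0.75, 0.70, 1.63, 2.12],
}, index=pd.date_range('2017-07-19', periods=8, name='datetime'))

# Route 1: group_keys=False tells apply not to prepend the group keys,
# so the filtered pieces come back on the plain datetime index.
flat1 = dummy_data.groupby(['Ccy', 'Venue'], group_keys=False).apply(
    lambda x: x[x['Price'] > 0.8])

# Route 2: keep the default grouped result, then discard the group index
# levels; Ccy and Venue survive because they are still ordinary columns.
grouped = dummy_data.groupby(['Ccy', 'Venue']).apply(lambda x: x[x['Price'] > 0.8])
flat2 = grouped.reset_index(level=['Ccy', 'Venue'], drop=True)
```

Route 2 leaves the rows in group order rather than chronological order, so a `sort_index()` may be wanted afterwards.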