I frequently have a dataframe with a large multiindex, and a secondary DataFrame with a MultiIndex that is a subset of the larger one. The secondary dataframe is usually some kind of lookup table. I often want to add the columns from the lookup table to the larger dataframe. The primary DataFrame is often very large, so I want to do this efficiently.
Here is an imaginary example, where I construct two dataframes df1
and df2
import pandas as pd
import numpy as np
arrays = [['sun', 'sun', 'sun', 'moon', 'moon', 'moon', 'moon', 'moon'],
['summer', 'winter', 'winter', 'summer', 'summer', 'summer', 'winter', 'winter'],
['one', 'one', 'two', 'one', 'two', 'three', 'one', 'two']]
tuples = list(zip(*arrays))
index = pd.MultiIndex.from_tuples(tuples, names=['Body', 'Season','Item'])
df1 = pd.DataFrame(np.random.randn(8,2), index=index,columns=['A','B'])
index2= pd.MultiIndex.from_tuples([('sun','summer'),('sun','winter'),('moon','summer'),('moon','winter')],
names=['Body','Season'])
df2 = pd.DataFrame(['Good','Bad','Ugly','Confused'],index=index2,columns = ['Mood'])
Giving the dataframes:
df1
A B
Body Season Item
sun summer one -0.409372 0.638502
winter one 1.448772 -1.460596
two -0.495634 -0.839063
moon summer one 1.296035 -1.439349
two -1.002667 0.508394
three -1.247748 -0.645782
winter one -1.848857 -0.858759
two 0.559172 2.202957
df2
Mood
Body Season
sun summer Good
winter Bad
moon summer Ugly
winter Confused
Now, suppose I want to add the columns from df2 to df1? This line is the only way I could find to do the job:
df1 = df1.reset_index().join(df2,on=['Body','Season']).set_index(df1.index.names)
resulting in:
A B Mood
Body Season Item
sun summer one -0.121588 0.272774 Good
winter one 0.233562 -2.005623 Bad
two -1.034642 0.315065 Bad
moon summer one 0.184548 0.820873 Ugly
two 0.838290 0.495047 Ugly
three 0.450813 -2.040089 Ugly
winter one -1.149993 -0.498148 Confused
two 2.406824 -2.031849 Confused
[8 rows x 3 columns]
It works, but there are two problems with this method. First, the line is ugly. Needing to reset the index, then recreate the multiindex, makes this simple operation seem needlessly complicated. Second, if I understand correctly, every time I run reset_index() and set_index(), a copy of the dataframe is created. I am often working with very large dataframes, and this seems very inefficient.
Is there a better way to do this?