Efficiently joining two dataframes based on multiple levels of a multiindex

Question

I frequently have a dataframe with a large multiindex, and a secondary DataFrame with a MultiIndex that is a subset of the larger one. The secondary dataframe is usually some kind of lookup table. I often want to add the columns from the lookup table to the larger dataframe. The primary DataFrame is often very large, so I want to do this efficiently.

Here is an imaginary example, where I construct two dataframes, df1 and df2:

import pandas as pd
import numpy as np

arrays = [['sun', 'sun', 'sun', 'moon', 'moon', 'moon', 'moon', 'moon'],
          ['summer', 'winter', 'winter', 'summer', 'summer', 'summer', 'winter', 'winter'],
          ['one', 'one', 'two', 'one', 'two', 'three', 'one', 'two']]

tuples = list(zip(*arrays))
index = pd.MultiIndex.from_tuples(tuples, names=['Body', 'Season','Item'])
df1 = pd.DataFrame(np.random.randn(8,2), index=index,columns=['A','B'])

index2= pd.MultiIndex.from_tuples([('sun','summer'),('sun','winter'),('moon','summer'),('moon','winter')],
                                  names=['Body','Season'])

df2 = pd.DataFrame(['Good','Bad','Ugly','Confused'],index=index2,columns = ['Mood'])

Giving the dataframes:

df1

                          A         B
Body Season Item                     
sun  summer one   -0.409372  0.638502
     winter one    1.448772 -1.460596
            two   -0.495634 -0.839063
moon summer one    1.296035 -1.439349
            two   -1.002667  0.508394
            three -1.247748 -0.645782
     winter one   -1.848857 -0.858759
            two    0.559172  2.202957

df2

                 Mood
Body Season          
sun  summer      Good
     winter       Bad
moon summer      Ugly
     winter  Confused

Now, suppose I want to add the columns from df2 to df1. This line is the only way I could find to do the job:

df1 = df1.reset_index().join(df2,on=['Body','Season']).set_index(df1.index.names)

resulting in:

                          A         B      Mood
Body Season Item
sun  summer one   -0.121588  0.272774      Good
     winter one    0.233562 -2.005623       Bad
            two   -1.034642  0.315065       Bad
moon summer one    0.184548  0.820873      Ugly
            two    0.838290  0.495047      Ugly
            three  0.450813 -2.040089      Ugly
     winter one   -1.149993 -0.498148  Confused
            two    2.406824 -2.031849  Confused

[8 rows x 3 columns]

It works, but there are two problems with this method. First, the line is ugly. Needing to reset the index, then recreate the multiindex, makes this simple operation seem needlessly complicated. Second, if I understand correctly, every time I run reset_index() and set_index(), a copy of the dataframe is created. I am often working with very large dataframes, and this seems very inefficient.

Is there a better way to do this?

you can always pass `inplace=True` to `reset_index/set_index` — acushner, May 29 '14 at 15:58

CAPSLOCK · Accepted Answer · 2022-02-15T14:51:48.273

22

join now allows merging of MultiIndex DataFrames with partially matching indices.

Following your example:

df1 = df1.join(df2, on=['Body','Season'])

Make sure the on columns are listed in exactly the order of the index levels of the other DataFrame. The on argument only names labels of the calling DataFrame, and those labels are matched, in the order you give them, against the index of the other DataFrame as it is.
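A minimal sketch of that ordering pitfall, rebuilding the example frames from the question. The seeded random values and the names correct and swapped are illustrative assumptions, not part of the answer:

```python
import numpy as np
import pandas as pd

# Rebuild the example frames from the question (values are seeded for repeatability)
index = pd.MultiIndex.from_tuples(
    [('sun', 'summer', 'one'), ('sun', 'winter', 'one'), ('sun', 'winter', 'two'),
     ('moon', 'summer', 'one'), ('moon', 'summer', 'two'), ('moon', 'summer', 'three'),
     ('moon', 'winter', 'one'), ('moon', 'winter', 'two')],
    names=['Body', 'Season', 'Item'])
df1 = pd.DataFrame(np.random.default_rng(0).standard_normal((8, 2)),
                   index=index, columns=['A', 'B'])

index2 = pd.MultiIndex.from_tuples(
    [('sun', 'summer'), ('sun', 'winter'), ('moon', 'summer'), ('moon', 'winter')],
    names=['Body', 'Season'])
df2 = pd.DataFrame(['Good', 'Bad', 'Ugly', 'Confused'], index=index2, columns=['Mood'])

# `on` lists df1's labels in the same order as df2's index levels, so the keys line up
correct = df1.join(df2, on=['Body', 'Season'])

# Same labels in swapped order: keys such as ('summer', 'sun') are looked up in
# df2's ('Body', 'Season') index, nothing matches, and Mood is left missing
swapped = df1.join(df2, on=['Season', 'Body'])

print(correct['Mood'].tolist())
print(swapped['Mood'].isna().all())
```

In this sketch the swapped call is expected to return a Mood column of missing values rather than raise, so checking the joined column for unexpected missing values is a cheap safeguard.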

Alternatively, join without using on; by default it will use the common index levels between the two DataFrames:

df1 = df1.join(df2)

Resulting df1:

                          A         B      Mood
Body Season Item                               
sun  summer one   -0.483779  0.981052      Good
     winter one   -0.309939  0.803862       Bad
            two   -0.413732  0.025331       Bad
moon summer one   -0.926068 -1.316808      Ugly
            two    0.221627 -0.226154      Ugly
            three  1.064856  0.402827      Ugly
     winter one    0.526461 -0.932231  Confused
            two   -0.296415 -0.812374  Confused
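As a sanity check, the sketch below compares the level-based join with the reset_index round trip from the question on the same example data. The seeded values and the names joined, roundtrip and same are assumptions for illustration; both frames are sorted before comparison so the check does not depend on row order.

```python
import numpy as np
import pandas as pd

# Rebuild the example frames from the question (values are seeded for repeatability)
index = pd.MultiIndex.from_tuples(
    [('sun', 'summer', 'one'), ('sun', 'winter', 'one'), ('sun', 'winter', 'two'),
     ('moon', 'summer', 'one'), ('moon', 'summer', 'two'), ('moon', 'summer', 'three'),
     ('moon', 'winter', 'one'), ('moon', 'winter', 'two')],
    names=['Body', 'Season', 'Item'])
df1 = pd.DataFrame(np.random.default_rng(0).standard_normal((8, 2)),
                   index=index, columns=['A', 'B'])

index2 = pd.MultiIndex.from_tuples(
    [('sun', 'summer'), ('sun', 'winter'), ('moon', 'summer'), ('moon', 'winter')],
    names=['Body', 'Season'])
df2 = pd.DataFrame(['Good', 'Bad', 'Ugly', 'Confused'], index=index2, columns=['Mood'])

# Join on the index levels the two frames share (Body and Season)
joined = df1.join(df2)

# The original approach from the question: move levels to columns, join, restore index
roundtrip = (df1.reset_index()
                .join(df2, on=['Body', 'Season'])
                .set_index(df1.index.names))

# Compare after sorting so that row order cannot affect the result
same = joined.sort_index().equals(roundtrip.sort_index())
print(same)
```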

edited Feb 15 '22 at 14:51

answered May 15 '20 at 15:43

CAPSLOCK


@Caleb, maybe you can check this one and update the answer. It looks right. – Chris Browne May 19 '20 at 07:39
1

You need to be careful with `join`. The `on` keyword doesn't actually matter for anything in the `other` that you join to it and is only used for the DataFrame calling the method. So you must make sure the `on` columns are specified in exactly the order that match the Index of the `other` DataFrame – ALollz Feb 11 '22 at 18:35
1

So it doesn't actually do anything with common index levels. It matches the order you specify the labels in `on` of the calling DataFrame with the Index _as it is_ in the other DataFrame you join to it. The easiest way to see this is that `df1.join(df2, on=['Season','Body'])` where all you do is switch the on ordering gives you the wrong answer because that is technically trying to match ['Season', 'Body'] of df1 (order you specify) with ['Body', 'Season'] of df2 (order of df2's index) – ALollz Feb 11 '22 at 18:43

score 13 · Answer 2 · edited May 23 '17 at 11:50

This is not implemented internally at the moment, but your solution is the recommended one; see here, as well as the issue.

You can simply wrap this in a function if you want to make it look nicer. reset_index/set_index do copy by default, though you can pass inplace=True if you want; that IS truly in place, as these calls are just changing the index attribute.

You could patch in a nice function like:

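import pandas as pd

def merge_multi(self, df, on):
    # Move the index levels into columns, join on them, then restore the original index
    return self.reset_index().join(df, on=on).set_index(self.index.names)

# Attach the helper to DataFrame so it can be called as a method
pd.DataFrame.merge_multi = merge_multi

df1.merge_multi(df2, on=['Body', 'Season'])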
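If patching DataFrame is undesirable, the same idea can be sketched as a plain function. The function name merge_multi_levels and the seeded example data are hypothetical; the body is the same reset_index/join/set_index round trip from the question.

```python
import numpy as np
import pandas as pd


def merge_multi_levels(left, right, on):
    """Left-join `right` onto `left`, matching the index levels of `left` named in `on`
    against the index of `right` (the order of `on` must follow right's index levels)."""
    return left.reset_index().join(right, on=on).set_index(left.index.names)


# Rebuild the example frames from the question (values are seeded for repeatability)
index = pd.MultiIndex.from_tuples(
    [('sun', 'summer', 'one'), ('sun', 'winter', 'one'), ('sun', 'winter', 'two'),
     ('moon', 'summer', 'one'), ('moon', 'summer', 'two'), ('moon', 'summer', 'three'),
     ('moon', 'winter', 'one'), ('moon', 'winter', 'two')],
    names=['Body', 'Season', 'Item'])
df1 = pd.DataFrame(np.random.default_rng(0).standard_normal((8, 2)),
                   index=index, columns=['A', 'B'])

index2 = pd.MultiIndex.from_tuples(
    [('sun', 'summer'), ('sun', 'winter'), ('moon', 'summer'), ('moon', 'winter')],
    names=['Body', 'Season'])
df2 = pd.DataFrame(['Good', 'Bad', 'Ugly', 'Confused'], index=index2, columns=['Mood'])

result = merge_multi_levels(df1, df2, on=['Body', 'Season'])
print(result)
```

A plain function leaves the DataFrame class untouched, which avoids surprises for other code that shares the same pandas import.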

However, merging by definition creates new data, so not sure how much this will actually save you.
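One possible alternative, not part of the original answer, applies when only a single lookup column is needed: drop the extra level from the large frame's index, look those keys up in the lookup table with reindex, and assign the values positionally. This is a sketch under the assumption that the lookup table's index is unique; the names keys, mood and result and the seeded data are illustrative.

```python
import numpy as np
import pandas as pd

# Rebuild the example frames from the question (values are seeded for repeatability)
index = pd.MultiIndex.from_tuples(
    [('sun', 'summer', 'one'), ('sun', 'winter', 'one'), ('sun', 'winter', 'two'),
     ('moon', 'summer', 'one'), ('moon', 'summer', 'two'), ('moon', 'summer', 'three'),
     ('moon', 'winter', 'one'), ('moon', 'winter', 'two')],
    names=['Body', 'Season', 'Item'])
df1 = pd.DataFrame(np.random.default_rng(0).standard_normal((8, 2)),
                   index=index, columns=['A', 'B'])

index2 = pd.MultiIndex.from_tuples(
    [('sun', 'summer'), ('sun', 'winter'), ('moon', 'summer'), ('moon', 'winter')],
    names=['Body', 'Season'])
df2 = pd.DataFrame(['Good', 'Bad', 'Ugly', 'Confused'], index=index2, columns=['Mood'])

# One (Body, Season) key per row of df1, in df1's row order
keys = df1.index.droplevel('Item')

# Look each key up in the lookup table (assumes df2's index has no duplicates)
mood = df2['Mood'].reindex(keys)

# Assign by position so df1's full three-level index is kept as is
result = df1.assign(Mood=mood.to_numpy())
print(result)
```

Keys that are absent from the lookup table come back as missing values, which mirrors the behaviour of a left join.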

A better method is to build up smaller frames, then do a larger merge. You also might want to do something like this

answered May 29 '14 at 15:58

Jeff


3

Is this still the case? – william_grisaitis May 10 '18 at 13:20
3

Just realized a long time after my original question that I never selected an answer to this. I also wonder if any better canonical solutions have been developed since this answer, but I'll give this one a belated check mark for now. – Caleb May 22 '18 at 17:17
