2

I have two data frames with different resolution MultiIndex values. The first data frame tracks state, year, and hour variables, while the second data frame only tracks state and hour variables. How can I join the two data frames such that the second data frame is repeated for each year? Here is a small reprex:

#!/usr/bin/env python
import pandas as pd
import numpy as np

YEARS = np.arange(2015, 2055, 5)
HOURS = np.arange(1, 8761, 1)

idx0 = pd.MultiIndex.from_product([['NY'], YEARS, HOURS], names=['state', 'year', 'hour'])
df0 = pd.DataFrame(
    np.random.rand(70080, 1),
    columns=['TOTALLOAD'],
    index=idx0
)
#                  TOTALLOAD
# state year hour
# NY    2015 1      0.283680
#            2      0.475532
#            3      0.309519
#            4      0.360841
#            5      0.835909
# ...                    ...
#       2050 8756   0.152357
#            8757   0.431416
#            8758   0.694271
#            8759   0.766727
#            8760   0.365636
# [70080 rows x 1 columns]

idx1 = pd.MultiIndex.from_product([['NY'], HOURS], names=['state', 'hour'])
df1 = pd.DataFrame(
    np.random.rand(8760, 2),
    columns=['WIND', 'SOLAR'],
    index=idx1
)
#                 WIND     SOLAR
# state hour
# NY    1     0.835902  0.488313
#       2     0.066278  0.137335
#       3     0.600126  0.526955
#       4     0.661116  0.464443
#       5     0.547492  0.840770
# ...              ...       ...
#       8756  0.786757  0.829725
#       8757  0.753301  0.985739
#       8758  0.856252  0.196507
#       8759  0.322048  0.650752
#       8760  0.798451  0.050394
# [8760 rows x 2 columns]


# Expected Dataframe
df2 = pd.DataFrame(
    np.random.rand(70080, 3),
    columns=['TOTALLOAD', 'WIND', 'SOLAR'],
    index=idx0
)
#                  TOTALLOAD      WIND     SOLAR
# state year hour
# NY    2015 1      0.108648  0.767488  0.391436
#            2      0.012284  0.616722  0.314657
#            3      0.954200  0.666713  0.112874
#            4      0.078314  0.170395  0.697167
#            5      0.893799  0.356095  0.706867
# ...                    ...       ...       ...
#       2050 8756   0.233540  0.183593  0.178483
#            8757   0.473917  0.662912  0.135068
#            8758   0.124253  0.784875  0.528687
#            8759   0.350463  0.107292  0.446881
#            8760   0.223953  0.689475  0.425816
#
# [70080 rows x 3 columns]
dylanjm
  • 2,011
  • 9
  • 21

1 Answers1

2

Simply use:

df0.join(df1)

Because the MultiIndexes have two common levels, the join will be performed on those two and broadcasted to the third one by duplicating the rows.

Example:

>>> df0.join(df1).head()
                 TOTALLOAD      WIND     SOLAR
state hour year                               
NY    1    2015   0.164803  0.406091  0.822639
           2020   0.805777  0.406091  0.822639
           2025   0.480883  0.406091  0.822639
           2030   0.284095  0.406091  0.822639
           2035   0.811343  0.406091  0.822639
mozway
  • 194,879
  • 13
  • 39
  • 75
  • 1
    Wow, very simple solution. I noticed this changes the order of the index columns, does order of the index typically matter at all? – dylanjm Nov 12 '21 at 23:15
  • 1
    @dylanjm it depends on your usecase. If you care, it matters. If you don't mind, it matters not in the slightest. –  Nov 13 '21 at 00:24