In pandas, I can reindex() a dataframe with a MultiIndex to make the date range consistent for each group. Is there any way to produce the same result in polars?

See example below using pandas:

import pandas as pd
data = pd.DataFrame({
             "date":pd.date_range("2022-01-01", "2022-06-01", freq="MS"),
             "group":["A", "A", "A", "B", "B", "B"],
             "value":[10,20,30,40,50,60]
     }).set_index(["group", "date"])
new_index = pd.MultiIndex.from_product([data.index.levels[0].tolist(), data.index.levels[1].tolist()], names=["group", "date"])
data.reindex(new_index)

which transforms the data from:

                  value
group date             
A     2022-01-01     10
      2022-02-01     20
      2022-03-01     30
B     2022-04-01     40
      2022-05-01     50
      2022-06-01     60

to the result below, where both groups share the same date range:

                  value
group date             
A     2022-01-01   10.0
      2022-02-01   20.0
      2022-03-01   30.0
      2022-04-01    NaN
      2022-05-01    NaN
      2022-06-01    NaN
B     2022-01-01    NaN
      2022-02-01    NaN
      2022-03-01    NaN
      2022-04-01   40.0
      2022-05-01   50.0
      2022-06-01   60.0

1 Answer

As you may have read, polars does not use an index.

In this case, your new_index is a cross join of the unique values of your index columns, which can easily be reproduced in polars.

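import polars as pl
from datetime import datetime

# eager=True makes date_range return a Series (recent polars versions return an expression otherwise)
pldata=pl.DataFrame({
            "date":pl.date_range(datetime(2022,1,1), datetime(2022,6,1),'1mo', eager=True),
            "group":["A", "A", "A", "B", "B", "B"],
            "value":[10,20,30,40,50,60]
    })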

# every (date, group) combination: cross join of the unique values of each column
plnew_index = pldata.select(pl.col('date').unique()) \
        .join(
              pldata.select(pl.col('group').unique()),
              how='cross'
         )

Then, instead of the reindex command, you do another join, this time an outer join on those columns, followed by a sort to get your order back:

pldata.join(plnew_index, on=['date','group'], how='outer').sort(['group','date'])

You can write a helper function that builds plnew_index for an arbitrary number of index columns:

def make_plindex(df, indexcols):
    newdf=df.select(pl.col(indexcols[0]).unique())
    for curcol in indexcols[1:]:
        newdf=newdf.join(df.select(pl.col(curcol).unique()), how='cross')
    return newdf

And, of course, if you don't care about the intermediate df, you can extend that function by putting the outer join inside it so that it returns the final result:

def make_nullrows(df, indexcols):
    newdf=df.select(pl.col(indexcols[0]).unique())
    for curcol in indexcols[1:]:
        newdf=newdf.join(df.select(pl.col(curcol).unique()), how='cross')
    return df.join(newdf, on=indexcols, how='outer')

Then you can just do

make_nullrows(pldata, ['group','date'])
Dean MacGregor