2

I have a pandas DataFrame like this:

import pandas as pd
import numpy as np

df = pd.DataFrame({
    'date': pd.to_datetime(['2021-06-06']*3 + ['2021-06-07']*3 + ['2021-06-09']*3),
    'cat': ['a', 'b', 'c']*3,
    'value': np.random.randn(9)
    })

#         date cat     value
# 0 2021-06-06   a  1.182969
# 1 2021-06-06   b  0.460896
# 2 2021-06-06   c  0.859053
# 3 2021-06-07   a  0.259080
# 4 2021-06-07   b -1.333531
# 5 2021-06-07   c  1.258963
# 6 2021-06-09   a  0.254633
# 7 2021-06-09   b  0.250557
# 8 2021-06-09   c  1.363891

I would like to fill in the date(s) that are missing in the date column, but for all unique values from the cat column. For value column, it is okey to leave there NaN. The following code seems to achieve what I need, but I was just wondering if there is an easier way to do the same.

from itertools import product

df_out = pd.DataFrame(
    product(
        pd.date_range(df.date.min(), df.date.max(), freq="D"),
        df.cat.unique()
        )
    )\
    .set_axis(["date", "cat"], axis=1)\
    .merge(df, how="left", on=["date", "cat"])
    
#          date cat     value
# 0  2021-06-06   a  1.090784
# 1  2021-06-06   b -0.479246
# 2  2021-06-06   c  1.886213
# 3  2021-06-07   a  1.477877
# 4  2021-06-07   b -0.026864
# 5  2021-06-07   c -1.432626
# 6  2021-06-08   a       NaN
# 7  2021-06-08   b       NaN
# 8  2021-06-08   c       NaN
# 9  2021-06-09   a  0.055170
# 10 2021-06-09   b -2.060127
# 11 2021-06-09   c  0.402082
Jakub.Novotny
  • 2,912
  • 2
  • 6
  • 21
  • 1
    instead of a merge you can create a multiindex using the product method and then use `df.reindex` after the groupby. https://stackoverflow.com/questions/27421256/re-index-dataframe-by-new-range-of-dates – Ian Wright Jun 09 '21 at 15:38
  • I looked at the answers there, but to be honest, I was not sure how to set the multiindex, so I will just go with @Shubham Sharma's solution. Appreciate the effort though. – Jakub.Novotny Jun 09 '21 at 16:00

1 Answers1

4

We can pivot the dataframe to reshape it into wide format, then reindex the dataframe using asfreq in order to fill the missing dates then stack the frame to reshape it back to long format

df.pivot('date', 'cat').asfreq('D').stack(dropna=False).reset_index()

         date cat     value
0  2021-06-06   a  1.312101
1  2021-06-06   b  0.427093
2  2021-06-06   c  0.889593
3  2021-06-07   a -0.145082
4  2021-06-07   b -0.262260
5  2021-06-07   c  0.827072
6  2021-06-08   a       NaN
7  2021-06-08   b       NaN
8  2021-06-08   c       NaN
9  2021-06-09   a  0.273335
10 2021-06-09   b -0.201577
11 2021-06-09   c -1.294582
Shubham Sharma
  • 68,127
  • 6
  • 24
  • 53
  • For this example dataset, this works swimmingly, but for some reason, in my real usecase, the asfreq('D') produced all NaNs for the value column. Do you happen to know what might cause that? – Jakub.Novotny Jun 09 '21 at 15:53
  • 1
    Get it working, now, thanks a lot. My date column in the real usecase was a string, not date. Honestly, I would expect pandas throwing an error, and not to produce all NaNs in that case. Anyway, thanks for a neat solution. – Jakub.Novotny Jun 09 '21 at 15:56
  • 1
    @Jakub.Novotny Make sure you have converted the `date` column to pandas `datetime` type – Shubham Sharma Jun 09 '21 at 15:57