Pandas dataframe: Fill in interpolated date values per category

Question

I'm working in Python 3 with a Pandas DataFrame. It has columns for Category, Date and Value. For each category, I want to add rows with the missing days, such the value is linearly interpolated.

To create the minimal example, I use the following code

df = pd.DataFrame({
'cat':['A', 'A', 'A', 'A', 'B', 'B', 'B'],
'date': ['2021-1-1', '2021-1-4', '2021-1-5', '2021-1-7', '2021-11-1', '2021-11-2', '2021-11-5'],
'value': [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 9.0]})

df['cat'] = df['cat'].astype('category')
df['date'] = df['date'].astype('datetime64')

Which gives the following dataframe

cat date          value
A   2021-01-01    1.0
A   2021-01-04    2.0
A   2021-01-05    3.0
A   2021-01-07    4.0
B   2021-11-01    5.0
B   2021-11-02    6.0
B   2021-11-05    9.0

I would like the output to be like this example, Where I used '<' to indicate the newly inserted rows

cat date          value
A   2021-01-01    1.0
A   2021-01-02    1.333  <
A   2021-01-03    1.667  <
A   2021-01-04    2.0
A   2021-01-05    3.0
A   2021-01-06    3.5    <
A   2021-01-07    4.0
B   2021-11-01    5.0
B   2021-11-02    6.0
B   2021-11-03    7.0    <
B   2021-11-04    8.0    <
B   2021-11-05    9.0

In the actual problem, I don't want the weekend-days (Saturdays and Sundays), but I've stated the problem like above to prevent adding extra layers (I can easily filter the weekend-days out later if needed). However, not including them in the first place may allow for more efficient code, so I'd thought I'd mention this snag as well. Thanks for any help!

jezrael · Accepted Answer · 2021-11-09T11:17:28.077

Use DataFrame.groupby with resample or Series.asfreq for missing values for days and then interpolate per groups in lambda function:

df = (df.set_index('date')
        .groupby('cat')['value']
        .apply(lambda x: x.asfreq('d').interpolate())
        .reset_index())
print (df)
   cat       date     value
0    A 2021-01-01  1.000000
1    A 2021-01-02  1.333333
2    A 2021-01-03  1.666667
3    A 2021-01-04  2.000000
4    A 2021-01-05  3.000000
5    A 2021-01-06  3.500000
6    A 2021-01-07  4.000000
7    B 2021-11-01  5.000000
8    B 2021-11-02  6.000000
9    B 2021-11-03  7.000000
10   B 2021-11-04  8.000000
11   B 2021-11-05  9.000000

df = (df.set_index('date')
        .groupby('cat')['value']
        .apply(lambda x: x.resample('d').first().interpolate())
        .reset_index())
print (df)
   cat       date     value
0    A 2021-01-01  1.000000
1    A 2021-01-02  1.333333
2    A 2021-01-03  1.666667
3    A 2021-01-04  2.000000
4    A 2021-01-05  3.000000
5    A 2021-01-06  3.500000
6    A 2021-01-07  4.000000
7    B 2021-11-01  5.000000
8    B 2021-11-02  6.000000
9    B 2021-11-03  7.000000
10   B 2021-11-04  8.000000
11   B 2021-11-05  9.000000

Or:

f = lambda x: x.interpolate()
s = df.set_index('date').groupby('cat')['value'].resample('d').first().groupby(level=0).apply(f)
print (s)
cat  date      
A    2021-01-01    1.000000
     2021-01-02    1.333333
     2021-01-03    1.666667
     2021-01-04    2.000000
     2021-01-05    3.000000
     2021-01-06    3.500000
     2021-01-07    4.000000
B    2021-11-01    5.000000
     2021-11-02    6.000000
     2021-11-03    7.000000
     2021-11-04    8.000000
     2021-11-05    9.000000
Name: value, dtype: float64

This worked perfectly. Also helped me learn about the way to think about these type of problems. Thank you! — kerel123, Nov 09 '21 at 12:23

score 1 · Answer 2 · answered Nov 09 '21 at 11:08

You could use a helper function:

def interpolate(d, on='date', vals=['value']):
    return (d.set_index(on).reindex(pd.date_range(d[on].min(), d[on].max()))
             [vals].interpolate()
             .rename_axis(on)
           ) 

df.groupby('cat').apply(interpolate).reset_index()

output:

   cat       date     value
0    A 2021-01-01  1.000000
1    A 2021-01-02  1.333333
2    A 2021-01-03  1.666667
3    A 2021-01-04  2.000000
4    A 2021-01-05  3.000000
5    A 2021-01-06  3.500000
6    A 2021-01-07  4.000000
7    B 2021-11-01  5.000000
8    B 2021-11-02  6.000000
9    B 2021-11-03  7.000000
10   B 2021-11-04  8.000000
11   B 2021-11-05  9.000000

score 1 · Answer 3 · answered Nov 09 '21 at 11:20

An option is a combination of interpolate with complete:

# pip install git+https://github.com/pyjanitor-devs/pyjanitor.git
import pandas as pd
import janitor

dates = dict(date = lambda df: pd.date_range(df.min(), df.max(), freq='1D'))

(df.complete(dates, by='cat', sort = True)
   .assign(value = lambda df: df.value.interpolate())
)
   cat       date     value
0    A 2021-01-01  1.000000
1    A 2021-01-02  1.333333
2    A 2021-01-03  1.666667
3    A 2021-01-04  2.000000
4    A 2021-01-05  3.000000
5    A 2021-01-06  3.500000
6    A 2021-01-07  4.000000
7    B 2021-11-01  5.000000
8    B 2021-11-02  6.000000
9    B 2021-11-03  7.000000
10   B 2021-11-04  8.000000
11   B 2021-11-05  9.000000

complete exposes the missing values,after which we then interpolate on a linear method.

:D I really thought "*now it's only missing the pyjanitor solution, where is sammywemmy?*". Well, here you are ;) — mozway, Nov 09 '21 at 11:21

Pandas dataframe: Fill in interpolated date values per category

3 Answers3