4

Lots of similar questions on here, but I couldn't find any that actually had observations with the same datetime. A minimum non-working example would be:

df = pd.DataFrame(
    {"Date": np.tile([pd.Series(["2016-01", "2016-03"])], 2)[0],
     "Group": [1,1,2,2],
     "Obs":[1,2,5,6]})

Now I'd like to linearly interpolate the value for February 2016 by group, so the required output is

    Date    Group   Obs
    2016-01     1       1
    2016-02     1     1.5
    2016-03     1       2
    2016-01     2       5
    2016-02     2     5.5
    2016-03     2       6

My understanding is that resample should be able to do this (in my actual application I'm trying to move from quarterly to monthly, so have observations in Jan and Apr), but that requires some sort of time index, which I can't do as there are duplicates in the Date column.

I'm assuming some sort of groupby magic could help, but can't figure it out!

Nils Gudat
  • 13,222
  • 3
  • 39
  • 60

2 Answers2

4

You can use:

#convert column Date to datetime
df['Date'] = pd.to_datetime(df.Date)
print (df)
        Date  Group  Obs
0 2016-01-01      1    1
1 2016-03-01      1    2
2 2016-01-01      2    5
3 2016-03-01      2    6

#groupby, resample and interpolate
df1 = df.groupby('Group').apply(lambda x : x.set_index('Date')
                                            .resample('M')
                                            .first()
                                            .interpolate())
                        .reset_index(level=0, drop=True).reset_index()

#convert Date to period
df1['Date'] = df1.Date.dt.to_period('M')
print (df1)
     Date  Group  Obs
0 2016-01    1.0  1.0
1 2016-02    1.0  1.5
2 2016-03    1.0  2.0
3 2016-01    2.0  5.0
4 2016-02    2.0  5.5
5 2016-03    2.0  6.0

EDIT:

Pandas API was changed (0.18.1), so now you can use:

df['Date'] = pd.to_datetime(df.Date)
df.set_index('Date', inplace=True)

df1 = df.groupby('Group').apply(lambda df1: df1.resample('M')
                                               .first()
                                               .interpolate())
                         .reset_index(level=0, drop=True).reset_index()

df1['Date'] = df1.Date.dt.to_period('M')
print (df1)
     Date  Group  Obs
0 2016-01    1.0  1.0
1 2016-02    1.0  1.5
2 2016-03    1.0  2.0
3 2016-01    2.0  5.0
4 2016-02    2.0  5.5
5 2016-03    2.0  6.0
jezrael
  • 822,522
  • 95
  • 1,334
  • 1,252
  • Slightly unfair as this wasn't in the original question, but this method fails when grouping by more than one category (I'm presuming this has to do with the resetting of the index done at the end) – Nils Gudat May 18 '16 at 11:00
  • I think you can add level, e.g. if groupby by two categories `.reset_index(level=[0,1], drop=True)` – jezrael May 18 '16 at 11:03
2

Edit: replaced resample with reindex for a 2x speed improvement.

df.set_index('Date', inplace=True)
index = ['2016-01', '2016-02', '2016-03']

df.groupby('Group').apply(lambda df1: df1.reindex(index).interpolate())

Using groupby is easy once you understand it just returns one dataframe (here df1) per value in the grouping column.

IanS
  • 15,771
  • 9
  • 60
  • 84
  • Does the job, but for some reason in my actual application this shifts the dates to the end of the month (i.e. the value for 2016-01 has an index of 2016-01-31 rather than 2016-01-01) - any idea why this would happen? – Nils Gudat May 18 '16 at 10:44
  • 1
    I've noticed the same. You can use Jezrael's workaround: `.dt.to_period` :) – IanS May 18 '16 at 10:46
  • That'll do. One last thing: is there a way to speed this up? My actual example is ~500k rows (growing to 1.5m when moving to monthly), with 4 columns to group by, so this takes a couple of minutes on my system. – Nils Gudat May 18 '16 at 10:54
  • For me your solution return `FutureWarning: .resample() is now a deferred operation use .resample(...).mean() instead of .resample(...)` in version `0.18.1`. I think it can works well in version `0.17.2`. What is your version of pandas? – jezrael May 18 '16 at 11:10
  • Interesting... I have pandas 0.17.1. – IanS May 18 '16 at 11:28
  • @NilsGudat I have updated my answer: `reindex` is faster than `resample`. Note that I no longer convert dates to datetime objects. – IanS May 18 '16 at 12:02
  • Faster, but requires that you can actually pass the new index beforehand (so I'd need to know which of my groups have missing rows), which isn't really feasible in my case. – Nils Gudat May 18 '16 at 12:22