6

I was answering another question here with something about pandas I thought to know, time series resampling, when I noticed this odd binning.

Let's say I have a dataframe with a daily date range index and a column I want to resample and sum on.

index = pd.date_range(start="1/1/2018", end="31/12/2018") 
df = pd.DataFrame(np.random.randint(100, size=len(index)), 
                  columns=["sales"], index=index)

>>> df.head()
            sales
2018-01-01     66
2018-01-02     18
2018-01-03     45
2018-01-04     92
2018-01-05     76

Now I resample by one month, everything looks fine:

>>>df.resample("1M").sum()

            sales
2018-01-31   1507
2018-02-28   1186
2018-03-31   1382
[...]
2018-11-30   1342
2018-12-31   1337

If I try to resample by more months though binning starts to look off. This is particularly evident with 6M

df.resample("6M").sum()                                                           
            sales
2018-01-31   1507
2018-07-31   8393
2019-01-31   7283

First bin spans just over one month, last bin goes one month to the future. Maybe I have to set closed="left" to get the proper limits:

df.resample("6M", closed="left").sum()                                            
            sales
2018-06-30   8090
2018-12-31   9054
2019-06-30     39

Now I have an extra bin in 2019 with data from 2018-12-31...

Is this working properly? am I missing any option I should set?

EDIT: here's the output I would expect resampling one year in six month intervals, first interval spanning from Jan 1st to Jun 30, second interval spanning from Jul 1st to Dec 31.

df.resample("6M", closed="left").sum()                                            
            sales
2018-06-30   8090
2018-12-31   9093 # 9054 + 39

Note that there's also some doubt here about what it's happening with June 30 data, does it go in the first bin like I would expect or the second? I mean with the last bin it's evident but the same is probably happening in all the bins.

filippo
  • 5,197
  • 2
  • 21
  • 44
  • 2
    Maybe help [this](https://stackoverflow.com/a/48342103) – jezrael Dec 10 '18 at 06:40
  • @jezrael thanks looked at it briefly, played with `closed`, `label` and `loffset` but I don't think there's a way to get the behavior I'd expect here... – filippo Dec 10 '18 at 07:03
  • Can you add to question your expected output? I think datetimes, if possible also aggregated numbers – jezrael Dec 10 '18 at 07:07

2 Answers2

5

The M time offset alias implies month end frequency. What you need is 6MS which is an alias for month start frequency:

df.resample('6MS').sum()

resulting in

            sales
2018-01-01   8130
2018-07-01   9563
2019-01-01      0

Also df.groupby(pd.Grouper(freq='6MS')).sum() can be used interchangeably.


For extra clarity you can compare ranges directly:

>>> pd.date_range('2018-01-01', '2018-12-31', freq='6M')
DatetimeIndex(['2018-01-31', '2018-07-31'], dtype='datetime64[ns]', freq='6M')

>>> pd.date_range('2018-01-01', '2018-12-31', freq='6MS')
DatetimeIndex(['2018-01-01', '2018-07-01'], dtype='datetime64[ns]', freq='6MS')
ayorgo
  • 2,803
  • 2
  • 25
  • 35
3

Adding np.random.seed(365) to check both our outputs.

print(df.resample("6M", kind='period').sum())

         sales
2018-01   8794
2018-07   9033

would this work for you?

Zanshin
  • 1,262
  • 1
  • 14
  • 30
  • cool, I think the other answer solves my problem, but this is probably more useful for the task I needed it for, thanks! – filippo Dec 10 '18 at 10:52
  • don't see how it does, but okay. the first day of 2019 is still showing in the other answer. though being 0 – Zanshin Dec 10 '18 at 10:58
  • Sure but it explains where the binning logic was at fault: it's not enough to say which side the of the interval is closed, you also need to say if you want it to start at the first day of the month or the last. Using `"period"` solves everything doing the proper thing you'd expect and I like it way more, but it doesn't explain the odd behaviour I was asking about. Also changes the index type. – filippo Dec 10 '18 at 11:03