timestamp binning mechanics when resampling

Question

I'm not quite clear on how bin-membership in DataFrame.resample is determined.

Example/actual output:

>>> df = pd.DataFrame(index=pd.date_range(start='2021-04-21 01:00:00', end='2021-04-28 01:00', freq='1d'), data=[1]*8)
>>> df
                     0
2021-04-21 01:00:00  1
2021-04-22 01:00:00  1
2021-04-23 01:00:00  1
2021-04-24 01:00:00  1
2021-04-25 01:00:00  1
2021-04-26 01:00:00  1
2021-04-27 01:00:00  1
2021-04-28 01:00:00  1
>>> df.resample(rule='7d', origin='2021-04-29 00:00:00', closed='right', label='right').sum() 
            0
2021-04-22  2
2021-04-29  6

Expected output:

            0
2021-04-22  1
2021-04-29  7

Reasoning:

I expected pandas to create the two bins

(2021-04-15 00:00:00, 2021-04-22 00:00:00]
(2021-04-22 00:00:00, 2021-04-29 00:00:00]

and the timestamp 2021-04-21 01:00:00 to fall into the first bin, while 2021-04-22 01:00:00 and the remaining timestamps should fall into the second bin.

edit: I just realized that using 24*7=168 hours instead of 7 days yields the expected result. Why?!

>>> df.resample(rule='168h', origin='2021-04-22 00:00:00', closed='right', label='right').sum() 
            0
2021-04-22  1
2021-04-29  7

I'm using pandas 1.3.5

@Corralien not completely, and I did not want to pester you with further follow up questions in the comments. — actual_panda, Dec 21 '21 at 13:52

score 1 · Answer 1 · answered Dec 21 '21 at 14:46

To understand what happens, I added a debug line to the pandas source code:

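def _get_time_bins(self, ax: DatetimeIndex):
    # ... (method body omitted here) ...

    # XXX: Debug - pandas/core/resample.py#L1630
    # Print the intermediate values just before the method returns.
    print(f"binner: {binner}\nbins: {bins}\nlabels: {labels}\nbin_edges: {bin_edges}")

    return binner, bins, labels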

Your attempt:

>>> df.resample(rule='7d', origin='2021-04-29 00:00:00', closed='right', label='right').sum()

# Debug
binner: DatetimeIndex(['2021-04-15', '2021-04-22', '2021-04-29'], dtype='datetime64[ns]', freq='7D')
bins: [2 8]
labels: DatetimeIndex(['2021-04-22', '2021-04-29'], dtype='datetime64[ns]', freq='7D')
bin_edges: [1618531199999999999 1619135999999999999 1619740799999999999]

# Result
            0
2021-04-22  2
2021-04-29  6
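
The bin_edges values are easier to read once decoded. Here is a minimal sketch, assuming they are int64 nanoseconds since the Unix epoch; the variable names are mine, not from the pandas source:

```python
import pandas as pd

# bin_edges copied from the debug output for rule='7d', closed='right'.
# Assumption: these are int64 nanoseconds since the Unix epoch.
edges_7d_right = [1618531199999999999, 1619135999999999999, 1619740799999999999]

# pd.Timestamp interprets a bare integer as nanoseconds since the epoch.
decoded = [pd.Timestamp(edge) for edge in edges_7d_right]

for ts in decoded:
    print(ts)
# 2021-04-15 23:59:59.999999999
# 2021-04-22 23:59:59.999999999
# 2021-04-29 23:59:59.999999999
```

Each edge lies one nanosecond before the midnight that follows the corresponding binner date. The first right-closed bin therefore reaches to the end of 2021-04-22 and captures both 2021-04-21 01:00 and 2021-04-22 01:00, which matches the sum of 2.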

To obtain the expected result, use closed='left':

>>> df.resample(rule='7d', origin='2021-04-29 00:00:00', closed='left', label='right').sum()

# Debug
binner: DatetimeIndex(['2021-04-15', '2021-04-22', '2021-04-29'], dtype='datetime64[ns]', freq='7D')
bins: [1 8]
labels: DatetimeIndex(['2021-04-22', '2021-04-29'], dtype='datetime64[ns]', freq='7D')
bin_edges: [1618444800000000000 1619049600000000000 1619654400000000000]

# Result
            0
2021-04-22  1
2021-04-29  7

Regarding your edit ("I just realized that using 24*7=168 hours instead of 7 days yields the expected result. Why?!"):

>>> df.resample(rule='168h', origin='2021-04-22 00:00:00', closed='right', label='right').sum()

# Debug
binner: DatetimeIndex(['2021-04-15', '2021-04-22', '2021-04-29'], dtype='datetime64[ns]', freq='168H')
bins: [1 8]
labels: DatetimeIndex(['2021-04-22', '2021-04-29'], dtype='datetime64[ns]', freq='168H')
bin_edges: [1618444800000000000 1619049600000000000 1619654400000000000]

# Result
            0
2021-04-22  1
2021-04-29  7

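In fact, I think pandas first truncates the datetimes according to the unit in the rule ('D' or 'H'), which I suppose is why the behavior for '7D' and '168H' differs. The debug output points in that direction: with '7D' and closed='right' the bin_edges end in ...999999999, i.e. they sit at the very end of the day, whereas with '168H' they stay at midnight. Maybe you should open an issue on GitHub.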
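
The bins arrays printed above can be reproduced from the bin_edges with a plain sorted search. This is only a sketch of my reading of the output, not the pandas implementation: cumulative_counts and to_ns are hypothetical helpers, and the rule "count values at or before each upper edge for closed='right', strictly before it for closed='left'" is an assumption that happens to match the three debug outputs.

```python
from bisect import bisect_left, bisect_right
from datetime import datetime, timedelta, timezone

NS_PER_SECOND = 10**9


def to_ns(dt):
    """Convert a whole-second UTC datetime to integer nanoseconds since the epoch."""
    return int(dt.timestamp()) * NS_PER_SECOND


# The eight daily timestamps from the question: 2021-04-21 01:00 ... 2021-04-28 01:00.
start = datetime(2021, 4, 21, 1, tzinfo=timezone.utc)
values = [to_ns(start + timedelta(days=i)) for i in range(8)]


def cumulative_counts(values, bin_edges, closed):
    """Hypothetical helper: for every upper edge, count the sorted values that lie
    at or before it (closed='right') or strictly before it (closed='left')."""
    search = bisect_right if closed == "right" else bisect_left
    return [search(values, edge) for edge in bin_edges[1:]]


# bin_edges copied from the debug outputs above.
edges_7d_right = [1618531199999999999, 1619135999999999999, 1619740799999999999]
edges_midnight = [1618444800000000000, 1619049600000000000, 1619654400000000000]

print(cumulative_counts(values, edges_7d_right, "right"))  # [2, 8] -> sums 2 and 6
print(cumulative_counts(values, edges_midnight, "left"))   # [1, 8] -> sums 1 and 7
print(cumulative_counts(values, edges_midnight, "right"))  # [1, 8] -> sums 1 and 7
```

The search itself behaves the same way in all three cases, so the different result for '7D' with closed='right' comes entirely from the shifted edges, not from how the timestamps are assigned to bins.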

Yeah I don't get why the timestamps are truncated before they are sorted into the correct bin. — actual_panda, Dec 21 '21 at 14:54
