I am seeing inconsistent return types from pandas groupby-resample.

Take this dataframe, in which category A has samples on the first and second day and category B has a sample only on the second day:

import pandas as pd

df1 = pd.DataFrame(index=pd.DatetimeIndex(
    ['2022-1-1 1:00','2022-1-2 1:00','2022-1-2 1:00']),
    data={'category':['A','A','B']})

# Output:
#                    category
#2022-01-01 01:00:00        A
#2022-01-02 01:00:00        A
#2022-01-02 01:00:00        B

When I group by category and resample, I get a Series with a MultiIndex on category and time:

res1 = df1.groupby('category').resample('1D').size()

#Output: 
#category            
#A         2022-01-01    1
#          2022-01-02    1
#B         2022-01-02    1
#dtype: int64

But if I add one more data point so that B also has a sample on day 1, the return value is a DataFrame with a single-level index on category and one column per time bin:

df2 = pd.DataFrame(index=pd.DatetimeIndex(
    ['2022-1-1 1:00','2022-1-2 1:00','2022-1-2 1:00','2022-1-1 1:00']),
    data={'category':['A','A','B','B']})

res2 = df2.groupby('category').resample('1D').size()

# Output:
#          2022-01-01  2022-01-02
# category                        
# A                  1           1
# B                  1           1

Is this expected behavior? I reproduced it in pandas 1.4.2 and was unable to find an existing bug report.

Jen

2 Answers

I submitted bug report 46826 to pandas.

Jen

The result should be a Series with a MultiIndex in both cases. There was a bug that caused df.groupby(...).resample(...).size() to return a wide DataFrame when all groups had the same index. This has been fixed on the master branch. Thank you for opening the issue.
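
For anyone stuck on an affected version, one possible workaround (not part of the fix described above) is to skip resample and group on the category column together with a pd.Grouper time bin. A minimal sketch follows; the helper name daily_counts and its parameters are hypothetical, chosen only for illustration. Note that this is not a drop-in replacement for resample: it only reports bins that actually contain rows, whereas resample can also emit empty bins inside each group's time range.

```python
import pandas as pd


def daily_counts(df, key="category", freq="1D"):
    """Count rows per (key, time bin) pair.

    Hypothetical helper: grouping on the key column plus a pd.Grouper
    over the DatetimeIndex yields a Series with a two-level MultiIndex,
    regardless of whether the groups share the same time bins.
    """
    return df.groupby([key, pd.Grouper(freq=freq)]).size()


# Same data as df2 in the question: both categories have a sample on both days.
df2 = pd.DataFrame(
    index=pd.DatetimeIndex(
        ["2022-01-01 01:00", "2022-01-02 01:00",
         "2022-01-02 01:00", "2022-01-01 01:00"]
    ),
    data={"category": ["A", "A", "B", "B"]},
)

res2 = daily_counts(df2)
print(res2)
```

With this data the result has one row per (category, day) pair, which matches the shape the question expected from res2.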

dicristina