Pandas: Is it possible to down-sample categorical column?

Question

Consider a DataFrame log like this one:

>>> log
                           state
date_time                       
2020-01-01 00:00:00            0
2020-01-01 00:01:00            0
2020-01-01 00:02:00            0
2020-01-01 00:03:00            1
2020-01-01 00:04:00            1
2020-01-01 00:05:00            1

where the state column can be either 0 or 1 (or missing). If it is represented as UInt8 (the smallest numeric dtype supporting <NA>), one can down-sample the data like this:

>>> log.resample(dt.timedelta(minutes=2)).mean()
                           state
date_time                       
2020-01-01 00:00:00          0.0
2020-01-01 00:02:00          0.5
2020-01-01 00:04:00          1.0

The resampling works just fine, except that the value 0.5 makes no sense, since the state can only be 0 or 1. For the same reason, it makes sense to use category as the dtype for this column. However, in that case the resampling will not work, as the mean() method is only applicable to numerical data.

This makes perfect sense. However, I can imagine a down-sampling and averaging procedure on categorical data where, as long as the data in a group stay identical, the result is that particular value, and otherwise the result is <NA>, like:

categorical_average(['apple', 'apple']) -> 'apple'
categorical_average(['pear', 'pear']) -> 'pear'
categorical_average(['apple', 'pear']) -> <NA>
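
The rule above could be sketched as a plain Python function. This is only a minimal sketch, not a pandas built-in: categorical_average is the hypothetical name used above, and counting a missing value as a distinct value (via dropna=False) is an assumption about the desired behaviour.

```python
import pandas as pd


def categorical_average(values):
    """Return the value shared by all items, or pd.NA if they differ."""
    s = pd.Series(values)
    # nunique(dropna=False) counts a missing value as one more distinct value,
    # so a group mixing a real value with a missing one yields pd.NA.
    if s.nunique(dropna=False) == 1:
        return s.iloc[0]
    return pd.NA


print(categorical_average(['apple', 'apple']))
print(categorical_average(['pear', 'pear']))
print(categorical_average(['apple', 'pear']))
```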

Which for presented DataFrame log with category state column would result in:

>>> log.resample(dt.timedelta(minutes=2)).probably_some_other_method()
                         state
date_time                       
2020-01-01 00:00:00          0
2020-01-01 00:02:00       <NA>
2020-01-01 00:04:00          1

BTW, I am using resample().mean() because there are many other (numerical) columns where it makes perfect sense; I just did not mention them explicitly here for simplicity.

jezrael · Accepted Answer · 2021-02-10T10:51:48.353

1

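Use a custom function that tests whether the group contains exactly one unique value, with if-else:

# keep the value if the whole group shares it, otherwise return a missing value
f = lambda x: x.iat[0] if x.nunique(dropna=False) == 1 else pd.NA
a = log.resample(dt.timedelta(minutes=2)).agg({'state':f})
print(a)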
                    state
date_time                
2020-01-01 00:00:00     0
2020-01-01 00:02:00  <NA>
2020-01-01 00:04:00     1
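
For a frame that also holds numerical columns, the same idea could be combined with mean() in a single agg call. The following is a self-contained sketch under stated assumptions: the value column and the name same_or_missing are hypothetical and only serve the illustration, and the state column is stored as category, as proposed in the question.

```python
import datetime as dt

import pandas as pd

# Rebuild a small log like the one in the question, plus one numeric column.
idx = pd.date_range("2020-01-01", periods=6, freq="min", name="date_time")
log = pd.DataFrame(
    {
        "state": pd.Categorical([0, 0, 0, 1, 1, 1]),
        "value": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],  # hypothetical numeric column
    },
    index=idx,
)


def same_or_missing(x):
    # x is the Series of one resampling bin; keep its value only if unique.
    return x.iat[0] if x.nunique(dropna=False) == 1 else pd.NA


# Per-column aggregation: custom rule for state, ordinary mean for value.
out = log.resample(dt.timedelta(minutes=2)).agg(
    {"state": same_or_missing, "value": "mean"}
)
print(out)
```

Passing a dict to agg keeps the numerical columns on mean() while only the categorical column goes through the custom rule.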

edited Feb 10 '21 at 10:51

answered Feb 10 '21 at 10:46

1

Thank you! For consistency with the other, actually missing values, I only changed pd.NA to np.nan (it looks like there is no NaN in pandas itself), as categorical data use NaN for missing values, while IntegerArray uses NA... – rad Feb 10 '21 at 14:00
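
The difference mentioned in this comment can be checked with a short sketch; the two small Series below are made up for illustration and are not taken from the question.

```python
import pandas as pd

# A categorical Series and a nullable-integer Series, each with one gap.
cat = pd.Series([0, None, 1], dtype="category")
ints = pd.Series([0, None, 1], dtype="UInt8")

# The categorical column reports its gap as NaN, the UInt8 column as <NA>.
print(repr(cat.iloc[1]))
print(repr(ints.iloc[1]))

# pd.isna() treats both markers as missing, so it works for either dtype.
print(pd.isna(cat.iloc[1]), pd.isna(ints.iloc[1]))
```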
