1

This is a follow-up question for the post Python pandas most common value over rolling date window. I have a similar problem for which the given solution doesn't work. Can anyone comment! Thanks

I am trying to get a rolling window majority class for my power state data, so that one-off readings are ignored. I achieved something similar in SQL with the help of this post Apply mode operation on categorical data in SQL. Reproducible data and code are given below.

Pandas Series:

x = pd.Series(['EDC','EDC','EDC','DG','DWN','DWN','EDC','DWN','DWN','DWN','EDC','DWN'],name='Power_State')
x

0     EDC
1     EDC
2     EDC
3      DG
4     DWN
5     DWN
6     EDC
7     DWN
8     DWN
9     DWN
10    EDC
11    DWN
Name: Power_State, dtype: object

Code:

from scipy.stats import mode
x.rolling(window=7).apply(lambda x: mode(x)[0])

Error:

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
C:\ProgramData\Anaconda3\lib\site-packages\pandas\core\window\rolling.py in         _prep_values(self, values)
    322             try:
--> 323                 values = ensure_float64(values)
    324             except (ValueError, TypeError) as err:

pandas\_libs\algos_common_helper.pxi in pandas._libs.algos.ensure_float64()

ValueError: could not convert string to float: 'EDC'

The above exception was the direct cause of the following exception:

TypeError                                 Traceback (most recent call last)
C:\ProgramData\Anaconda3\lib\site-packages\pandas\core\window\rolling.py in _apply_series(self, homogeneous_func, name)
    403         try:
--> 404             values = self._prep_values(obj._values)
    405         except (TypeError, NotImplementedError) as err:

C:\ProgramData\Anaconda3\lib\site-packages\pandas\core\window\rolling.py in     _prep_values(self, values)
    324             except (ValueError, TypeError) as err:
--> 325                 raise TypeError(f"cannot handle this type ->     {values.dtype}") from err
    326 

TypeError: cannot handle this type -> object

The above exception was the direct cause of the following exception:

DataError                                 Traceback (most recent call last)
~\AppData\Local\Temp/ipykernel_6220/2483769170.py in <module>
      1 from scipy.stats import mode
      2 
----> 3 x.rolling(window=7).apply(lambda x: mode(x)[0])
Aamir M. Khan
  • 138
  • 1
  • 8

2 Answers2

1

The problem here is that scipy's mode() function only supports numerical data. However, you can fix that if you can find a way to convert your string into an integer and back again. There is a way to do that, called factorization. Essentially, each unique string is assigned a unique integer. Then, you find the most common integer. The most common integer will correspond to the most common string.

Here's how to factorize the series and do the mode calculation:

from scipy.stats import mode
import pandas as pd
x = pd.Series(['EDC','EDC','EDC','DG','DWN','DWN','EDC','DWN','DWN','DWN','EDC','DWN'],name='Power_State')

def remove_noise(x):
    # Factorize
    y, label = pd.factorize(x)
    y = pd.Series(y)
    label = pd.Series(label)
    # Correct values
    y = y.rolling(window=7, min_periods=1).apply(lambda x: mode(x)[0])
    # Unfactorize
    y = y.map(label)
    return y

remove_noise(x)

(This code is based on this answer.)

This gives me this result:

0     EDC
1     EDC
2     EDC
3     EDC
4     EDC
5     EDC
6     EDC
7     EDC
8     DWN
9     DWN
10    DWN
11    DWN
dtype: object
Nick ODell
  • 15,465
  • 3
  • 32
  • 66
  • I am currently using some similar conversion concept that works: x.map({'EDC':1.0, 'DG':2.0, 'SLR':3.0, 'BAT':4.0, 'DWN':5.0, 'UNDEF':6.0}).rolling(3, center=True).apply(lambda x: x.mode()[0]).map({1.0:'EDC', 2.0:'DG', 3.0:'SLR', 4.0:'BAT', 5.0:'DWN', 6.0:'UNDEF'}) But I have millions of data entries and this dual conversion has a huge toll and its quite time consuming (though I haven't yet timed it). I tried to create my own custom 'mode' function, using groupby() but still no hope. – Aamir M. Khan Jan 01 '22 at 21:29
  • Is it logical to have mode() operator only based on numerical data? Is it the package limitation or there is some underlying reason? – Aamir M. Khan Jan 01 '22 at 21:34
  • @AamirM.Khan I took a second look, and it turns out that Scipy mode() doesn't have the numeric limitation - it's actually a limitation of Pandas rolling(). See here: https://stackoverflow.com/questions/52657429/rolling-with-string-variables – Nick ODell Jan 01 '22 at 23:00
  • In terms of speed: why don't you time it and check? – Nick ODell Jan 01 '22 at 23:02
  • When I use list(df.Power_State.rolling(3, center=True)), it shows me the correct list of window elements from my data. How come then rolling is the issue? For timing, I can do it for pandas but the alternate SQL query is for Google bigquery data and I don't know how to time it over there. Will try to find some way out. – Aamir M. Khan Jan 01 '22 at 23:28
0

You can also leverage the pandas categorical dtype:

create data:

import pandas as pd
x = pd.Series(['EDC','EDC','EDC','DG','DWN','DWN','EDC','DWN','DWN','DWN','EDC','DWN'], name='Power_State')

access the codes (numbers) used for each catergory and apply rolling on them:

y = x.cat.codes.rolling(7, min_periods=1).apply(lambda x: x.mode()[0]).astype("i")

output:

0     2
1     2
2     2
3     2
4     2
5     2
6     2
7     1
8     1
9     1
10    1
11    1
dtype: int32

Convert back to str:

x_new = y.map(lambda _y: x.cat.categories[_y])

or

import numpy as np
x_new = np.take(x.cat.categories, y)

(whichever is faster for you)

output:

0     EDC
1     EDC
2     EDC
3     EDC
4     EDC
5     EDC
6     EDC
7     DWN
8     DWN
9     DWN
10    DWN
11    DWN
dtype: object
n4321d
  • 1,059
  • 2
  • 12
  • 31