I have a rolling-window computation that works in pandas but fails in Koalas, and I would like to understand why:

import pandas as pd
import databricks.koalas as ks

Timestamp = pd.Timestamp

df = pd.DataFrame([[Timestamp('2022-05-18 18:10:50.021831300'), '65.78.97'],
                   [Timestamp('2022-05-24 09:48:39.787426700'), '65.78.97'],
                   [Timestamp('2022-05-24 16:06:18.765405500'), '65.78.97'],
                   [Timestamp('2022-05-25 03:04:01.860841300'), '65.78.97'],
                   [Timestamp('2022-05-26 05:01:08.335874700'), '47.41.196'],
                   [Timestamp('2022-05-31 03:57:15.060167500'), '47.41.196'],
                   [Timestamp('2022-05-31 06:32:37.177199300'), '47.41.196']], 
                  columns=['time', 'ip'])

print(pd.DataFrame(df).groupby('ip').apply(lambda x: pd.Series([1]*len(x), index=x['time']).rolling(window='24h', min_periods=1).sum()))

As expected, this gives the number of sessions from each IP prefix (the last octet has been removed) within the preceding 24 hours:

ip         time                         
47.41.196  2022-05-26 05:01:08.335874700    1.0
           2022-05-31 03:57:15.060167500    1.0
           2022-05-31 06:32:37.177199300    2.0
65.78.97   2022-05-18 18:10:50.021831300    1.0
           2022-05-24 09:48:39.787426700    1.0
           2022-05-24 16:06:18.765405500    2.0
           2022-05-25 03:04:01.860841300    3.0
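
For reference, I believe the same per-IP 24-hour count can also be written in pandas without `apply`, by rolling directly on the grouped data. This is only a minimal sketch: the helper column `n` and the variable name `counts` are made up for illustration, and it assumes the timestamps are already sorted within each IP, as they are in the sample data.

```python
import pandas as pd

# Same sample data as above: session start times and truncated IPs.
df = pd.DataFrame({
    'time': pd.to_datetime([
        '2022-05-18 18:10:50.021831300',
        '2022-05-24 09:48:39.787426700',
        '2022-05-24 16:06:18.765405500',
        '2022-05-25 03:04:01.860841300',
        '2022-05-26 05:01:08.335874700',
        '2022-05-31 03:57:15.060167500',
        '2022-05-31 06:32:37.177199300',
    ]),
    'ip': ['65.78.97'] * 4 + ['47.41.196'] * 3,
})

# 'n' is a hypothetical helper column: one unit per session, so that a
# rolling sum over a 24-hour time window counts the sessions in it.
counts = (
    df.assign(n=1)
      .set_index('time')      # time-based windows need a DatetimeIndex
      .groupby('ip')['n']
      .rolling('24h', min_periods=1)
      .sum()
)

print(counts)
```

The result is indexed by `ip` and `time`, like the `apply` version above.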

However, if I use koalas for the same computation, I get an error:

print(ks.DataFrame(df).groupby('ip').apply(lambda x: ks.Series([1]*len(x), index=x['time']).rolling(window='1d', min_periods=1).sum()))
TypeError: '<' not supported between instances of 'str' and 'int'

Any help or advice will be greatly appreciated!

Lei