
Context: I have a large table with logon times. I want to calculate a rolling count of logons within a specified period (e.g. 3600 sec).

In SQL/HQL I would specify this as:

SELECT id, logon_time, COUNT(*) OVER(
  PARTITION BY id ORDER BY logon_time 
  RANGE BETWEEN 3600 PRECEDING AND CURRENT ROW) as window_count
FROM df

As a workaround in sparklyr, we could write this as:

df %>%
  mutate(window_count = sql('COUNT(*) OVER(
      PARTITION BY id ORDER BY logon_time 
      RANGE BETWEEN 3600 PRECEDING AND CURRENT ROW)'))

I want to know if there's an existing sparklyr/dplyr way of achieving this. The issue about adding the "window_frame" function to dbplyr looks promising; however, I can't find a way to specify "range between" — the operation below only generates a "rows between" window:

df %>%
  group_by(id) %>%
  window_frame(-3600, 0) %>%
  window_order(logon_time) %>%
  mutate(windowed_count = n())
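You can confirm what SQL `window_frame()` emits without a live Spark connection by rendering the query against a simulated backend. A minimal sketch using dbplyr's `lazy_frame()` and `simulate_hive()` (the lazy frame stands in for the Spark table; column names match the example above):

```r
library(dplyr)
library(dbplyr)

# Stand-in for the Spark table, against a simulated Hive connection
df <- lazy_frame(id = 1L, logon_time = 1, con = simulate_hive())

df %>%
  group_by(id) %>%
  window_frame(-3600, 0) %>%
  window_order(logon_time) %>%
  mutate(windowed_count = n()) %>%
  show_query()
```

The rendered query contains `ROWS BETWEEN 3600 PRECEDING AND CURRENT ROW`, i.e. a row-count frame rather than the `RANGE` frame needed here.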
rookie error

1 Answer


I would create an index column using row_number(), then use ft_bucketizer() with splits at fixed row intervals, and lastly group_by() and tally() to get the count.

Here's an example:

split_by <- c(1, ((1:9 * 10) + 1), 100)

sample_flights %>%
  head(100) %>%
  mutate(order = as.double(row_number(DayofMonth))) %>%
  ft_bucketizer("order", "bucket", splits = split_by) %>%
  group_by(bucket) %>%
  tally()
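For reference, the splits vector above evaluates to bucket boundaries every 10 rows over the 100-row sample (to approximate the question's 3600-second window, the boundaries would be chosen accordingly):

```r
# Bucket boundaries: 1, then 11, 21, ..., 91, then 100
split_by <- c(1, ((1:9 * 10) + 1), 100)
split_by
#> [1]   1  11  21  31  41  51  61  71  81  91 100
```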
edgararuiz