
I have a df like the one in the example below; it is sorted in descending timestamp order. I want to calculate, per row, the number of txns in the last hour. It would be at least 1 (the current row), but could of course be more.

I am using an approach I modified from How to aggregate over rolling time window with groups in Spark

But it is not working and I am at a loss.

I expect the first row to yield 1, the second 4, the third 3, the fourth 2 and the fifth 1.

I would also like to get the maximum index of the txns that are counted (I have not worked on that yet; if I get the first part working I think I can modify it to do that too).

from pyspark.sql.types import *
from pyspark.sql import Row
from pyspark.sql import functions as F
from pyspark.sql.window import Window

schema = StructType([
    StructField("datetime_string", StringType(), True),
    StructField("group_by", StringType(), True),
    StructField("index", IntegerType(), True)
])

df = spark.createDataFrame([Row(datetime_string='2022-01-05 04:12:18', group_by='group1', index=1),
                            Row(datetime_string='2022-01-05 02:43:19', group_by='group1', index=2),
                            Row(datetime_string='2022-01-05 02:23:30', group_by='group1', index=3),
                            Row(datetime_string='2022-01-05 02:20:38', group_by='group1', index=4),
                            Row(datetime_string='2022-01-05 02:18:39', group_by='group1', index=5)
                            ],schema)

#create the timestamp and a version in seconds for math later on
df = (df
      .withColumn('timestamp',F.to_timestamp(F.col('datetime_string')))
      .withColumn('ts_seconds',F.col('timestamp').cast('long'))
     )


w = (Window()
     .partitionBy(F.col("group_by"))
     .orderBy(F.desc('ts_seconds'))
     .rowsBetween(Window.currentRow, Window.unboundedFollowing)
     )

diff = F.max('ts_seconds').over(w) - F.col('ts_seconds')
indicator = (diff < 60*60).cast("integer")
subgroup = F.sum(indicator).over(w)

df1 = (df
       .withColumn('txns_last_hour',subgroup)
      )

df1.show()

What I get is this:

+-------------------+--------+-----+-------------------+----------+--------------+
|    datetime_string|group_by|index|          timestamp|ts_seconds|txns_last_hour|
+-------------------+--------+-----+-------------------+----------+--------------+
|2022-01-05 04:12:18|  group1|    1|2022-01-05 04:12:18|1641377538|             5|
|2022-01-05 02:43:19|  group1|    2|2022-01-05 02:43:19|1641372199|             4|
|2022-01-05 02:23:30|  group1|    3|2022-01-05 02:23:30|1641371010|             3|
|2022-01-05 02:20:38|  group1|    4|2022-01-05 02:20:38|1641370838|             2|
|2022-01-05 02:18:39|  group1|    5|2022-01-05 02:18:39|1641370719|             1|
+-------------------+--------+-----+-------------------+----------+--------------+

If someone would be gracious enough to explain what I am doing wrong, I would appreciate it.

blackbishop
user1617979

1 Answer


Use rangeBetween instead of rowsBetween for the Window frame boundaries: rowsBetween defines the frame as a fixed number of rows, while rangeBetween bounds it by the value of the ordering column (here, seconds). Also, your Window ordering isn't correct. From the result you expect, you need to order ascending, not descending, so that the frame (-3600 seconds, current row) covers the preceding hour.

Try with following modifications:

w = (Window()
     .partitionBy(F.col("group_by"))
     .orderBy('ts_seconds')
     .rangeBetween(-60*60, Window.currentRow)
     )

df1 = (df
       .withColumn('txns_last_hour', F.count("*").over(w))
       .orderBy(F.desc("ts_seconds"))
       )

df1.show()

#+-------------------+--------+-----+-------------------+----------+--------------+
#|    datetime_string|group_by|index|          timestamp|ts_seconds|txns_last_hour|
#+-------------------+--------+-----+-------------------+----------+--------------+
#|2022-01-05 04:12:18|  group1|    1|2022-01-05 04:12:18|1641352338|             1|
#|2022-01-05 02:43:19|  group1|    2|2022-01-05 02:43:19|1641346999|             4|
#|2022-01-05 02:23:30|  group1|    3|2022-01-05 02:23:30|1641345810|             3|
#|2022-01-05 02:20:38|  group1|    4|2022-01-05 02:20:38|1641345638|             2|
#|2022-01-05 02:18:39|  group1|    5|2022-01-05 02:18:39|1641345519|             1|
#+-------------------+--------+-----+-------------------+----------+--------------+
blackbishop