I have a df like the one in the example below; it is sorted by timestamp in descending order. I want to calculate, per row, the number of txns in the last hour. It would be at least 1 (the current row), but could of course be more.
I am using an approach I modified from How to aggregate over rolling time window with groups in Spark, but it is not working and I am at a loss.
I expect the first row to yield 1, the second 4, the third 3, the fourth 2, and the fifth 1.
I would also like to get the maximum index of the txns that are counted (I have not worked on that yet; if I get the first part working, I think I can modify it to do that too).
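To make the expected counts concrete, here is a small pure-Python check of the one-hour logic against the same timestamps (assuming "last hour" means strictly less than 3600 seconds before each txn, inclusive of the txn itself):

```python
from datetime import datetime

# timestamps in descending order, matching the example data
ts = ['2022-01-05 04:12:18', '2022-01-05 02:43:19', '2022-01-05 02:23:30',
      '2022-01-05 02:20:38', '2022-01-05 02:18:39']
times = [datetime.strptime(t, '%Y-%m-%d %H:%M:%S') for t in ts]

# for each txn, count the txns at or before it that fall within the last hour
counts = [sum(1 for u in times if 0 <= (t - u).total_seconds() < 3600)
          for t in times]
print(counts)  # [1, 4, 3, 2, 1]
```

This matches the expected output above, so the counting rule itself is the one I want; the question is how to express it in Spark.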
from pyspark.sql.types import *
from pyspark.sql import Row
from pyspark.sql import functions as F
from pyspark.sql.window import Window
schema = StructType([
    StructField("datetime_string", StringType(), True),
    StructField("group_by", StringType(), True),
    StructField("index", IntegerType(), True)
])
df = spark.createDataFrame([Row(datetime_string='2022-01-05 04:12:18', group_by='group1', index=1),
Row(datetime_string='2022-01-05 02:43:19', group_by='group1', index=2),
Row(datetime_string='2022-01-05 02:23:30', group_by='group1', index=3),
Row(datetime_string='2022-01-05 02:20:38', group_by='group1', index=4),
Row(datetime_string='2022-01-05 02:18:39', group_by='group1', index=5)
],schema)
#create the timestamp and a version in seconds for math later on
df = (df
.withColumn('timestamp',F.to_timestamp(F.col('datetime_string')))
.withColumn('ts_seconds',F.col('timestamp').cast('long'))
)
w = (Window()
.partitionBy(F.col("group_by"))
.orderBy(F.desc('ts_seconds'))
.rowsBetween(Window.currentRow, Window.unboundedFollowing)
)
diff = F.max('ts_seconds').over(w) - F.col('ts_seconds')
indicator = (diff < 60*60 ).cast("Integer")
subgroup = F.sum(indicator).over(w)
df1 = (df
.withColumn('txns_last_hour',subgroup)
)
df1.show()
What I get is this:
+-------------------+--------+-----+-------------------+----------+--------------+
|    datetime_string|group_by|index|          timestamp|ts_seconds|txns_last_hour|
+-------------------+--------+-----+-------------------+----------+--------------+
|2022-01-05 04:12:18|  group1|    1|2022-01-05 04:12:18|1641377538|             5|
|2022-01-05 02:43:19|  group1|    2|2022-01-05 02:43:19|1641372199|             4|
|2022-01-05 02:23:30|  group1|    3|2022-01-05 02:23:30|1641371010|             3|
|2022-01-05 02:20:38|  group1|    4|2022-01-05 02:20:38|1641370838|             2|
|2022-01-05 02:18:39|  group1|    5|2022-01-05 02:18:39|1641370719|             1|
+-------------------+--------+-----+-------------------+----------+--------------+
If someone would be gracious enough to explain what I am doing wrong, I would appreciate it.
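For the second part I mentioned (the maximum index of the counted txns), this is the result I would expect, again checked with a plain-Python sketch rather than Spark; since the index increases as time decreases, the max index in each window should be the oldest counted txn:

```python
from datetime import datetime

# (datetime_string, index) pairs, descending in time as in the df above
rows = [
    ('2022-01-05 04:12:18', 1),
    ('2022-01-05 02:43:19', 2),
    ('2022-01-05 02:23:30', 3),
    ('2022-01-05 02:20:38', 4),
    ('2022-01-05 02:18:39', 5),
]
parsed = [(datetime.strptime(s, '%Y-%m-%d %H:%M:%S'), i) for s, i in rows]

# for each row, take the max index among txns within the previous hour
max_index = []
for t, _ in parsed:
    in_window = [i for u, i in parsed if 0 <= (t - u).total_seconds() < 3600]
    max_index.append(max(in_window))
print(max_index)  # [1, 5, 5, 5, 5]
```

So once the per-row count is correct, I expect the max index to come from the same window with a max aggregate instead of a sum.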