I'm using Spark Structured Streaming to analyze sensor data and need to perform calculations based on a sensor's previous timestamp. My incoming data stream has three columns: sensor_id, timestamp, and temp. I need to add a fourth column that holds that sensor's previous timestamp, so that I can then calculate the time between data points for each sensor.

This is easy in traditional batch processing: use a lag function over a window partitioned by sensor_id. What is the best way to approach this in a streaming situation?

So, for example, if my streaming DataFrame looked like this:

+----------+-----------+------+
| SensorId | Timestamp | Temp |
+----------+-----------+------+
|     1800 |        34 |   23 |
|      500 |        36 |   54 |
|     1800 |        45 |   23 |
|      500 |        60 |   54 |
|     1800 |        78 |   23 |
+----------+-----------+------+

I would like something like this:

+----------+-----------+------+---------+
| SensorId | Timestamp | Temp | Prev_ts |
+----------+-----------+------+---------+
|     1800 |        34 |   23 |      21 |
|      500 |        36 |   54 |      27 |
|     1800 |        45 |   23 |      34 |
|      500 |        60 |   54 |      36 |
|     1800 |        78 |   23 |      45 |
+----------+-----------+------+---------+

If I try

from pyspark.sql.functions import lag
from pyspark.sql.window import Window

test = filteredData.withColumn("prev_ts", lag("ts").over(Window.partitionBy("sensor_id").orderBy("ts")))

I get an AnalysisException: 'Non-time-based windows are not supported on streaming DataFrames/Datasets'.

Could I save the previous timestamp of each sensor in a data structure that I could reference and then update with each new timestamp?
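(For reference, keeping per-sensor state like this is what Spark's arbitrary stateful processing is for. Below is a minimal sketch, assuming Spark 3.4+, where PySpark exposes applyInPandasWithState; on earlier versions the equivalent is mapGroupsWithState/flatMapGroupsWithState in Scala/Java. filteredData is the stream from the attempt above, and with_prev_ts is an illustrative name.)

from pyspark.sql.streaming.state import GroupState, GroupStateTimeout

def with_prev_ts(key, batches, state: GroupState):
    # State holds one value per sensor: the last timestamp seen so far.
    last_ts = state.get[0] if state.exists else None
    for pdf in batches:  # each pdf is a pandas DataFrame for this sensor
        pdf = pdf.sort_values("Timestamp")
        prev = pdf["Timestamp"].shift(1)       # previous row within this batch
        if last_ts is not None:
            prev.iloc[0] = last_ts             # carry state across batches
        pdf["Prev_ts"] = prev.astype("Int64")  # nullable: the very first row has no prev
        last_ts = int(pdf["Timestamp"].iloc[-1])
        yield pdf
    if last_ts is not None:
        state.update((last_ts,))               # persist for the next trigger

result = filteredData.groupBy("SensorId").applyInPandasWithState(
    with_prev_ts,
    outputStructType="SensorId long, Timestamp long, Temp long, Prev_ts long",
    stateStructType="last_ts long",
    outputMode="append",
    timeoutConf=GroupStateTimeout.NoTimeout,
)

This sketch assumes events for a given sensor arrive in order across micro-batches; late or out-of-order data would need a watermark and extra handling, and the state survives restarts only via checkpointing.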

  • I took the GroupState suggestion and have been working off of [this blog post](https://databricks.com/blog/2017/10/17/arbitrary-stateful-processing-in-apache-sparks-structured-streaming.html), but have yet to get anything working in a Databricks notebook. Please let me know if you figure something out! – Matt S. Mar 01 '18 at 15:04
  • Hi Matt, did you find a solution to this problem? – Nagesh May 05 '19 at 07:36
  • Hey, did you guys figure out how to solve this problem? – Learnis Feb 09 '22 at 21:16
  • By storing in a data structure, are you referring to keeping the data in memory for querying? If so, yes: you can create a temp view in Spark and then join against it for the last updated time. But this table will exist only for the lifetime of your running job and will take up memory. – glory9211 Jun 28 '22 at 15:01
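(A rough sketch of the idea in the last comment, using a plain driver-side dict instead of a temp view for brevity. It assumes small per-trigger batches and a global SparkSession named spark; attach_prev_ts is an illustrative name, the map is lost on restart, and collecting to the driver does not scale.)

last_seen = {}  # sensor_id -> last timestamp observed, driver memory only

def attach_prev_ts(batch_df, batch_id):
    rows = batch_df.orderBy("Timestamp").collect()  # only viable for small batches
    out = []
    for r in rows:
        out.append((r["SensorId"], r["Timestamp"], r["Temp"], last_seen.get(r["SensorId"])))
        last_seen[r["SensorId"]] = r["Timestamp"]
    # Replace .show() with whatever sink the job actually needs.
    spark.createDataFrame(out, "SensorId long, Timestamp long, Temp long, Prev_ts long").show()

query = filteredData.writeStream.foreachBatch(attach_prev_ts).start()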

1 Answer


There is no need to "simulate" anything. Standard window functions can be used with Structured Streaming.

from pyspark.sql.functions import lag
from pyspark.sql.window import Window

s = (spark.readStream
    ...
    .load())

s.withColumn("prev_ts", lag("Timestamp").over(
    Window.partitionBy("SensorId").orderBy("Timestamp")
))
  • This code will not work with Structured Streaming, since non-time-based windows are not supported. – Matt S. Feb 10 '18 at 22:07