I'm using Spark Structured Streaming to analyze sensor data and need to perform calculations based on a sensor's previous timestamp. My incoming data stream has three columns: sensor_id, timestamp, and temp. I need to add a fourth column that holds that sensor's previous timestamp, so that I can then calculate the time between data points for each sensor.

This is easy in traditional batch processing: use a lag function over a window partitioned by sensor_id. What is the best way to approach this in a streaming situation?

So, for example, if my streaming DataFrame looked like this:

+----------+-----------+------+
| SensorId | Timestamp | Temp |
+----------+-----------+------+
|     1800 |        34 |   23 |
|      500 |        36 |   54 |
|     1800 |        45 |   23 |
|      500 |        60 |   54 |
|     1800 |        78 |   23 |
+----------+-----------+------+

I would like something like this:

+----------+-----------+------+---------+
| SensorId | Timestamp | Temp | Prev_ts |
+----------+-----------+------+---------+
|     1800 |        34 |   23 |      21 |
|      500 |        36 |   54 |      27 |
|     1800 |        45 |   23 |      34 |
|      500 |        60 |   54 |      36 |
|     1800 |        78 |   23 |      45 |
+----------+-----------+------+---------+

If I try

from pyspark.sql.functions import lag
from pyspark.sql.window import Window

test = filteredData.withColumn("prev_ts", lag("ts").over(Window.partitionBy("sensor_id").orderBy("ts")))

I get an AnalysisException: 'Non-time-based windows are not supported on streaming DataFrames/Datasets'.

Could I save the previous timestamp of each sensor in a data structure that I could reference and then update with each new timestamp?
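(For reference, keeping per-sensor state like this is what Spark's arbitrary stateful processing is for. Below is a minimal sketch, assuming Spark 3.4+, where PySpark exposes applyInPandasWithState; on earlier versions the equivalent is mapGroupsWithState/flatMapGroupsWithState in Scala/Java. filteredData is the stream from the attempt above, and with_prev_ts is an illustrative name.)

from pyspark.sql.streaming.state import GroupState, GroupStateTimeout

def with_prev_ts(key, batches, state: GroupState):
    # State holds one value per sensor: the last timestamp seen so far.
    last_ts = state.get[0] if state.exists else None
    for pdf in batches:  # each pdf is a pandas DataFrame for this sensor
        pdf = pdf.sort_values("Timestamp")
        prev = pdf["Timestamp"].shift(1)       # previous row within this batch
        if last_ts is not None:
            prev.iloc[0] = last_ts             # carry state across batches
        pdf["Prev_ts"] = prev.astype("Int64")  # nullable: the very first row has no prev
        last_ts = int(pdf["Timestamp"].iloc[-1])
        yield pdf
    if last_ts is not None:
        state.update((last_ts,))               # persist for the next trigger

result = filteredData.groupBy("SensorId").applyInPandasWithState(
    with_prev_ts,
    outputStructType="SensorId long, Timestamp long, Temp long, Prev_ts long",
    stateStructType="last_ts long",
    outputMode="append",
    timeoutConf=GroupStateTimeout.NoTimeout,
)

This sketch assumes events for a given sensor arrive in order across micro-batches; late or out-of-order data would need a watermark and extra handling, and the state survives restarts only via checkpointing.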

  • I took the GroupState suggestion and have been working off of [this blog post](https://databricks.com/blog/2017/10/17/arbitrary-stateful-processing-in-apache-sparks-structured-streaming.html), but have yet to get anything working in a Databricks notebook. Please let me know if you figure something out! – Matt S. Mar 01 '18 at 15:04
  • Hi Matt, did you find a solution to this problem? – Nagesh May 05 '19 at 07:36
  • Hey, did you guys figure out how to solve this problem? – Learnis Feb 09 '22 at 21:16
  • By storing in a data structure, are you referring to keeping the data in memory for querying? If so, yes: you can create a temp view in Spark and then join against it for the last updated time. But this table will exist only for the lifetime of your running job and will take up memory. – glory9211 Jun 28 '22 at 15:01
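(A rough sketch of the idea in the last comment, using a plain driver-side dict instead of a temp view for brevity. It assumes small per-trigger batches and a global SparkSession named spark; attach_prev_ts is an illustrative name, the map is lost on restart, and collecting to the driver does not scale.)

last_seen = {}  # sensor_id -> last timestamp observed, driver memory only

def attach_prev_ts(batch_df, batch_id):
    rows = batch_df.orderBy("Timestamp").collect()  # only viable for small batches
    out = []
    for r in rows:
        out.append((r["SensorId"], r["Timestamp"], r["Temp"], last_seen.get(r["SensorId"])))
        last_seen[r["SensorId"]] = r["Timestamp"]
    # Replace .show() with whatever sink the job actually needs.
    spark.createDataFrame(out, "SensorId long, Timestamp long, Temp long, Prev_ts long").show()

query = filteredData.writeStream.foreachBatch(attach_prev_ts).start()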

1 Answer


There is no need to "simulate" anything. Standard window functions can be used with Structured Streaming.

from pyspark.sql.functions import lag
from pyspark.sql.window import Window

s = (spark.readStream
    ...
    .load())

s.withColumn("prev_ts", lag("Timestamp").over(
    Window.partitionBy("SensorId").orderBy("Timestamp")
))
  • This code will not work with Structured Streaming, since non-time-based windows are not supported. – Matt S. Feb 10 '18 at 22:07