
How can I calculate aggregations over a window on sensor data, when new events are only sent if the sensor value has changed since the last event? The readings are taken at fixed intervals, e.g. every 5 seconds, but a reading is only forwarded if it differs from the previous one.

So, if I would like to calculate the average signal_strength for each device:

eventsDF = ... 
avgSignalDF = eventsDF.groupBy("deviceId").avg("signal_strength")

For example, the events sent by the device for a one-minute window:

event_time  device_id  signal_strength
12:00:00    1          5
12:00:05    1          4
12:00:30    1          5
12:00:45    1          6
12:00:55    1          5

The same dataset with the events that aren't actually sent filled in:

event_time  device_id  signal_strength
12:00:00    1          5
12:00:05    1          4
12:00:10    1          4
12:00:15    1          4
12:00:20    1          4
12:00:25    1          4
12:00:30    1          5
12:00:35    1          5
12:00:40    1          5
12:00:45    1          6
12:00:50    1          6
12:00:55    1          5

The signal_strength sum is 57 and the average is 57/12 = 4.75.
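A quick sanity check of that arithmetic in plain Scala, using the filled-in values from the table above:

val filled = Seq(5, 4, 4, 4, 4, 4, 5, 5, 5, 6, 6, 5)
val avg = filled.sum.toDouble / filled.size  // 57 / 12 = 4.75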

How can this missing data be inferred by Spark Structured Streaming, and the average calculated from the inferred values?

Note: I have used average as an example of an aggregation, but the solution needs to work for any aggregation function.

Chris Snow
  • You need a dataset holding each device id with its previous average, and a left join to the events received: take the device id from the left and the value from the right, and if it is null, use the previous average from the right. – maxmithun Oct 10 '18 at 22:47
  • Would you be able to show this with an example data set and code snippet? – Chris Snow Oct 11 '18 at 17:10
  • [Arbitrary Stateful Operations](https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html#arbitrary-stateful-operations)? – zero323 Oct 13 '18 at 11:20

2 Answers


EDITED:

I have modified the logic to compute the average from the filtered dataframe only, so that it accounts for the gaps.

import org.apache.spark.sql.Dataset
import org.apache.spark.sql.streaming.{GroupState, GroupStateTimeout, OutputMode}

//input structure
case class StreamInput(event_time: Long, device_id: Int, signal_strength: Int)
//columns for which we want to maintain state
case class StreamState(prevSum: Int, prevRowCount: Int, prevTime: Long, prevSignalStrength: Int, currentTime: Long, totalRow: Int, totalSum: Int, avg: Double)
//final result structure
case class StreamResult(event_time: Long, device_id: Int, signal_strength: Int, avg: Double)

val filteredDF: Dataset[StreamInput] = ???  //get input (filtered rows only)

val interval = 5  // event_time interval in seconds

// using .mapGroupsWithState to maintain state for the running sum & total row count so far

// the timeout controls how long the state is kept; NoTimeout keeps it indefinitely
val avgDF = filteredDF.groupByKey(_.device_id)
  .mapGroupsWithState[StreamState, StreamResult](GroupStateTimeout.NoTimeout()) {

  case (id: Int, eventIter: Iterator[StreamInput], state: GroupState[StreamState]) => {
    // note: only the last event per device in each batch is used,
    // as in the example data where each batch carries one event
    val events = eventIter.toSeq

    val updatedSession = if (state.exists) {
      //if state exists, update it with the new values
      val existingState = state.get

      val prevTime = existingState.currentTime
      val currentTime = events.map(x => x.event_time).last
      //number of 5-second slots between the previous and the current event
      val currentRowCount = (currentTime - prevTime) / interval
      val rowCount = existingState.totalRow + currentRowCount.toInt
      val currentSignalStrength = events.map(x => x.signal_strength).last

      //the missed rows all carry the previous signal_strength
      val total_signal_strength = currentSignalStrength +
        (existingState.prevSignalStrength * (currentRowCount - 1)) +
        existingState.totalSum

      StreamState(
        existingState.totalSum,
        existingState.totalRow,
        prevTime,
        currentSignalStrength,
        currentTime,
        rowCount,
        total_signal_strength.toInt,
        total_signal_strength / rowCount.toDouble
      )

    } else {
      // if there is no earlier state
      val runningSum = events.map(x => x.signal_strength).sum
      val size = events.size.toDouble
      val currentTime = events.map(x => x.event_time).last
      StreamState(0, 1, 0, runningSum, currentTime, 1, runningSum, runningSum / size)
    }

    //save the updated state
    state.update(updatedSession)
    StreamResult(
      events.map(x => x.event_time).last,
      id,
      events.map(x => x.signal_strength).last,
      updatedSession.avg
    )
  }
}

val result = avgDF
  .writeStream
  .outputMode(OutputMode.Update())
  .format("console")
  .start()

The idea is to calculate two new columns:

  1. totalRowCount: the running total of the number of rows that would be present if nothing had been filtered out.
  2. total_signal_strength: the running total of signal_strength so far (this INCLUDES the totals of the missed rows).

It's calculated by:

total_signal_strength =
  current row's signal_strength +
  (previous row's signal_strength * (rowCount - 1)) +
  //rowCount is the number of 5-second slots between the previous and the current event_time, so rowCount - 1 is the number of missed rows
  previous total_signal_strength
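For example, for the event at 12:00:30, using the values from the state table below (previous event at t=5 with signal_strength 4 and running total 9):

val currentRowCount = (30 - 5) / 5             // 5 slots between the two events
val total = 5 + 4 * (currentRowCount - 1) + 9  // 5 + 16 + 9 = 30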

Format of the intermediate state:

+----------+---------+---------------+---------------------+--------+
|event_time|device_id|signal_strength|total_signal_strength|rowCount|
+----------+---------+---------------+---------------------+--------+
|         0|        1|              5|                    5|       1|
|         5|        1|              4|                    9|       2|
|        30|        1|              5|                   30|       7|
|        45|        1|              6|                   46|      10|
|        55|        1|              5|                   57|      12|
+----------+---------+---------------+---------------------+--------+

Final output:

+----------+---------+---------------+-----------------+
|event_time|device_id|signal_strength|              avg|
+----------+---------+---------------+-----------------+
|         0|        1|              5|              5.0|
|         5|        1|              4|              4.5|
|        30|        1|              5|4.285714285714286|
|        45|        1|              6|              4.6|
|        55|        1|              5|             4.75|
+----------+---------+---------------+-----------------+
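For completeness, a minimal sketch of how filteredDF could be produced, assuming a SparkSession named spark and events arriving on a socket as CSV lines (the socket source, host/port and line format here are assumptions for illustration):

import spark.implicits._

val filteredDF = spark.readStream
  .format("socket")
  .option("host", "localhost")
  .option("port", 9999)
  .load()
  .as[String]
  .map { line =>
    // expected line format: "event_time,device_id,signal_strength"
    val Array(t, id, s) = line.split(",")
    StreamInput(t.trim.toLong, id.trim.toInt, s.trim.toInt)
  }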
vdep

This is mathematically equivalent to a duration-weighted average problem:

avg = sum(signal_strength * duration) / 60

The challenge here is to get the duration of each signal. One option is, for each micro-batch, to collect the results on the driver; then it is a pure statistics problem. To get the duration you can left-shift the start times by one and subtract, something like this:

window.start.leftShift(1)-window.start

which would give you:

event_time  device_id  signal_strength  duration
12:00:00    1          5                5  (5-0)
12:00:05    1          4                25 (30-5)
12:00:30    1          5                15 (45-30)
12:00:45    1          6                10 (55-45)
12:00:55    1          5                5  (60-55)

(5*5 + 4*25 + 5*15 + 6*10 + 5*5) / 60 = 57/12 = 4.75
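In batch Spark (e.g. on the rows collected from a micro-batch), the shift-and-subtract can be expressed with the lead window function; a sketch, assuming event_time is stored as seconds within the one-minute window:

import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{coalesce, col, lead, lit, sum}

val w = Window.partitionBy("device_id").orderBy("event_time")

val weightedAvg = eventsDF
  .withColumn("next_time", coalesce(lead("event_time", 1).over(w), lit(60L)))  // last row runs to the window end
  .withColumn("duration", col("next_time") - col("event_time"))
  .groupBy("device_id")
  .agg((sum(col("signal_strength") * col("duration")) / 60).as("avg"))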

As of Spark Structured Streaming 2.3.2, you need to write your own customized sink to collect the result of each micro-batch on the driver and do the math there.
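For reference, Spark 2.4 added foreachBatch, which avoids writing a full custom sink; a sketch (the batch-handling body is an assumption for illustration):

import org.apache.spark.sql.DataFrame

val query = eventsDF.writeStream
  .foreachBatch { (batchDF: DataFrame, batchId: Long) =>
    // each micro-batch arrives as a plain DataFrame, so the
    // window-function computation above can be applied to it directly
    // ... compute the duration-weighted average here ...
  }
  .start()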

dunlu_98k