If this process runs indefinitely, my intermediate states will keep piling up and hit the memory limits on the nodes. So when are these intermediate states cleared? I found that in the case of event-time aggregation, watermarks dictate when the intermediate states are cleared.
Apache Spark will mark them as expired after the expiration time, so in your example after 4 hours of inactivity (real clock time + 4 hours, where inactivity means no new event updating the state).
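For illustration, here is a minimal sketch of what such a 4-hour processing-time timeout looks like on the user side. The Event, SessionState and SessionSummary types are hypothetical; only the org.apache.spark.sql.streaming API calls are real:

import org.apache.spark.sql.streaming.{GroupState, GroupStateTimeout, OutputMode}

// Hypothetical types, just for the example
case class Event(userId: String, value: Long)
case class SessionState(count: Long)
case class SessionSummary(userId: String, count: Long, expired: Boolean)

def trackSessions(
    userId: String,
    events: Iterator[Event],
    state: GroupState[SessionState]): Iterator[SessionSummary] = {
  if (state.hasTimedOut) {
    // Invoked for an expired key (see processTimedOutState below): no new events
    val summary = SessionSummary(userId, state.get.count, expired = true)
    state.remove() // see the discussion about state removal at the end
    Iterator(summary)
  } else {
    // Invoked for an active key (see processNewData below)
    val newCount = state.getOption.map(_.count).getOrElse(0L) + events.size
    state.update(SessionState(newCount))
    // Re-armed on every micro-batch with activity: 4 hours without any new
    // event for this key makes the state eligible for timeout processing
    state.setTimeoutDuration("4 hours")
    Iterator.empty
  }
}

// Wiring (sketch), where events is a streaming Dataset[Event]:
// events.groupByKey(_.userId)
//   .flatMapGroupsWithState(OutputMode.Update, GroupStateTimeout.ProcessingTimeTimeout)(trackSessions)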
What does timing out the intermediate state mean in the context of processing time?
It means that the state will time out according to the real clock (processing time, the org.apache.spark.util.SystemClock class). You can check which clock is currently used by analyzing the triggerClock parameter of org.apache.spark.sql.streaming.StreamingQueryManager#startQuery.
You will find more details in the FlatMapGroupsWithStateExec class, and more particularly here:
// Generate an iterator that returns the rows grouped by the grouping function
// Note that this code ensures that the filtering for timeout occurs only after
// all the data has been processed. This is to ensure that the timeout information of all
// the keys with data is updated before they are processed for timeouts.
val outputIterator =
  processor.processNewData(filteredIter) ++ processor.processTimedOutState()
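The ++ concatenation is what enforces the ordering described in the comment: the right-hand side of Iterator.++ is a by-name parameter, so the timeout scan only starts once the new data has been fully consumed. A plain-Scala sketch of that behaviour (the names are made up):

// Plain Scala, no Spark: the right-hand side of Iterator.++ is not
// evaluated until the left iterator is exhausted.
val newData = Iterator("k1", "k2").map { k => println(s"processing $k"); k }
def timedOut: Iterator[String] = { println("scanning for timeouts"); Iterator("k3") }

val output = newData ++ timedOut
output.foreach(println)
// prints: processing k1, k1, processing k2, k2, scanning for timeouts, k3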
And if you analyze these two methods, you will see that processNewData applies the mapping function to all active keys (present in the micro-batch):
/**
 * For every group, get the key, values and corresponding state and call the function,
 * and return an iterator of rows
 */
def processNewData(dataIter: Iterator[InternalRow]): Iterator[InternalRow] = {
  val groupedIter = GroupedIterator(dataIter, groupingAttributes, child.output)
  groupedIter.flatMap { case (keyRow, valueRowIter) =>
    val keyUnsafeRow = keyRow.asInstanceOf[UnsafeRow]
    callFunctionAndUpdateState(
      stateManager.getState(store, keyUnsafeRow),
      valueRowIter,
      hasTimedOut = false)
  }
}
and processTimedOutState calls the mapping function on all expired states:
def processTimedOutState(): Iterator[InternalRow] = {
  if (isTimeoutEnabled) {
    val timeoutThreshold = timeoutConf match {
      case ProcessingTimeTimeout => batchTimestampMs.get
      case EventTimeTimeout => eventTimeWatermark.get
      case _ =>
        throw new IllegalStateException(
          s"Cannot filter timed out keys for $timeoutConf")
    }
    val timingOutPairs = stateManager.getAllState(store).filter { state =>
      state.timeoutTimestamp != NO_TIMESTAMP && state.timeoutTimestamp < timeoutThreshold
    }
    timingOutPairs.flatMap { stateData =>
      callFunctionAndUpdateState(stateData, Iterator.empty, hasTimedOut = true)
    }
  } else Iterator.empty
}
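To make the filter concrete, here is a toy re-enactment in plain Scala (StateRow and the sample values are made up; in Spark, NO_TIMESTAMP is a sentinel defined in GroupStateImpl and the threshold is batchTimestampMs or eventTimeWatermark, as above):

// Plain Scala sketch of the filter above: keys with no timeout armed carry the
// NO_TIMESTAMP sentinel and therefore never qualify as expired.
val NO_TIMESTAMP = -1L
case class StateRow(key: String, timeoutTimestamp: Long)

val allState = Seq(
  StateRow("armed-and-expired", 1000L),
  StateRow("armed-not-expired", 9000L),
  StateRow("no-timeout-set", NO_TIMESTAMP))

val timeoutThreshold = 5000L // batchTimestampMs or eventTimeWatermark
val timingOut = allState.filter { s =>
  s.timeoutTimestamp != NO_TIMESTAMP && s.timeoutTimestamp < timeoutThreshold
}
// timingOut == Seq(StateRow("armed-and-expired", 1000L))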
An important point to notice here is that Apache Spark will keep an expired state in the state store if you don't invoke the GroupState#remove method. The expired states won't be returned for timeout processing again though, because their timeout timestamp is written back as the NO_TIMESTAMP sentinel, which the filter shown above skips. However, they will still be stored in the state store delta files, which may slow down reprocessing if you need to reload the most recent state. If you analyze FlatMapGroupsWithStateExec again, you will see that the state is physically removed only when the hasRemoved flag is set to true and no new timeout was armed:
def callFunctionAndUpdateState(...) = {
  // ...
  // When the iterator is consumed, then write changes to state
  def onIteratorCompletion: Unit = {
    if (groupState.hasRemoved && groupState.getTimeoutTimestamp == NO_TIMESTAMP) {
      stateManager.removeState(store, stateData.keyRow)
      numUpdatedStateRows += 1
    } else {
      val currentTimeoutTimestamp = groupState.getTimeoutTimestamp
      val hasTimeoutChanged = currentTimeoutTimestamp != stateData.timeoutTimestamp
      val shouldWriteState = groupState.hasUpdated || groupState.hasRemoved || hasTimeoutChanged
      if (shouldWriteState) {
        val updatedStateObj = if (groupState.exists) groupState.get else null
        stateManager.putState(store, stateData.keyRow, updatedStateObj, currentTimeoutTimestamp)
        numUpdatedStateRows += 1
      }
    }
  }
  // ...
}
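Read from the user side, the condition above means that GroupState#remove alone leads to a physical delete. A sketch of the two outcomes (the function names are made up, and the "remove then re-arm" behaviour is my reading of the branch above; it assumes processing-time timeouts, since setTimeoutDuration is only valid with ProcessingTimeTimeout):

import org.apache.spark.sql.streaming.GroupState

// remove() sets hasRemoved and resets the timeout to NO_TIMESTAMP, so the
// first branch above fires and stateManager.removeState deletes the row.
def expireAndDelete[S](state: GroupState[S]): Unit = {
  state.remove()
}

// If a new timeout is armed after remove(), getTimeoutTimestamp is no longer
// NO_TIMESTAMP, so the else branch fires instead: putState keeps the row
// (with a null state object) as a timeout-only placeholder.
def expireButKeepPlaceholder[S](state: GroupState[S]): Unit = {
  state.remove()
  state.setTimeoutDuration("1 hour")
}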