
(image: pipeline diagram)

Based on the image above, I need to share state between two operators. At the moment, one KeyedProcessFunction processes incoming events, converts them from class X to class Y, and keeps keyed state so that the latest version of each Y is always sent to the Python inference function.

The result from the inference function then needs to be mapped back into class Y, updating the state of the object already created in the ProcessFunction, before the sink.

As far as I've read, broadcast state is not possible when using RocksDB: "No RocksDB state backend: Broadcast state is kept in-memory at runtime and memory provisioning should be done accordingly. This holds for all operator states."

Questions:

  1. What is the best way of doing that as I'm using RocksDB as State Backend?
  2. Is it possible to share states between a KeyedProcessFunction and a RichMapFunction?
Alter

1 Answer


You can use broadcast state when using RocksDB as your state backend. The broadcast state won't be stored in RocksDB -- it will be on the heap -- but it will be checkpointed. So the broadcast state needs to be small enough to fit into memory. (Moreover, each task will independently checkpoint a copy of the broadcast state.)

However, I don't think broadcast state will help with this use case. It only broadcasts the state to all of the instances of a single operator.
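To make that limitation concrete, here is a plain-Python simulation (not the Flink API; the class and method names are invented for illustration) of broadcast-state semantics: every parallel instance of one operator gets its own in-memory copy of each broadcast element, and no other operator sees it.

```python
# Simulation of Flink's broadcast-state semantics (illustrative only,
# NOT the actual Flink API). Every parallel instance of ONE operator
# receives its own copy of each broadcast element; other operators
# in the job see nothing.

class OperatorInstance:
    def __init__(self, index):
        self.index = index
        self.broadcast_state = {}   # kept on the heap, one copy per instance

    def process_broadcast(self, key, value):
        # Each instance updates its own copy; at checkpoint time,
        # each task independently snapshots its copy.
        self.broadcast_state[key] = value

# One operator with parallelism 3
instances = [OperatorInstance(i) for i in range(3)]

# Broadcasting an element delivers it to all instances of that operator
for inst in instances:
    inst.process_broadcast("model_version", 7)

assert all(inst.broadcast_state == {"model_version": 7} for inst in instances)
```

Note that the three copies are identical but independent, which is why each checkpoint stores one copy per task.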

You can't share state between operators. State is strictly local. You could stream the output of the process function into the RichMapFunction so that it has the necessary information. The map wouldn't be able to directly affect the state stored in the process function, but it could have its own copy of that state.
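A minimal sketch of that pattern (plain Python, not the Flink API; the class and field names are made up): the process function emits records that carry the latest state, and the downstream map maintains its own independent copy built from those records.

```python
# Sketch of "forward the state downstream": the keyed process function
# emits (key, Y) records containing the latest state, and the map keeps
# its OWN copy -- it never touches the process function's state.

class KeyedProcess:
    def __init__(self):
        self.state = {}             # keyed state: latest Y per key

    def process(self, key, x):
        y = {"value": x * 2}        # convert X -> Y (placeholder logic)
        self.state[key] = y         # update keyed state
        return (key, y)             # emit the latest Y downstream

class DownstreamMap:
    def __init__(self):
        self.local_copy = {}        # independent copy, local to this operator

    def map(self, record):
        key, y = record
        self.local_copy[key] = y    # mirror the state from the record itself
        return {"key": key, "inference_input": y["value"]}

proc, mapper = KeyedProcess(), DownstreamMap()
out = [mapper.map(proc.process(k, x)) for k, x in [("u1", 1), ("u2", 3), ("u1", 5)]]

assert proc.state["u1"] == {"value": 10}
assert mapper.local_copy == proc.state   # same content, but separate copies
```

The map's copy stays in sync only because every record carries the full latest state; it can never write back into the process function's state.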

However, it sounds like you want the output of the inference function to modify the state in the process function. The DataStream API doesn't allow loops like that in the data flow. But you have a couple of options:

(1) Stream the results from the inference function out to something like kafka/kinesis, and then add that stream as another input to the process function. (In other words, loops are possible if you use an external message queue to decouple things. Of course, this adds latency.)
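Option (1) can be sketched in plain Python (not the Flink API; a `deque` stands in for the Kafka/Kinesis topic, and the names are invented): the inference results leave the job, sit in the external queue, and re-enter the process function as a second input, closing the loop outside the job graph.

```python
from collections import deque

# Simulation of option (1): results from the inference function go to an
# external queue (a deque stands in for Kafka/Kinesis) and re-enter the
# process function as a second input, closing the loop outside the job.

feedback_queue = deque()            # stands in for the Kafka topic

class TwoInputProcess:
    def __init__(self):
        self.state = {}             # latest Y per key

    def process_event(self, key, x):
        # First input: the original event stream
        self.state[key] = {"value": x, "score": None}
        return (key, self.state[key])

    def process_feedback(self, key, score):
        # Second input: inference results read back from the queue
        self.state[key]["score"] = score

def inference(record):
    key, y = record
    feedback_queue.append((key, y["value"] + 0.5))   # placeholder model

proc = TwoInputProcess()
inference(proc.process_event("u1", 4))               # forward path
key, score = feedback_queue.popleft()                # consume from "Kafka"
proc.process_feedback(key, score)                    # feedback path

assert proc.state["u1"] == {"value": 4, "score": 4.5}
```

The extra hop through the queue is exactly where the added latency mentioned above comes from.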

(2) Use the Stateful Functions API. It offers arbitrary communication patterns between stateful components (you aren't limited to a DAG), has excellent Python support, and much more. And all of this runs on top of the Flink runtime, so you get the same guarantees regarding consistency, exactly-once processing, scalability, etc.
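A rough plain-Python toy model (not the StateFun SDK; the runtime, function names, and message format are all invented) of the communication pattern Stateful Functions offers: addressable functions with private state can message each other in any direction, so the inference function can reply directly to the function that owns the user's state.

```python
# Toy model of the Stateful Functions idea (NOT the statefun SDK):
# addressable functions with private per-key state that can message
# each other in any direction -- no DAG restriction, so "inference"
# can reply straight to the "user" function that owns the state.

class Runtime:
    def __init__(self):
        self.functions, self.state = {}, {}

    def bind(self, name, fn):
        self.functions[name] = fn

    def send(self, target, key, message):
        # Deliver the message along with the target's private state
        state = self.state.setdefault((target, key), {})
        self.functions[target](self, state, key, message)

rt = Runtime()

def user_fn(rt, state, key, msg):
    if msg["type"] == "event":
        state["y"] = {"value": msg["x"]}                 # X -> Y, keep state
        rt.send("inference", key, {"type": "infer", "value": msg["x"]})
    elif msg["type"] == "result":
        state["y"]["score"] = msg["score"]               # update Y in place

def inference_fn(rt, state, key, msg):
    # Placeholder model; replies directly to the state owner
    rt.send("user", key, {"type": "result", "score": msg["value"] * 0.1})

rt.bind("user", user_fn)
rt.bind("inference", inference_fn)
rt.send("user", "u1", {"type": "event", "x": 20})

assert rt.state[("user", "u1")]["y"] == {"value": 20, "score": 2.0}
```

This is the shape the question asks for: the result of inference lands back in the state of the component that created it, without routing through an external queue.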

David Anderson
  • I'm not sure whether I explained myself well, but the idea is that the state created in the `ProcessFunction` needs to be updated from the `RichFlatMap` (or whatever operator) after the Python inference function sends the results. So I need to somehow catch the state and update it from the `RichFlatMap` (or whatever operator) to always have the latest state of the user in the `ProcessFunction`. – Alter Nov 02 '20 at 15:11
  • I just saw the Stephan Ewen presentation about `Stateful Functions`, and as you mentioned it looks pretty much like what I need. Are there any samples of how to share information between operators using `Stateful Functions`? – Alter Nov 02 '20 at 15:49
  • Is there any example of how to integrate DataStream API with Stateful Function API? – Alter Nov 02 '20 at 16:44
  • I believe there are tests that use this mechanism, but I'm not sure if there are any examples per se. – David Anderson Nov 03 '20 at 13:29