
From the Flink 1.5 release announcement, we know that Flink now supports "broadcast state". The announcement says that "broadcast state unblocks the implementation of the “dynamic patterns” feature for Flink’s CEP library."

Does this mean we can currently use "broadcast state" to implement “dynamic patterns” without Flink CEP? Also, I don't understand what difference broadcast state makes when implementing “dynamic patterns” for Flink CEP. I would appreciate it if someone could give a code example that explains the difference.

=============

Update: testing a broadcast DataStream (created with the broadcast() operator) connected to a keyed DataStream

After testing on Flink 1.4.2, I found that a broadcast DataStream (created with the old broadcast() operator) can be connected to a keyed DataStream. The test code is below, and it shows that every control stream event is broadcast to every operator instance. So it seems the old broadcast() can achieve the same functionality as the new "broadcast state".

public static void ConnectBroadToKeyedStream() throws Exception {
    StreamExecutionEnvironment env =
            StreamExecutionEnvironment.getExecutionEnvironment().setParallelism(3);

    List<Tuple1<String>>
            controlData = new ArrayList<Tuple1<String>>();
    controlData.add(new Tuple1<String>("DROP"));
    controlData.add(new Tuple1<String>("IGNORE"));
    DataStream<Tuple1<String>> control = env.fromCollection(controlData);//.keyBy(0);

    List<Tuple1<String>>
            dataStreamData = new ArrayList<Tuple1<String>>();
    dataStreamData.add(new Tuple1<String>("data"));
    dataStreamData.add(new Tuple1<String>("DROP"));
    dataStreamData.add(new Tuple1<String>("artisans"));
    dataStreamData.add(new Tuple1<String>("IGNORE"));
    dataStreamData.add(new Tuple1<String>("IGNORE"));
    dataStreamData.add(new Tuple1<String>("IGNORE"));
    dataStreamData.add(new Tuple1<String>("IGNORE"));

    // DataStream<String> data2 = env.fromElements("data", "DROP", "artisans", "IGNORE");
    DataStream<Tuple1<String>> keyedDataStream = env.fromCollection(dataStreamData).keyBy(0);

    DataStream<String> result = control
            .broadcast()
            .connect(keyedDataStream)
            .flatMap(new MyCoFlatMap());
    result.print();
    env.execute();
}

private static final class MyCoFlatMap
        implements CoFlatMapFunction<Tuple1<String>, Tuple1<String>, String> {
    // Plain field, local to each parallel instance of this function.
    HashSet<Tuple1<String>> blacklist = new HashSet<>();

    @Override
    public void flatMap1(Tuple1<String> control_value, Collector<String> out) {
        blacklist.add(control_value);
        out.collect("listed " + control_value);
    }

    @Override
    public void flatMap2(Tuple1<String> data_value, Collector<String> out) {

        if (blacklist.contains(data_value)) {
            out.collect("skipped " + data_value);
        } else {
            out.collect("passed " + data_value);
        }
    }
}

Below is the test result.

1> passed (data)
1> passed (DROP)
3> passed (artisans)
3> passed (IGNORE)
3> passed (IGNORE)
3> passed (IGNORE)
3> passed (IGNORE)
3> listed (DROP)
3> listed (IGNORE)
1> listed (DROP)
1> listed (IGNORE)
2> listed (DROP)
2> listed (IGNORE)
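
The shape of this output can be reproduced with a small plain-Java model. It is only a sketch of the routing, not Flink code: the names `BroadcastRoutingModel`, `Subtask`, `run` and `controlFirst` are made up for illustration, and `Math.floorMod(hashCode, parallelism)` is a stand-in for Flink's real key-group assignment. It shows that every instance receives every control event, that each data event reaches exactly one instance, and that whether a data event is "passed" or "skipped" depends on which input happens to arrive first, since Flink does not guarantee any ordering between the two inputs of a connected stream.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Plain-Java model of the routing in the test above; no Flink classes are involved.
final class BroadcastRoutingModel {

    /** One parallel operator instance with its own local blacklist. */
    static final class Subtask {
        final Set<String> blacklist = new HashSet<>();
        final List<String> output = new ArrayList<>();

        void onControl(String value) {
            blacklist.add(value);
            output.add("listed " + value);
        }

        void onData(String value) {
            output.add((blacklist.contains(value) ? "skipped " : "passed ") + value);
        }
    }

    /** Models broadcast(): every control element is delivered to every subtask. */
    static void broadcast(List<String> control, List<Subtask> subtasks) {
        for (String value : control) {
            for (Subtask subtask : subtasks) {
                subtask.onControl(value);
            }
        }
    }

    /** Models keyBy(): each data element is delivered to exactly one subtask. */
    static void keyed(List<String> data, List<Subtask> subtasks) {
        for (String value : data) {
            // Assumption: a simple hash stands in for Flink's key-group assignment.
            int target = Math.floorMod(value.hashCode(), subtasks.size());
            subtasks.get(target).onData(value);
        }
    }

    /** Feeds both inputs in one of the two extreme orders; returns each subtask's output. */
    static List<List<String>> run(List<String> control, List<String> data,
                                  int parallelism, boolean controlFirst) {
        List<Subtask> subtasks = new ArrayList<>();
        for (int i = 0; i < parallelism; i++) {
            subtasks.add(new Subtask());
        }
        if (controlFirst) {
            broadcast(control, subtasks);
            keyed(data, subtasks);
        } else {
            keyed(data, subtasks);
            broadcast(control, subtasks);
        }
        List<List<String>> outputs = new ArrayList<>();
        for (Subtask subtask : subtasks) {
            outputs.add(subtask.output);
        }
        return outputs;
    }

    public static void main(String[] args) {
        List<String> control = Arrays.asList("DROP", "IGNORE");
        List<String> data = Arrays.asList("data", "DROP", "artisans", "IGNORE");
        System.out.println("data first:    " + run(control, data, 3, false));
        System.out.println("control first: " + run(control, data, 3, true));
    }
}
```

With the data fed first, every element is "passed" and each instance then lists both control values, which matches the result above. With the control events fed first, "DROP" and "IGNORE" are "skipped" instead.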

https://data-artisans.com/blog/apache-flink-1-5-0-release-announcement

YuFeng Shen

2 Answers


Without broadcast state, two Flink data streams cannot be processed together in a stateful way unless they are keyed in exactly the same way. A broadcast stream can be connected to a keyed stream, but if you then try to use keyed state in a RichCoFlatMapFunction, for example, that will fail.

What is frequently desired is to be able to have one stream with dynamic "rules" that are to be applied to every event on another stream, regardless of key. There needed to be a new kind of managed Flink state in which these rules could be stored. With broadcast state this can now be done in a straightforward way.

With this feature now in place, work on support for dynamic patterns in CEP can begin.
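
To make the distinction between the two kinds of state concrete, here is a minimal plain-Java model. It is a sketch under stated assumptions, not Flink code: `BroadcastStateModel`, `Subtask`, `snapshotRules` and `restore` are hypothetical names, and the two method names merely mirror the `processBroadcastElement` and `processElement` callbacks of Flink's `KeyedBroadcastProcessFunction`. The rule side has no current key, so it can only write the rule map; the data side reads the rules through a read-only view and updates state belonging to its own key; and because the runtime knows about the rule map, it can be snapshotted and restored.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;

// Plain-Java model only; no Flink classes are used.
final class BroadcastStateModel {

    /** One parallel instance of the rule-applying operator. */
    static final class Subtask {
        // Models broadcast state: one map per subtask, filled only from the rule stream.
        private final Map<String, String> rules = new HashMap<>();
        // Models keyed state: one counter per key, touched only while that key is processed.
        private final Map<String, Integer> matchesPerKey = new HashMap<>();

        /** Rule side: there is no current key here, so only the rules are written. */
        void processBroadcastElement(String ruleName, String action) {
            rules.put(ruleName, action);
        }

        /** Data side: sees the rules read-only and may update state of its own key. */
        String processElement(String key, String value) {
            Map<String, String> readOnlyRules = Collections.unmodifiableMap(rules);
            String action = readOnlyRules.get(value);
            if (action == null) {
                return "passed " + value;
            }
            int count = matchesPerKey.merge(key, 1, Integer::sum);
            return action + " " + value + " (match #" + count + " for key " + key + ")";
        }

        /** What a checkpoint would capture for the rules. */
        Map<String, String> snapshotRules() {
            return new HashMap<>(rules);
        }

        /** Rebuilds a subtask from a snapshot; keyed state is left out to keep the model short. */
        static Subtask restore(Map<String, String> snapshot) {
            Subtask subtask = new Subtask();
            subtask.rules.putAll(snapshot);
            return subtask;
        }
    }

    public static void main(String[] args) {
        Subtask subtask = new Subtask();
        subtask.processBroadcastElement("DROP", "skipped");
        System.out.println(subtask.processElement("k1", "data"));
        System.out.println(subtask.processElement("k1", "DROP"));

        // Simulated failure and recovery: the rules survive because they were snapshotted.
        Subtask recovered = Subtask.restore(subtask.snapshotRules());
        System.out.println(recovered.processElement("k2", "DROP"));
    }
}
```

In Flink 1.5 the corresponding pieces are a `MapStateDescriptor` passed to `broadcast(...)` on the rule stream, and a `KeyedBroadcastProcessFunction` applied to the result of connecting the keyed stream with that broadcast stream.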

David Anderson
  • Can't the dataStream.broadcast() operator connect to a keyed DataStream? If it can, what's the difference between using broadcast state and the broadcast() operator? Both seem to have the same effect. – YuFeng Shen May 27 '18 at 16:31
  • So, to be more exact: 1) before Flink 1.5 a broadcast stream cannot connect to a keyed stream, but from Flink 1.5 on it can, by using broadcast state; 2) even before Flink 1.5, a broadcast stream can connect to a non-keyed stream; 3) a keyed stream cannot connect to a non-keyed stream; 4) however, a keyed stream can connect to another keyed stream. Please correct me if any of these four items is wrong. – YuFeng Shen May 27 '18 at 17:20
  • I ran the test and found that I can connect() a broadcast stream and a keyed stream successfully. I have added the test code to the original question; please check it. – YuFeng Shen May 28 '18 at 14:45
  • Yes, I was mistaken. A broadcast stream and a keyed stream can be connected. What can't be done is to then use keyed state in something like a RichCoFlatMapFunction to store the dynamic rules. – David Anderson May 28 '18 at 15:46
  • Can you kindly help to check this related question https://stackoverflow.com/questions/50570605/why-broadcast-state-can-store-the-dynamic-rules-however-broadcast-operator-c? – YuFeng Shen May 28 '18 at 16:45
  • BTW, in the example I gave above, doesn't the HashSet blacklist count as storing the dynamic rules for the keyed stream? – YuFeng Shen May 28 '18 at 17:11
  • The HashSet in that example is not suitable because it will not be checkpointed, and therefore the application will not be fault tolerant, nor can it be rescaled. – David Anderson May 28 '18 at 20:17
  • If that is the case, how about using the HashSet from that example together with CheckpointedFunction? Then it would be fault tolerant and could be rescaled, achieving the same effect as "broadcast state". So this would be an alternative to "broadcast state", don't you think? – YuFeng Shen May 30 '18 at 15:42

Here's a code sample that implements both Flink's original no-argument broadcast() method and the broadcast state newly introduced in Flink 1.5.0: https://gist.github.com/syhily/932e0d1e0f12b3e951236d6c36e5a7ed

As far as I have learned, broadcast state can be used without Flink CEP, as the code linked above shows.

The original no-argument broadcast() method on DataStream returns a plain DataStream, so connecting it yields a ConnectedStreams rather than a BroadcastConnectedStream. This is the original coGroup design scheme. After connecting the metrics stream with the broadcast rule stream, we can use the other transformations defined on ConnectedStreams, such as keyBy, so that elements of both streams with the same key are routed to the same parallel CoProcessFunction instance. The CoProcessFunction can therefore keep its own local storage: the process function can hold a custom data structure in a field instead of a map state accessed through ReadOnlyContext.

Broadcast state is set up by calling broadcast with one or more MapStateDescriptors, which means the broadcast stream can be connected with other streams many times. Each resulting BroadcastConnectedStream can access its own broadcast state through its own MapStateDescriptor in the process function.

I think these are the key differences between broadcast with no arguments and broadcast state.

盛雨帆