
I ran into this problem while trying to create a custom event source. It contains a queue that allows my other process to add items to it, and I expect my CEP pattern to print some debug messages when there is a match.

But there is no match, no matter what I add to the queue. I then noticed that the queue inside mySource.run() is always empty, which means the queue I used to create the mySource instance is not the same as the one inside StreamExecutionEnvironment. If I make the queue static, forcing all instances to share the same queue, everything works as expected.

DummySource.java

    import java.util.Queue;

    import org.apache.flink.streaming.api.functions.source.SourceFunction;

    public class DummySource implements SourceFunction<String> {

        private static final long serialVersionUID = 3978123556403297086L;

        // The workaround: a static queue is shared by every instance, including
        // the copy Flink actually runs, and then everything works as expected.
        // private static Queue<String> queue = new LinkedBlockingQueue<String>();
        private Queue<String> queue;
        private volatile boolean cancel = false;  // volatile: cancel() is called from another thread

        public void setQueue(Queue<String> q) {
            queue = q;
        }

        @Override
        public void run(SourceContext<String> ctx) throws Exception {
            System.out.println("run");
            synchronized (queue) {
                while (!cancel) {
                    if (queue.peek() != null) {
                        String e = queue.poll();
                        if (e.equals("exit")) {
                            cancel();
                        }
                        System.out.println("collect " + e);
                        ctx.collectWithTimestamp(e, System.currentTimeMillis());
                    }
                }
            }
        }

        @Override
        public void cancel() {
            System.out.println("canceled");
            cancel = true;
        }
    }
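
For context, this is roughly how I wire it up. This is a sketch; the class name, variable names, and the producer thread are only illustrative:

    import java.util.Queue;
    import java.util.concurrent.LinkedBlockingQueue;

    import org.apache.flink.streaming.api.datastream.DataStream;
    import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

    public class Job {
        public static void main(String[] args) throws Exception {
            StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

            Queue<String> queue = new LinkedBlockingQueue<String>();
            DummySource mySource = new DummySource();
            mySource.setQueue(queue);

            DataStream<String> stream = env.addSource(mySource);
            // ... CEP pattern on `stream` that should print on a match ...

            // Another thread adds items while the job runs, e.g.:
            // new Thread(() -> queue.add("some-event")).start();

            env.execute();  // blocks; meanwhile mySource.run() only ever sees an empty queue
        }
    }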

So I dug into the source code of StreamExecutionEnvironment. Inside the addSource() method there is a clean() call, which looks like it replaces the function instance with a new one. Its documentation says:

Returns a "closure-cleaned" version of the given function.

Why is that? And why does the function need to be serialized? I have also tried turning off the closure cleaner via getConfig(), but the result is still the same: my queue instance is not the one env is using.
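
For reference, this is how I tried to disable it (assuming disableClosureCleaner() is the right switch):

    StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
    env.getConfig().disableClosureCleaner();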

How do I solve this problem?

Maxi Wu

1 Answer


The clean() method applied to functions in Flink mainly ensures that the function (e.g. a SourceFunction or MapFunction) is serialisable. Flink serialises those functions and distributes them onto task nodes to execute them.
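
To see why the instances differ, here is a minimal plain-Java sketch (no Flink runtime involved; `DummySource` is the class from the question) of what that shipping amounts to: a serialise/deserialise round trip that produces a copy.

    import java.io.ByteArrayInputStream;
    import java.io.ByteArrayOutputStream;
    import java.io.ObjectInputStream;
    import java.io.ObjectOutputStream;
    import java.util.Queue;
    import java.util.concurrent.LinkedBlockingQueue;

    public class SerialisationCopyDemo {
        public static void main(String[] args) throws Exception {
            Queue<String> queue = new LinkedBlockingQueue<String>();
            DummySource source = new DummySource();
            source.setQueue(queue);

            // Simulate shipping the function to a task node.
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            new ObjectOutputStream(bytes).writeObject(source);
            DummySource shipped = (DummySource) new ObjectInputStream(
                    new ByteArrayInputStream(bytes.toByteArray())).readObject();

            System.out.println(shipped == source);  // false: a distinct copy
            queue.add("hello");  // lands in the client-side queue only; the copy's
                                 // queue was rebuilt from bytes and stays empty
        }
    }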

Simple variables in your Flink main code, like an int, can simply be referenced in your function. But for large or non-serialisable ones, it is better to use broadcast variables and rich (source) functions. Please refer to https://cwiki.apache.org/confluence/display/FLINK/Variables+Closures+vs.+Broadcast+Variables
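
As an illustration of the rich-function route, here is a sketch of a RichSourceFunction that resolves its non-serialisable state in open(), which runs on the task node after deserialisation. The static queueFor() registry is my own device and only works when the producer and the task share a JVM (e.g. local execution); in a real cluster you would read from an external system instead:

    import java.util.concurrent.BlockingQueue;
    import java.util.concurrent.ConcurrentHashMap;
    import java.util.concurrent.LinkedBlockingQueue;
    import java.util.concurrent.TimeUnit;

    import org.apache.flink.configuration.Configuration;
    import org.apache.flink.streaming.api.functions.source.RichSourceFunction;

    public class QueueSource extends RichSourceFunction<String> {

        private static final long serialVersionUID = 1L;

        // Shared, JVM-local registry of named queues (illustrative only).
        private static final ConcurrentHashMap<String, BlockingQueue<String>> REGISTRY =
                new ConcurrentHashMap<>();

        public static BlockingQueue<String> queueFor(String name) {
            return REGISTRY.computeIfAbsent(name, k -> new LinkedBlockingQueue<>());
        }

        private final String name;                      // serialisable handle, shipped with the function
        private transient BlockingQueue<String> queue;  // resolved on the task node, not shipped
        private volatile boolean running = true;

        public QueueSource(String name) {
            this.name = name;
        }

        @Override
        public void open(Configuration parameters) {
            // Runs after deserialisation, on the node that executes the task.
            queue = queueFor(name);
        }

        @Override
        public void run(SourceContext<String> ctx) throws Exception {
            while (running) {
                String e = queue.poll(100, TimeUnit.MILLISECONDS);  // block instead of spinning
                if (e != null) {
                    ctx.collectWithTimestamp(e, System.currentTimeMillis());
                }
            }
        }

        @Override
        public void cancel() {
            running = false;
        }
    }

The producer then calls QueueSource.queueFor("my-queue").add(...) with the same name the source was constructed with.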

BrightFlow
  • My Flink program is not parallel, but a broadcast variable does solve my problem. Does "parallel" mean when I use stream.keyBy("")? Then each key will have its own task node? – Maxi Wu Aug 14 '18 at 08:50
  • Probably the `parallel` you mention here means running in a cluster. No matter how you run the Flink job, in standalone mode (running on a local machine with Master and Task nodes) or in cluster mode (Flink Cluster or YARN Cluster), Flink always serialises `functions`, distributes them, and executes them in a Task. – BrightFlow Aug 14 '18 at 23:14
  • After reading the document again, it looks like broadcast is used with DataSet, and the broadcast variable needs to be known before the task is sent to a node. I might need to look into the TaskManager for more complex scenarios. – Maxi Wu Aug 23 '18 at 09:45