
In my Flink batch program (DataSet / Table API), I am reading multiple files, which produces different flows; I do some processing and save the results with an output format.
Since Flink uses the dataflow model and my flows are not really related, they are processed in parallel.

Yet I want Flink to respect the order of my output operations at least, because I want flow1 to be saved before flow2.

For example, I have something like:

Table table1 = tableEnv.fromTableSource(new MyTableSource1());
DataSet<Obj1> dataSet1 = tableEnv.toDataSet(table1.select("toto", ..), Obj1.class);
dataSet1.output(new WateverdatasinkSQL());

Table table2 = tableEnv.fromTableSource(new MyTableSource2());
DataSet<Obj2> dataSet2 = tableEnv.toDataSet(table2.select("foo", "bar", ..), Obj2.class);
dataSet2.output(new WateverdatasinkSQL());

I want Flink to wait for dataSet1 to be saved before continuing...
How can I run these as successive operations?
I have already looked at the execution modes, but they do not provide this.

Regards, Bastien


1 Answer


The easiest solution is to separate both flows into individual jobs and execute them one after the other.

Table table1 = tableEnv.fromTableSource(new MyTableSource1());
DataSet<Obj1> dataSet1 = tableEnv.toDataSet(table1.select("toto", ..), Obj1.class);
dataSet1.output(new WateverdatasinkSQL());
env.execute(); // runs the first job and blocks until it has finished

Table table2 = tableEnv.fromTableSource(new MyTableSource2());
DataSet<Obj2> dataSet2 = tableEnv.toDataSet(table2.select("foo", "bar", ..), Obj2.class);
dataSet2.output(new WateverdatasinkSQL());
env.execute(); // submitted only after the first job has completed
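This works because in the DataSet API env.execute() is a blocking call: it submits the plan built so far, waits for the job to finish, and then lets you build a fresh plan. A minimal self-contained sketch of the pattern (the class name, sample data, and output paths below are placeholders, not from the answer):

import org.apache.flink.api.java.ExecutionEnvironment;

public class SequentialJobs {

    public static void main(String[] args) throws Exception {
        ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();

        // The sinks registered so far form the first job.
        env.fromElements("flow1-a", "flow1-b")
           .writeAsText("/tmp/flow1");   // placeholder output path
        env.execute("job 1");            // blocks until job 1 has finished

        // execute() cleared the previous plan, so these sinks form a second job.
        env.fromElements("flow2-a", "flow2-b")
           .writeAsText("/tmp/flow2");   // placeholder output path
        env.execute("job 2");            // submitted only after job 1 completed
    }
}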
Fabian Hueske
  • God, second time you've saved me today Fabian, thanks a lot :) I saw there is a sort of breakpoint with iterations: Flink waits for a superstep to finish before launching the next one. This is not related to my use case, but could we have a sort of superstep for grouping and sequencing flows in batch? – Eldinea Aug 09 '18 at 14:42
  • There are some operations that inherently block a dataflow, such as a full sort. However, in order to sync this with another dataflow, they would need to be connected, which would result in rather messy code. I'd just run these jobs one after the other. – Fabian Hueske Aug 09 '18 at 15:00
  • Ok thanks, this is working like a charm by the way :) – Eldinea Aug 09 '18 at 15:08
  • Hi Fabian, what would you do if my source2 is actually my source1? With this solution, my source will be read twice, since these are independent jobs (with 2 distinct execution plans?). How can I link the end of the first output with the start of the second one? For example, I need to insert into two databases in a row, but the first insert needs to finish before the second one starts. – Eldinea Nov 06 '18 at 15:32
  • Hello @Fabian, this is not working when deployed on a cluster, as Flink executes only the first jobGraph... – Eldinea Nov 23 '18 at 15:42
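Regarding the follow-up question in the comments about source2 being the same as source1: one common workaround is to persist the shared input from the first job and read that copy back in the second job, so the original source is read only once. A minimal sketch under that assumption; all paths are hypothetical and this is not from the answer above:

import org.apache.flink.api.java.DataSet;
import org.apache.flink.api.java.ExecutionEnvironment;

public class SharedSourceJobs {

    public static void main(String[] args) throws Exception {
        ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();

        // Job 1: read the shared source once, write flow1's output,
        // and persist a copy of the input for the second job.
        DataSet<String> source = env.readTextFile("/data/shared-input"); // hypothetical path
        source.writeAsText("/out/flow1");              // flow1's sink
        source.writeAsText("/tmp/shared-input-copy");  // hand-over data for job 2
        env.execute("job 1");

        // Job 2: read the persisted copy instead of the original source.
        env.readTextFile("/tmp/shared-input-copy")
           .writeAsText("/out/flow2");                 // flow2's sink
        env.execute("job 2");
    }
}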