What is the difference between a JoinFunction and a CoGroupFunction in Apache Flink? How do semantics and execution differ?

Fabian Hueske

Jary zhen
1 Answer
Both the Join and CoGroup transformations join two inputs on key fields. The difference is how the user functions are called:
- The Join transformation calls the `JoinFunction` with pairs of matching records from both inputs that have the same values for the key fields. This behavior is very similar to an equality inner join.
- The CoGroup transformation calls the `CoGroupFunction` with iterators over all records of both inputs that have the same values for the key fields. If an input has no records for a certain key value, an empty iterator is passed. The CoGroup transformation can be used, among other things, for inner and outer equality joins. It is hence more generic than the Join transformation.
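To make the calling conventions concrete, here is a minimal sketch in plain Java that only simulates the two behaviors. It does not use the Flink API; the class `JoinVsCoGroupSketch` and its helper methods are hypothetical names made up for illustration, and the call order shown is just what this simulation produces, not something Flink guarantees.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeSet;

// Hypothetical simulation of Join vs. CoGroup calling semantics (not Flink code).
public class JoinVsCoGroupSketch {

    /** Groups (key, value) records by key, keeping arrival order within each key. */
    static Map<String, List<String>> groupByKey(List<Map.Entry<String, String>> input) {
        Map<String, List<String>> groups = new LinkedHashMap<>();
        for (Map.Entry<String, String> rec : input) {
            groups.computeIfAbsent(rec.getKey(), k -> new ArrayList<>()).add(rec.getValue());
        }
        return groups;
    }

    /** Join semantics: one call per matching pair; unmatched keys cause no call. */
    static List<String> joinCalls(List<Map.Entry<String, String>> left,
                                  List<Map.Entry<String, String>> right) {
        Map<String, List<String>> l = groupByKey(left);
        Map<String, List<String>> r = groupByKey(right);
        List<String> calls = new ArrayList<>();
        for (Map.Entry<String, List<String>> group : l.entrySet()) {
            List<String> matches = r.getOrDefault(group.getKey(), List.of());
            for (String lv : group.getValue()) {
                for (String rv : matches) {
                    calls.add("join(" + lv + ", " + rv + ")");
                }
            }
        }
        return calls;
    }

    /** CoGroup semantics: one call per key seen on either side; a side may be empty. */
    static List<String> coGroupCalls(List<Map.Entry<String, String>> left,
                                     List<Map.Entry<String, String>> right) {
        Map<String, List<String>> l = groupByKey(left);
        Map<String, List<String>> r = groupByKey(right);
        TreeSet<String> keys = new TreeSet<>(l.keySet());
        keys.addAll(r.keySet());
        List<String> calls = new ArrayList<>();
        for (String key : keys) {
            calls.add("coGroup(" + l.getOrDefault(key, List.of())
                    + ", " + r.getOrDefault(key, List.of()) + ")");
        }
        return calls;
    }

    public static void main(String[] args) {
        List<Map.Entry<String, String>> in1 = List.of(
                Map.entry("id1", "a1"), Map.entry("id1", "a2"),
                Map.entry("id1", "a3"), Map.entry("id2", "a4"));
        List<Map.Entry<String, String>> in2 = List.of(
                Map.entry("id1", "b1"), Map.entry("id1", "b2"));
        joinCalls(in1, in2).forEach(System.out::println);    // six pair-wise calls for id1
        coGroupCalls(in1, in2).forEach(System.out::println); // one call each for id1 and id2
    }
}
```

In this simulation the record with key `id2` never reaches the join function, while the co-group function still sees it together with an empty group from the other input.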
Looking at the execution strategies, Join can be executed using sort-based and hash-based join strategies, whereas CoGroup is always executed using sort-based strategies. Hence, joins are often more efficient than cogroups and should be preferred if possible.

Fabian Hueske
- If I have two IN1 records and one IN2 record that fall into a window, how many times will the JoinFunction be called, and with which arguments? – fixxer Aug 29 '18 at 08:32
- The `JoinFunction` is called once for each pair of the cross product. In your case, that's two times: for `(IN1_1, IN2_1)` and `(IN1_2, IN2_1)`. – Fabian Hueske Aug 29 '18 at 08:42
- Suppose a window contains the following elements from two streams: [(id1, t1v1), (id1, t1v2), (id1, t1v3)] in stream1 and [(id1, t2v1), (id1, t2v2)] in stream2. Then coGroup will be called with Iterator[t1v1, t1v2, t1v3] and Iterator[t2v1, t2v2] for id1, while the JoinFunction will be called 6 times, once with each pair in the Cartesian product of those values, i.e. (t1v1, t2v1), (t1v2, t2v1), (t1v3, t2v1), (t1v1, t2v2), (t1v2, t2v2), (t1v3, t2v2). Is that understanding correct? @FabianHueske – Gaurav Kumar Nov 27 '19 at 08:51
- When using joins, how can I capture elements without any matching key? – Nilesh Jun 11 '20 at 12:51
- If you use the DataSet API, you can use Outer Joins. Not sure if they are supported by the DataStream API yet. If not, you can fall back to CoGroup. – Fabian Hueske Jun 11 '20 at 13:16
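The CoGroup fallback for unmatched records can be sketched as follows. This is plain Java, not Flink code: the class `OuterJoinViaCoGroupSketch` and the method `leftOuterCoGroup` are hypothetical, and a `List<String>` stands in for the output collector. The method handles the two groups for a single key, which is roughly the logic one might place in a co-group function body to get left-outer-join behavior.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: left outer join logic over the two groups of one key.
public class OuterJoinViaCoGroupSketch {

    /**
     * Emits every (left, right) pair for one key. If the right group is empty,
     * each left record is emitted once, padded with null.
     */
    static void leftOuterCoGroup(Iterable<String> first, Iterable<String> second, List<String> out) {
        // Buffer the right side so it can be traversed once per left record,
        // in case the iterable only supports a single pass (an assumption made here).
        List<String> rightBuffer = new ArrayList<>();
        for (String r : second) {
            rightBuffer.add(r);
        }
        for (String l : first) {
            if (rightBuffer.isEmpty()) {
                out.add("(" + l + ", null)"); // no match: keep the left record
            } else {
                for (String r : rightBuffer) {
                    out.add("(" + l + ", " + r + ")");
                }
            }
        }
    }

    public static void main(String[] args) {
        List<String> out = new ArrayList<>();
        leftOuterCoGroup(List.of("a1", "a2"), List.of("b1"), out); // key with matches
        leftOuterCoGroup(List.of("c1"), List.of(), out);           // key without a match
        System.out.println(out);
    }
}
```

A full outer join would additionally emit right records padded with null when the first group is empty.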