Cloud Dataflow provides several ways to do joins. PCollections used as a side input are broadcast to every worker and loaded into memory. This sounds like what you're doing, and it would explain the OOM if the sum of the PCollection sizes is too big to fit.
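For comparison, here is a minimal sketch of that side-input pattern, written against the same pre-Beam Dataflow SDK style as the snippet further down (the names smallInput, mainInput, and sideView are illustrative, not from your pipeline):
PCollection<KV<K, V2>> smallInput = ...;
// The side input is materialized as a Map and broadcast to every worker.
final PCollectionView<Map<K, V2>> sideView =
    smallInput.apply(View.<K, V2>asMap());

PCollection<KV<K, V1>> mainInput = ...;
PCollection<T> joined = mainInput.apply(
    ParDo.withSideInputs(sideView).of(
        new DoFn<KV<K, V1>, T>() {
          @Override
          public void processElement(ProcessContext c) {
            // The whole map is resident in memory on each worker,
            // which is the source of the OOM when it grows too large.
            V2 match = c.sideInput(sideView).get(c.element().getKey());
            // ... join logic, then c.output(...) ...
          }
        }));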
You mentioned that the values are keyed -- another option is to use a CoGroupByKey. To do this, you would create a KeyedPCollectionTuple containing all of your PCollections and apply CoGroupByKey to it, producing a result that holds all the values for each key. A CoGroupByKey shuffles your data around so that the ParDo consuming the result for a given key only needs to read in the associated values:
PCollection<KV<K, V1>> inp1 = ...;
PCollection<KV<K, V2>> inp2 = ...;

// Tuple tags identify each input's values within the joined CoGbkResult.
final TupleTag<V1> t1 = new TupleTag<>();
final TupleTag<V2> t2 = new TupleTag<>();

PCollection<KV<K, CoGbkResult>> coGbkResultCollection =
    KeyedPCollectionTuple.of(t1, inp1)
        .and(t2, inp2)
        .apply(CoGroupByKey.<K>create());

PCollection<T> finalResultCollection =
    coGbkResultCollection.apply(ParDo.of(
        new DoFn<KV<K, CoGbkResult>, T>() {
          @Override
          public void processElement(ProcessContext c) {
            KV<K, CoGbkResult> e = c.element();
            // All V1 values and the single V2 value associated with this key.
            Iterable<V1> pt1Vals = e.getValue().getAll(t1);
            V2 pt2Val = e.getValue().getOnly(t2);
            // ... Do Something ...
            c.output(...some T...);
          }
        }));