
I have six PCollections of KV elements. I want to run a ParDo on another PCollection, passing the six PCollections combined as a side input.

I tried passing all six PCollections as separate side inputs, as below:

PCollection<TableRow> outputRows = myCollection.apply(
    ParDo.withSideInputs(inp1, inp2, ...)
        .of(new DoFn<KV<String, String>, TableRow>() {
            ...
        }));

But it throws an OutOfMemoryError because the heap space is exceeded. Please advise on how to combine the PCollections so they can be given as input to a transform on another PCollection.

Sathish Jayaram

1 Answer


Cloud Dataflow provides several ways of joining.

PCollections used as side inputs are broadcast to every worker and loaded into memory. This sounds like what you're doing, and it would explain the OutOfMemoryError if the combined size of the six PCollections exceeds the worker's heap.
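For reference, here is a minimal sketch of that side-input pattern, assuming the Dataflow Java SDK 1.x and KV<String, String> elements for illustration; view1, inp1, and myCollection are placeholders. Each side input must first be materialized as a PCollectionView, and the whole view is held in each worker's memory:

// Assumes Dataflow Java SDK 1.x; view1 / inp1 / myCollection are placeholders.
final PCollectionView<Map<String, String>> view1 =
    inp1.apply(View.<String, String>asMap());

PCollection<TableRow> outputRows = myCollection.apply(
    ParDo.withSideInputs(view1)
         .of(new DoFn<KV<String, String>, TableRow>() {
           @Override
           public void processElement(ProcessContext c) {
             // The entire map is resident in worker memory.
             Map<String, String> lookup = c.sideInput(view1);
             // ... e.g. lookup.get(c.element().getKey()) ...
           }
         }));

With six such views, all six maps must fit in the heap at the same time.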

You mentioned that the values are keyed -- another option is to use a CoGroupByKey.

To do this, you would create a KeyedPCollectionTuple with all of your PCollections and apply a CoGroupByKey; the result contains, for each key, all of the values from every input. Using a CoGroupByKey like this shuffles your data so that the ParDo consuming the result for a given key only needs to read the values associated with that key:

PCollection<KV<K, V1>> inp1 = ...;
PCollection<KV<K, V2>> inp2 = ...;

final TupleTag<V1> t1 = new TupleTag<>();
final TupleTag<V2> t2 = new TupleTag<>();
PCollection<KV<K, CoGbkResult>> coGbkResultCollection =
  KeyedPCollectionTuple.of(t1, inp1)
                       .and(t2, inp2)
                       .apply(CoGroupByKey.<K>create());

PCollection<T> finalResultCollection =
  coGbkResultCollection.apply(ParDo.of(
    new DoFn<KV<K, CoGbkResult>, T>() {
      @Override
      public void processElement(ProcessContext c) {
        KV<K, CoGbkResult> e = c.element();
        // All V1 values that share this key.
        Iterable<V1> pt1Vals = e.getValue().getAll(t1);
        // The single V2 value for this key.
        V2 pt2Val = e.getValue().getOnly(t2);
        ... Do Something ...
        c.output(...some T...);
      }
    }));
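With six keyed PCollections, the tuple simply grows by chaining more .and(...) calls. A minimal sketch extending the example above, where t3 through t6 and inp3 through inp6 are hypothetical tags and collections declared like t1 and inp1:

// t3..t6 / inp3..inp6 are hypothetical, following the same pattern as above.
PCollection<KV<K, CoGbkResult>> joined =
  KeyedPCollectionTuple.of(t1, inp1)
                       .and(t2, inp2)
                       .and(t3, inp3)
                       .and(t4, inp4)
                       .and(t5, inp5)
                       .and(t6, inp6)
                       .apply(CoGroupByKey.<K>create());

Each worker then only holds the values for the keys it is processing, rather than all six collections at once, which avoids the OutOfMemoryError.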
Ben Chambers