
I am running a Dataflow batch job to join two PCollections on a common key. The two PCollections have millions of rows each: one has 8 million rows and the other has 2 million. The job takes more than 4 hours to complete! I have checked SO posts on related topics:

But I did not find any insights on how to handle this kind of large join within Dataflow. I have the following questions:

  1. Is Dataflow capable of joining two PCollections on a common key for large datasets (millions of rows each)?
  2. Will BQ be better suited to this kind of join?
  3. What are the possible solutions to handle this kind of use case with GCP big data stack?

Thanks in advance!

Edit: Mention Dataflow, GroupByKey and CoGroupByKey
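
For reference, here is the general shape of the join (a minimal sketch using Beam 2.x package names; the `String` key and value types are simplified placeholders for our actual schema):

```java
import org.apache.beam.sdk.transforms.join.CoGbkResult;
import org.apache.beam.sdk.transforms.join.CoGroupByKey;
import org.apache.beam.sdk.transforms.join.KeyedPCollectionTuple;
import org.apache.beam.sdk.values.KV;
import org.apache.beam.sdk.values.PCollection;
import org.apache.beam.sdk.values.TupleTag;

// Two keyed inputs to join on a common String key (sketch; sources elided).
PCollection<KV<String, String>> left = ...;   // the ~8 million row side
PCollection<KV<String, String>> right = ...;  // the ~2 million row side

// One TupleTag per input; used later to pull each side out of the CoGbkResult.
final TupleTag<String> leftTag = new TupleTag<>();
final TupleTag<String> rightTag = new TupleTag<>();

// A single CoGroupByKey shuffles both collections and groups them by key.
PCollection<KV<String, CoGbkResult>> joined =
    KeyedPCollectionTuple.of(leftTag, left)
        .and(rightTag, right)
        .apply(CoGroupByKey.<String>create());
```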

  • How are the values in your collections distributed across keys? Could you share a job ID so we can look for any other issues that may be occurring? – Ben Chambers Aug 29 '17 at 17:45
  • Thanks, Ben, for the prompt response. I was wondering if we are having hot-key issues. Here is the job ID for your review: 2017-08-23_12_04_24-3505646281616846076 – Steve Kileen Aug 29 '17 at 18:21
  • Seems like this job took about 1 hour to run, but 3 hours to schedule. Perhaps you were running too many Dataflow jobs at the same time and ran out of quota? https://cloud.google.com/dataflow/quotas The 1-hour run time is also concerning, but it appears that it was because of hot keys: e.g. you had single keys that took up to 40 minutes of processing. Could you share pseudocode of how you perform the join on the CoGbkResult? How many values per key do you typically have in each of the collections? – jkff Aug 29 '17 at 19:33
  • @jkff, is the scheduling info available in the console somewhere? All I found was an elapsed time of 4 hours when looking at the job execution. Here is the pseudocode: ``` PCollection<KV<K, V1>> inp1 = ...; PCollection<KV<K, V2>> inp2 = ...; final TupleTag<V1> t1 = new TupleTag<>(); final TupleTag<V2> t2 = new TupleTag<>(); PCollection<KV<K, CoGbkResult>> coGbkResultCollection = KeyedPCollectionTuple.of(t1, inp1) .and(t2, inp2) .apply(CoGroupByKey.create()); ``` Will share the key-value distribution info soon. – Steve Kileen Aug 29 '17 at 20:39
  • Apologies, I misunderstood the logs - it actually ran for 4 hours. But I see that the first collection, it seems, had 190M rows rather than 2M - yet it still shouldn't take this long. I'm more interested in the pseudocode of what you do with the CoGbkResult (e.g. do you have a nested loop over each tagged iterator being grouped? or do you save one of them into a list or map? etc.). – jkff Aug 29 '17 at 20:48
  • @jkff, we are looping through the CoGbkResult to do some string assignment operations: ` PCollectionTuple result = coGbkResultCollection.apply("Look up product", ParDo.of(new DoFn<KV<K, CoGbkResult>, KV<K, V1>>() { @Override public void processElement(ProcessContext c) throws Exception { KV<K, CoGbkResult> e = c.element(); Iterable<V1> tupleAT = e.getValue().getAll(tupleAvroTrans); for (V1 df : tupleAT) {} } }).withOutputTags(out1, TupleTagList.of(outTag))); ` **How do I email the full code to you?** – Steve Kileen Aug 30 '17 at 20:57
  • OK, to clarify some more: do you have a nested loop inside this `for (V1 df : tupleAT)` iterating over the result of another getAll()? Try changing your code a bit: `List<V1> tupleAT = new ArrayList<>(e.getValue().getAll(tupleAvroTrans))`, and likewise for the other side of the join. It might help a lot. If it doesn't, contact dataflow-feedback@google.com with more details. – jkff Aug 30 '17 at 23:59
  • Thanks @jkff, we do NOT have a nested loop inside the `for (V1 df : tupleAT)` iterating over the result of another getAll(). Will give it a try by changing from Iterable to List and update here if it helps. – Steve Kileen Aug 31 '17 at 21:20
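
Edit 2: To make jkff's suggestion concrete, below is the shape of the change we are trying (a sketch in Beam 2.x style; `MaterializeSideFn` and the type parameters K/V1 are hypothetical stand-ins for our real types). Note that `ArrayList` has no `Iterable`-taking constructor, so the copy is done with a plain loop:

```java
import java.util.ArrayList;
import java.util.List;
import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.transforms.join.CoGbkResult;
import org.apache.beam.sdk.values.KV;
import org.apache.beam.sdk.values.TupleTag;

// Sketch only: K and V1 stand in for the job's actual key and Avro value types.
class MaterializeSideFn<K, V1> extends DoFn<KV<K, CoGbkResult>, KV<K, V1>> {
  private final TupleTag<V1> tupleAvroTrans;

  MaterializeSideFn(TupleTag<V1> tupleAvroTrans) {
    this.tupleAvroTrans = tupleAvroTrans;
  }

  @ProcessElement
  public void processElement(ProcessContext c) {
    // getAll() returns a lazy Iterable that may be re-read from shuffle on
    // every traversal; copy it into memory once before any repeated loops.
    List<V1> tupleAT = new ArrayList<>();
    for (V1 v : c.element().getValue().getAll(tupleAvroTrans)) {
      tupleAT.add(v);
    }

    for (V1 df : tupleAT) {
      // ... the string assignment work from the original DoFn ...
    }
  }
}
```

The `withOutputTags(out1, TupleTagList.of(outTag))` wiring from the original snippet is unchanged; only the iteration over the grouped values is materialized up front.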

0 Answers