I am running a Dataflow batch job that joins two PCollections on a common key. The two PCollections have millions of rows each: one has 8 million rows and the other has 2 million rows. The job takes more than 4 hours to complete! The join itself uses CoGroupByKey (see the sketch at the end of this post). I have checked the following SO posts on related topics:
- Dataflow Batch Job Stuck in GroupByKey.create()
- Complex join with google dataflow
- How to combine multiple PCollections together and give it as input to a ParDo function
But I did not find any insights on how to handle this kind of large join within Dataflow. I have the following questions:
- Is Dataflow capable of joining two PCollections on a common key for large datasets (millions of rows each)?
- Will BQ be better suited to this kind of join?
- What are the possible solutions to handle this kind of use case with GCP big data stack?
Thanks in advance!
Edit: mention Dataflow, GroupByKey and CoGroupByKey; added a sketch of the join below.
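
For reference, here is a minimal sketch of the join I am running, using CoGroupByKey from the Beam Python SDK. The Create sources, labels, and field names are placeholders for my real inputs:

```python
import apache_beam as beam

with beam.Pipeline() as p:
    # Both sides are keyed by the common join field; the Create sources
    # stand in for my real 8M-row and 2M-row inputs.
    left = p | 'ReadLeft' >> beam.Create([('k1', 'a'), ('k2', 'b')])
    right = p | 'ReadRight' >> beam.Create([('k1', 1), ('k3', 2)])

    joined = (
        {'left': left, 'right': right}
        | 'Join' >> beam.CoGroupByKey()
        # Each element is (key, {'left': [...], 'right': [...]});
        # emit the inner-join pairs.
        | 'Pairs' >> beam.FlatMap(
            lambda kv: [(kv[0], l, r)
                        for l in kv[1]['left']
                        for r in kv[1]['right']])
    )
```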