I am wondering whether Dataflow can parallelize loading a single, potentially huge file. I know that if, for example, 10 files are loaded, they are loaded in parallel. But what about a single huge file? Does Dataflow split it somehow and load the pieces in parallel too, or is this a bottleneck?

Let's analyze two scenarios. In both, data is loaded from GCS in streaming mode.

  1. Using the Apache Beam class ReadAllFromText.
  2. Using the Apache Beam class GcsIO within a DoFn. I know that ParDo parallelizes a DoFn, but that parallelism only applies when several elements are processed. In the case I am interested in, there is a single element (a single file).
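
To make concrete what "splitting a single file" could mean, here is a minimal, standard-library-only sketch of byte-range splitting. It is only an illustration of the idea, not Dataflow's or Apache Beam's implementation: it reads a local file rather than GCS, uses threads rather than workers, and the names `split_ranges`, `read_lines_in_range`, `read_file_in_parallel` and `demo` are hypothetical. Whether either scenario above does something comparable is exactly what I am asking.

```python
import os
import tempfile
from concurrent.futures import ThreadPoolExecutor


def split_ranges(size, chunk_size):
    """Cut [0, size) into consecutive (start, end) byte ranges."""
    return [(start, min(start + chunk_size, size))
            for start in range(0, size, chunk_size)]


def read_lines_in_range(path, start, end):
    """Return the lines that begin inside [start, end)."""
    lines = []
    with open(path, "rb") as f:
        if start > 0:
            # Step back one byte and discard through the next newline, so a
            # line that began in the previous range is left to that reader.
            f.seek(start - 1)
            f.readline()
        while f.tell() < end:
            line = f.readline()
            if not line:
                break
            # A line that begins before `end` is read to its end, even if
            # that crosses into the next range.
            lines.append(line.rstrip(b"\n").decode("utf-8"))
    return lines


def read_file_in_parallel(path, chunk_size, workers=4):
    """Read one file as independent byte ranges and reassemble in order."""
    ranges = split_ranges(os.path.getsize(path), chunk_size)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        parts = list(pool.map(lambda r: read_lines_in_range(path, *r), ranges))
    return [line for part in parts for line in part]


def demo():
    """Write a small temporary file and read it back in 257-byte ranges."""
    expected = [f"row-{i}" for i in range(1000)]
    with tempfile.NamedTemporaryFile("wb", suffix=".txt", delete=False) as tmp:
        tmp.write(("\n".join(expected) + "\n").encode("utf-8"))
        path = tmp.name
    try:
        return expected, read_file_in_parallel(path, chunk_size=257)
    finally:
        os.remove(path)


if __name__ == "__main__":
    expected, actual = demo()
    print(actual == expected)
```

The point of the sketch is that each range can be read independently, so one file becomes many units of work. In scenario 2, by contrast, a single file name is a single element, so nothing like `split_ranges` happens unless I write it myself.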
Pav3k

0 Answers