
I'm working on a Dataflow pipeline that reads 1000 files (50 MB each) from GCS and performs some computations on the rows across all files. Each file is a CSV with the same structure, just with different numbers in it, and I'm computing the average of each cell over all files.

The pipeline looks like this (Python):

import apache_beam as beam
from apache_beam.io.fileio import MatchFiles, ReadMatches

additional_side_inputs = {'key1': 'value1', 'key2': 'value2'}  # etc.

(p
 | 'Collect CSV files' >> MatchFiles(input_dir + "*.csv")
 | 'Read files' >> ReadMatches()
 | 'Parse contents' >> beam.ParDo(FileToRowsFn(), additional_side_inputs)
 | 'Compute average' >> beam.CombinePerKey(AverageCalculatorFn()))

The FileToRowsFn class looks like this (see below, some details omitted). The row_id is the first column and is a unique key for each row; it appears exactly once in each file, which is what lets me compute the average over all the files. An additional value is provided as a side input to the transform; it is not shown inside the method body below, but it is used by the real implementation. This value is a dictionary created outside of the pipeline. I mention it here in case it might be the reason for the lack of parallelization.

import csv
from io import TextIOWrapper
import apache_beam as beam

class FileToRowsFn(beam.DoFn):
  def process(self, file_element, additional_side_inputs):
    # file_element is a ReadableFile yielded by ReadMatches; emit one (row_id, values) pair per CSV row.
    with file_element.open() as csv_file:
      for row_id, *values in csv.reader(TextIOWrapper(csv_file, encoding='utf-8')):
        yield row_id, values

The AverageCalculatorFn is a typical beam.CombineFn with an accumulator; it computes the average of each cell of a given row over all rows with the same row_id across all files.
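
(For reference, here is a minimal sketch of what such a CombineFn could look like. The class name matches the pipeline above, but the accumulator layout and parsing details are illustrative assumptions, not my actual implementation.)

import apache_beam as beam

class AverageCalculatorFn(beam.CombineFn):
  # Accumulator: (per-cell sums, number of rows seen). Sketch only.
  def create_accumulator(self):
    return [], 0

  def add_input(self, accumulator, values):
    sums, count = accumulator
    floats = [float(v) for v in values]
    if not sums:
      sums = [0.0] * len(floats)
    return [s + v for s, v in zip(sums, floats)], count + 1

  def merge_accumulators(self, accumulators):
    merged_sums, merged_count = [], 0
    for sums, count in accumulators:
      if count == 0:
        continue
      merged_sums = list(sums) if not merged_sums else [a + b for a, b in zip(merged_sums, sums)]
      merged_count += count
    return merged_sums, merged_count

  def extract_output(self, accumulator):
    sums, count = accumulator
    return [s / count for s in sums] if count else []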

All this works fine, but there's a problem with performance and throughput: it takes more than 60 hours to execute this pipeline. From the monitoring console, I notice that the files are read sequentially (one file every 2 minutes). I understand that reading a file may take 2 minutes (each file is 50 MB), but I don't understand why Dataflow doesn't assign more workers to read multiple files in parallel. The CPU stays at ~2-3% because most of the time is spent in file I/O, and the number of workers doesn't exceed 2 (although no limit is set).

The output of ReadMatches is 1000 file records, so why doesn't Dataflow create many FileToRowsFn instances and dispatch them to new workers, each one handling a single file?

Is there a way to enforce this behavior?

Gaetan
  • Your code above yields a PCollection of object names for 'Read Files'; it may be that those are getting put together in a bundle on a single worker. Basically, the pipeline may not understand that each of those entries in the input collection is going to yield thousands in the output. I'd recommend trying ReadFromText (https://beam.apache.org/releases/pydoc/2.31.0/apache_beam.io.textio.html) which should be able to break individual files into chunks as needed. – Jeff Klukas Aug 18 '21 at 13:18

1 Answer


This is probably because all your steps get fused into a single step by the Dataflow runner.

For such a fused stage to parallelize, its first step needs to be parallelizable. In your case that first step is the glob expansion done by MatchFiles, which is not parallelizable.

To make your pipeline parallelizable, you can try to break fusion. This can be done by adding a Reshuffle transform as the consumer of one of the steps that produce many elements.

For example,

import apache_beam as beam
from apache_beam import Reshuffle
from apache_beam.io.fileio import MatchFiles, ReadMatches

additional_side_inputs = {'key1': 'value1', 'key2': 'value2'}  # etc.

(p
 | 'Collect CSV files' >> MatchFiles(input_dir + "*.csv")
 | 'Read files' >> ReadMatches()
 | 'Reshuffle' >> Reshuffle()  # breaks fusion so the file records get redistributed across workers
 | 'Parse contents' >> beam.ParDo(FileToRowsFn(), additional_side_inputs)
 | 'Compute average' >> beam.CombinePerKey(AverageCalculatorFn()))

You should not have to do this if you use one of the standard sources available in Beam, such as textio.ReadFromText(), to read the data. (Unfortunately there is no dedicated CSV source, but ReadFromText supports skipping header lines.)
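
As a rough sketch of that alternative (assuming each row carries its row_id in the first column, so the originating file does not matter; parse_csv_line is an illustrative helper, and the side-input dictionary from the original pipeline is omitted for brevity):

import csv
import apache_beam as beam
from apache_beam.io.textio import ReadFromText

def parse_csv_line(line):
  # Split a single CSV line into (row_id, values).
  row_id, *values = next(csv.reader([line]))
  return row_id, values

(p
 | 'Read CSV lines' >> ReadFromText(input_dir + "*.csv")  # pass skip_header_lines=1 if the files have a header row
 | 'Parse lines' >> beam.Map(parse_csv_line)
 | 'Compute average' >> beam.CombinePerKey(AverageCalculatorFn()))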

See the Dataflow documentation on the fusion optimization and on preventing fusion for more information.

chamikara