I've written a Python Dataflow job to process some data:
(
    pipeline
    | "read" >> beam.io.ReadFromText(known_args.input)  # 9 min 44 sec
    | "parse_line" >> beam.Map(parse_line)  # 4 min 55 sec
    | "add_key" >> beam.Map(add_key)  # 48 sec
    | "group_by_key" >> beam.GroupByKey()  # 11 min 56 sec
    | "map_values" >> beam.ParDo(MapValuesFn())  # 11 min 40 sec
    | "json_encode" >> beam.Map(json.dumps)  # 26 sec
    | "output" >> beam.io.textio.WriteToText(known_args.output)  # 22 sec
)
(I have removed the business-specific naming and logic; a stubbed but runnable version of the script is shown below for context.)
The input is a 1.36 GiB gzip-compressed CSV file, yet the job takes 37 min 34 sec to run. (I am using Dataflow because I expect the input to grow rapidly in size.)
How can I identify the bottlenecks in the pipeline and speed up its execution? None of the individual functions is computationally expensive.
Autoscaling information from the Dataflow console:
12:00:35 PM Starting a pool of 1 workers.
12:05:02 PM Autoscaling: Raised the number of workers to 2 based on the rate of progress in the currently running step(s).
12:10:02 PM Autoscaling: Reduced the number of workers to 1 based on the rate of progress in the currently running step(s).
12:29:09 PM Autoscaling: Raised the number of workers to 3 based on the rate of progress in the currently running step(s).
12:35:10 PM Stopping worker pool.