I've written a Python Dataflow job to process some data:
(
    pipeline
    | "read" >> beam.io.ReadFromText(known_args.input)  # 9 min 44 sec
    | "parse_line" >> beam.Map(parse_line)  # 4 min 55 sec
    | "add_key" >> beam.Map(add_key)  # 48 sec
    | "group_by_key" >> beam.GroupByKey()  # 11 min 56 sec
    | "map_values" >> beam.ParDo(MapValuesFn())  # 11 min 40 sec
    | "json_encode" >> beam.Map(json.dumps)  # 26 sec
    | "output" >> beam.io.textio.WriteToText(known_args.output)  # 22 sec
)
(I have removed the business-specific naming and logic; a stubbed but runnable version of the script is shown below for context.)
The input is a 1.36 GiB gzip-compressed CSV file, yet the job takes 37 min 34 sec to run. (I am using Dataflow because I expect the input to grow rapidly in size.)
How can I identify the bottlenecks in the pipeline and speed up its execution? None of the individual functions is computationally expensive.
Autoscaling information from the Dataflow console:
12:00:35 PM Starting a pool of 1 workers.
12:05:02 PM Autoscaling: Raised the number of workers to 2 based on the rate of progress in the currently running step(s).
12:10:02 PM Autoscaling: Reduced the number of workers to 1 based on the rate of progress in the currently running step(s).
12:29:09 PM Autoscaling: Raised the number of workers to 3 based on the rate of progress in the currently running step(s).
12:35:10 PM Stopping worker pool.