Our pipeline is built on the Apache Beam Go SDK. I'm trying to profile CPU usage on all workers by setting the flag --cpu_profiling=gs://gs_location (see https://github.com/apache/beam/blob/master/sdks/go/pkg/beam/runners/dataflow/dataflow.go).

The job finished after using 16.636 vCPU hr, with a maximum of 104 workers.

As a result, a number of files named "profprocess_bundle-*" were written to the specified GCS location.

Then I downloaded these files, unzipped them all, and visualized the results with pprof (https://github.com/google/pprof).

So here are my questions:

  1. How is the total time in the profiling result collected? The sampled time (1.06 hrs) is way shorter than the vCPU time (16.626 hrs) reported by Dataflow.

  2. What is the number in the file name "profprocess_bundle-*"? I thought it might be the index of a worker, but the largest number exceeds the worker count and the numbers are not contiguous: the maximum is 122, yet there are only 66 files.

Tao Liao

1 Answer

When --cpu_profiling is set, profiling starts when the SDK worker begins processing a bundle and ends when that bundle finishes. A bundle (sometimes also called a work item) is a batch of input elements that passes through a subgraph of your pipeline DAG. A job can contain many bundles, and each profile covers only one of them. That's why the total vCPU time is larger than the sampled period.
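To illustrate what "a profile scoped to one bundle" means, here is a minimal sketch using only Go's standard runtime/pprof package. It is not the Beam SDK's actual code: profileBundle and the toy workload are hypothetical names made up for this example. The point is only that sampling happens between the start and stop calls, so time spent outside a bundle is never recorded.

```go
package main

import (
	"bytes"
	"fmt"
	"runtime/pprof"
)

// profileBundle is a hypothetical helper (not part of Beam). It records a
// CPU profile only while process runs, mimicking a per-bundle profile.
func profileBundle(process func()) ([]byte, error) {
	var buf bytes.Buffer
	if err := pprof.StartCPUProfile(&buf); err != nil {
		return nil, err
	}
	process()
	// StopCPUProfile flushes the collected samples into buf.
	pprof.StopCPUProfile()
	return buf.Bytes(), nil
}

func main() {
	// Three stand-in "bundles"; each gets its own separate profile.
	for id := 1; id <= 3; id++ {
		data, err := profileBundle(func() {
			sum := 0
			for i := 0; i < 1_000_000; i++ {
				sum += i
			}
			_ = sum
		})
		if err != nil {
			fmt.Println("profiling failed:", err)
			return
		}
		fmt.Printf("bundle %d: %d bytes of profile data\n", id, len(data))
	}
}
```

Each call yields an independent profile, which is consistent with seeing one output file per profiled bundle rather than one per worker.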

The number in profprocess_bundle-* is the ID of the bundle that was profiled, not a worker index.
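If you want to check which bundle IDs you actually have profiles for, something like the following sketch may help. It assumes the downloaded files are named exactly "profprocess_bundle-<number>" with no extension, which is inferred from the names quoted in the question; bundleID and sortedIDs are hypothetical helpers, and the file names in main are made up.

```go
package main

import (
	"fmt"
	"sort"
	"strconv"
	"strings"
)

// bundleID extracts the numeric suffix from a name such as
// "profprocess_bundle-122". The naming scheme is an assumption.
func bundleID(name string) (int, bool) {
	const prefix = "profprocess_bundle-"
	if !strings.HasPrefix(name, prefix) {
		return 0, false
	}
	id, err := strconv.Atoi(strings.TrimPrefix(name, prefix))
	if err != nil {
		return 0, false
	}
	return id, true
}

// sortedIDs returns the bundle IDs found in names, in ascending order,
// skipping any name that does not match the assumed scheme.
func sortedIDs(names []string) []int {
	var ids []int
	for _, n := range names {
		if id, ok := bundleID(n); ok {
			ids = append(ids, id)
		}
	}
	sort.Ints(ids)
	return ids
}

func main() {
	// Hypothetical file names, not the ones from the actual job.
	names := []string{
		"profprocess_bundle-7",
		"profprocess_bundle-122",
		"profprocess_bundle-3",
		"notes.txt",
	}
	fmt.Println(sortedIDs(names)) // [3 7 122]
}
```

Because the IDs identify bundles rather than workers, gaps in the sequence and a maximum above the worker count are not surprising by themselves.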

Yichi Zhang
  • Thanks for the help, Yichi! Does that mean the workers are not processing work items most of the time, given the sampled time of 1.06 hrs versus the vCPU time of 16.626 hrs? – Tao Liao Apr 14 '21 at 17:46
  • vCPU time is the number of cores multiplied by how long the job runs. Each profile lasts only for one work item, and the job has consecutive work items scheduled for execution throughout its runtime. The two times you mentioned are largely unrelated. – Yichi Zhang Apr 15 '21 at 18:15