
We have Beam data pipelines running on GCP Dataflow, written in both Python and Java. Initially we had some simple and straightforward Python Beam jobs that worked very well, so we recently decided to port more Java Beam jobs to Python. With more complicated jobs, especially ones that require windowing, we noticed that the Python jobs are significantly slower than the Java jobs, ending up using more CPU and memory and costing much more.

some sample python code looks like:

    step1 = (
        read_from_pub_sub
        | "MapKey" >> beam.Map(lambda elem: (elem.data[key], elem))
        | "WindowResults"
        >> beam.WindowInto(
            beam.window.SlidingWindows(360, 90),
            allowed_lateness=args.allowed_lateness,
        )
        | "GroupById" >> beam.GroupByKey()
    )

And Java code is like:

    PCollection<KV<String, Iterable<DataStructure>>> step1 =
        message
            .apply(
                "MapKey",
                MapElements.into(
                        TypeDescriptors.kvs(
                            TypeDescriptors.strings(), TypeDescriptor.of(DataStructure.class)))
                    .via(event -> KV.of(event.key, event)))
            .apply(
                "WindowResults",
                Window.<KV<String, DataStructure>>into(
                        SlidingWindows.of(Duration.standardSeconds(360))
                            .every(Duration.standardSeconds(90)))
                    .withAllowedLateness(Duration.standardSeconds(this.allowedLateness))
                    .discardingFiredPanes())
            .apply("GroupById", GroupByKey.<String, DataStructure>create());

We noticed that Python always uses roughly 3 times more CPU and memory than Java. We ran some experimental tests with just JSON input and JSON output and got the same results. We are not sure whether this is simply because Python is, in general, slower than Java, or because GCP Dataflow executes Beam Python and Java differently. Any similar experiences, tests, and explanations of why this happens are appreciated.

Jie Zhang
  • Have you tried running your code in a local environment or on Apache Spark? Do you see the same performance difference, or a different result? It can also come down to how the code is written and the libraries you are using. Java has the full set of features in Dataflow, and Python has been catching up to those features, but it is definitely less efficient. – Jose Gutierrez Paliza Jan 21 '22 at 20:01

1 Answer


Yes, this is a very normal performance factor between Python and Java. In fact, for many programs the factor can be 10x or much more.

The details of the program can radically change the relative performance. One thing to consider in particular:

If you prefer Python for its concise syntax or library ecosystem, the way to achieve speed is to push the core processing into optimized C libraries or Cython, for example pandas/numpy/etc. If you use Beam's new Pandas-compatible DataFrame API, you get this benefit automatically.
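As a toy illustration of the optimized-C-library point (a standalone sketch, not Beam code — it just assumes NumPy is installed), the same aggregation done element-by-element in interpreted Python versus as one vectorized call:

```python
import time

import numpy as np

data = list(range(1_000_000))

# Pure-Python loop: every element passes through the interpreter.
start = time.perf_counter()
py_sum = sum(x * x for x in data)
py_time = time.perf_counter() - start

# NumPy: the same work happens inside a single vectorized C routine.
arr = np.asarray(data, dtype=np.int64)
start = time.perf_counter()
np_sum = int((arr * arr).sum())
np_time = time.perf_counter() - start

# Both paths compute the identical result; the vectorized path
# typically uses far less CPU time per element.
print(py_sum == np_sum)
print(f"python: {py_time:.4f}s  numpy: {np_time:.4f}s")
```

The same principle is what the DataFrame API exploits: deferred pandas operations execute in compiled code rather than per-element in the Python interpreter.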

Kenn Knowles