
Long-time lurker, first-time poster.

I am currently trying to add execution of Python UDFs in a Java Dataflow pipeline. These UDFs would potentially employ third-party libraries and multiple modules.

I have been reading a lot about Apache Beam multi-language pipelines, but I am becoming less certain that they are the way to achieve this. I'll first summarize my understanding of how multi-language pipelines work (I'm hoping I'm missing or misunderstanding something) before getting to my questions.

Beam Multi-language Pipelines

The main purpose of a multi-language pipeline is to execute transforms written in another language; in my case executing transforms in the Python Beam SDK in a pipeline using the Java Beam SDK.

The Python Beam SDK would be running in an expansion service, and any transforms, standard or custom, would be accessed through a URN (or, for custom transforms, the fully qualified name of the Python transform). I believe this expansion service can be containerized, but I can't find the documentation, so I may be wrong about that.
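For reference, the Beam docs describe starting the standard Python expansion service from a virtual environment along these lines (the port and the permissive name glob are illustrative; the Python SDK version should match the Java SDK version):

```shell
# Install the matching Beam version in a virtualenv, then start the
# standard expansion service. The glob controls which fully qualified
# Python names the service is allowed to expand.
pip install apache-beam==2.43.0
python -m apache_beam.runners.portability.expansion_service_main \
    -p 18088 --fully_qualified_name_glob "*"
```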

Therefore, if I want to execute arbitrary Python code, I would need to either find a high-level transform capable of accepting and running arbitrary code, or write my own expansion service to handle executing the UDFs.
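As a sketch of the first option: `PythonExternalTransform` in the Java SDK's Python extensions module can reference a Python transform by its fully qualified name and route expansion through a locally running service. The transform name and address below are hypothetical placeholders, not a verified setup:

```java
import org.apache.beam.sdk.extensions.python.PythonExternalTransform;
import org.apache.beam.sdk.values.PCollection;

// Assumes an expansion service is already listening on localhost:18088 and
// that "my_package.my_module.MyTransform" (hypothetical) is a Python
// PTransform importable in that service's environment.
PCollection<String> result =
    words.apply(
        PythonExternalTransform
            .<PCollection<String>, PCollection<String>>from(
                "my_package.my_module.MyTransform", "localhost:18088"));
```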

What I've tried

I have tried creating a simple WordCount pipeline in Java and running an expansion service locally in a Python virtual environment. I've been using unit tests to pass in a file and output the word counts, which works fine in pure Java.

However, I am having issues setting up the transforms as shown in the Beam documentation here. I am not able to import PythonExternalTransform and call through to a method running in the expansion service, so I have not successfully executed a multi-language pipeline yet.

For my local pipeline I am using version 2.34.0 of the Apache Beam dependencies. I've tried versions up to 2.43.0 but have not been able to get it to work.

Questions

  1. Are multi-language pipelines the best way to execute Python UDFs in Dataflow? Would I need to write my own expansion service?
  2. If multi-language pipelines are not the best way, what other options are available to me, if any? Should I look further into creating a custom container image?
  • Yes, multi-language pipelines are the best way to achieve this. Can you look into this class, which will allow you to execute an arbitrary element-wise Python function on a PCollection - https://github.com/apache/beam/blob/master/sdks/java/extensions/python/src/main/java/org/apache/beam/sdk/extensions/python/transforms/PythonMap.java – chamikara Jan 19 '23 at 14:04
  • Thank you for your response, @chamikara. I was able to successfully get the SklearnMnistClassification pipeline from the examples to work in Dataflow (https://github.com/apache/beam/blob/master/examples/java/src/main/java/org/apache/beam/examples/multilanguage/SklearnMnistClassification.java). However, I am running into a lot of issues getting the multi-lang pipelines to run locally. The pipelines either throw an error or never terminate. I've got the Python expansion service running in a virtual environment, but I am wondering if I need a Docker image running as well? – Thomas Sluciak Jan 20 '23 at 15:54
  • You just need to have Docker installed. Can you try the guide here which has specific instructions for running locally using DirectRunner ? https://beam.apache.org/documentation/sdks/java-multi-language-pipelines/ – chamikara Jan 21 '23 at 17:04
  • Yeah, I think I know what the issue is. I am not able to use Docker Desktop since we don't have a license. So I am using a workaround with Lima and Dockerd, so I think there's some miscommunication happening. I have established that I am able to run the pipeline locally and it's contacting the Python SDK in my venv. What seems to be happening is that it is not sending anything back to the Java pipeline. I'll update here if I am able to get things working – Thomas Sluciak Jan 23 '23 at 15:42
  • If this got resolved, appreciate if you can add an answer so that others who run into similar situations can benefit from it. – chamikara Jan 24 '23 at 16:54
  • I will close the question, but I wasn't able to get things working locally. I did establish that the Java pipeline was communicating with the Beam SDK in the virtual environment, but it seemed like the Python code was not handing anything back. I do appreciate the help, @chamikara – Thomas Sluciak Jan 25 '23 at 17:07
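For readers landing here: the PythonMap transform suggested in the comments can be applied roughly as follows. This is an unverified sketch; the lambda, coder, and expansion service address are illustrative, and the exact API may vary by Beam version:

```java
import org.apache.beam.sdk.coders.StringUtf8Coder;
import org.apache.beam.sdk.extensions.python.transforms.PythonMap;
import org.apache.beam.sdk.values.PCollection;

// Apply an arbitrary element-wise Python function (given as a source
// string) to a Java PCollection, expanding via a local service.
PCollection<String> upper =
    words.apply(
        PythonMap.<String, String>viaMapFn(
                "lambda x: x.upper()", StringUtf8Coder.of())
            .withExpansionService("localhost:18088"));
```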

0 Answers