Long-time lurker, first-time poster.
I am currently trying to add execution of Python UDFs in a Java Dataflow pipeline. These UDFs would potentially employ third-party libraries and multiple modules.
I have been reading a lot about Apache Beam multi-language pipelines, but I am becoming less certain that they are the right way to achieve this. I'll first summarize my understanding of how multi-language pipelines work (I'm hoping I'm missing or misunderstanding something) before getting to my questions.
Beam Multi-language Pipelines
The main purpose of a multi-language pipeline is to execute transforms written in another language; in my case, executing transforms from the Python Beam SDK in a pipeline written with the Java Beam SDK.
The Python Beam SDK would run in an expansion service, and any transform, standard or custom, would be accessed through its URN (path to the module). I believe this expansion service can be containerized, but I can't find the documentation, so I may be wrong about that.
Therefore, if I want to execute arbitrary Python code, I would need to either find a high-level transform capable of accepting and running arbitrary code, or write my own expansion service to handle executing the UDFs.
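To illustrate what I mean by "path to the module": assuming it is a fully qualified Python name such as `package.module.ClassName`, resolving it comes down to an import plus attribute lookups. The sketch below is plain Python using only the standard library; `resolve_fully_qualified_name` is a hypothetical helper of my own and not Beam's actual expansion-service code.

```python
import importlib
from typing import Any


def resolve_fully_qualified_name(name: str) -> Any:
    """Import the longest importable module prefix of `name`,
    then walk the remaining parts as attributes (hypothetical helper)."""
    parts = name.split(".")
    for split in range(len(parts), 0, -1):
        module_name = ".".join(parts[:split])
        try:
            obj = importlib.import_module(module_name)
        except ImportError:
            # This prefix is not a module; try a shorter one.
            continue
        for attribute in parts[split:]:
            obj = getattr(obj, attribute)
        return obj
    raise ImportError(f"could not resolve {name!r}")


# Standard-library names stand in for a transform's module path here.
counter_cls = resolve_fully_qualified_name("collections.Counter")
dumps = resolve_fully_qualified_name("json.dumps")
```

If this is roughly how transforms are located, then any custom module would have to be importable in the environment where the expansion service runs.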
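To make "accepting and running arbitrary code" concrete, here is a minimal sketch of the Python-side logic I have in mind, outside of Beam entirely. It assumes the UDF arrives as source text plus a function name; `load_udf`, `apply_udf`, and `count_words` are hypothetical names of my own and not part of any Beam API.

```python
from typing import Any, Callable, Iterable, List


def load_udf(source: str, function_name: str) -> Callable[[Any], Any]:
    """Compile UDF source text and return the named function (hypothetical helper)."""
    namespace: dict = {}
    # exec runs arbitrary code, so the source must come from a trusted author.
    exec(compile(source, "<udf>", "exec"), namespace)
    udf = namespace.get(function_name)
    if not callable(udf):
        raise ValueError(f"{function_name!r} is not a callable defined by the UDF source")
    return udf


def apply_udf(udf: Callable[[Any], Any], elements: Iterable[Any]) -> List[Any]:
    """Apply the UDF to each element, as a per-element map step would."""
    return [udf(element) for element in elements]


# Example UDF: the imports it needs live inside the UDF source itself.
UDF_SOURCE = """
import re

def count_words(line):
    return len(re.findall(r"[A-Za-z']+", line))
"""

udf = load_udf(UDF_SOURCE, "count_words")
word_counts = apply_udf(udf, ["to be or not to be", "hello"])
```

In a real pipeline, something would still have to wrap this logic in a Python transform and make any third-party libraries the UDF imports available on the workers, which is the part I am unsure how to do.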
What I've tried
I have tried creating a simple WordCount pipeline in Java and running an expansion service locally in a Python virtual environment. I've been using unit tests to pass in a file and output the word counts, which works fine when the pipeline is Java-only.
However, I am having issues setting up the transforms as shown in the Beam documentation here. I am not able to import PythonExternalTransform and call a method running in the expansion service, so I have not successfully executed a multi-language pipeline yet.
For my local pipeline I am using 2.34.0 for the Apache Beam dependencies. I've also tried versions up to 2.43.0, but have not been able to get it to work.
Questions
- Are multi-language pipelines the best way to achieve executing Python UDFs in Dataflow? Would I need to write my own expansion service?
- If multi-language pipelines are not the best way, what other options are available to me, if any? Should I look further into creating a custom container image?