I ran into some pretty strange behavior today with Flink 1.17.1 and PyFlink, while setting up a job that uses the Kafka connector together with Python user-defined table functions (UDTFs). I found both a workaround and a solution in the end, and I'm posting here for others to find.
My code worked fine locally, but when I ran it in my Docker Flink setup I hit an error like:
Caused by: org.apache.flink.streaming.runtime.tasks.StreamTaskException: Cannot load user class: org.apache.flink.table.runtime.operators.python.table.PythonTableFunctionOperator
ClassLoader info: URL ClassLoader:
file: '/tmp/tm_172.25.0.5:44065-a2ef4a/blobStorage/job_c5da7b3563558ec66d5e773659c8abe1/blob_p-18b059e5f10b72a375b507c7f72c8ab9931306f9-ae41178318456ae52391027abd82d3de' (valid JAR)
Class not resolvable through given classloader.
After much debugging I narrowed it down to a simple reproducible example (the connector JAR is included in the project folder):
import json
import os

from pyflink.table import (DataTypes, TableEnvironment, EnvironmentSettings)
from pyflink.table.udf import udtf

t_env = TableEnvironment.create(EnvironmentSettings.in_streaming_mode())

# Works with these lines commented out and fails with them uncommented
# kafka_connector_path = os.path.abspath('jars/flink-sql-connector-kafka-1.17.1.jar')
# t_env.get_config().set("pipeline.jars", f"file://{kafka_connector_path}")

# define the source
table = t_env.from_elements(
    elements=[
        (1, '{"name": "Flink", "tel": 123, "addr": {"country": "Germany", "city": "Berlin"}}'),
        (2, '{"name": "hello", "tel": 135, "addr": {"country": "China", "city": "Shanghai"}}'),
        (3, '{"name": "world", "tel": 124, "addr": {"country": "USA", "city": "NewYork"}}'),
        (4, '{"name": "PyFlink", "tel": 32, "addr": {"country": "China", "city": "Hangzhou"}}')
    ],
    schema=['id', 'data'])
# a Python UDTF that parses the JSON payload
@udtf(result_types=[DataTypes.STRING(), DataTypes.INT(), DataTypes.STRING()])
def parse_data(data: str):
    json_data = json.loads(data)
    yield json_data['name'], json_data['tel'], json_data['addr']['country']

t_env.create_temporary_function('parse_data', parse_data)

# execute sql statement
t_env.execute_sql(
    """
    SELECT *
    FROM %s, LATERAL TABLE(parse_data(`data`)) t(name, tel, country)
    """ % table
).print()
If I uncomment the pipeline.jars lines, the job fails when I run it like this:
flink run -py basic.py
However, I discovered that it works if I include the connector from the CLI instead of from the code:
flink run -py basic.py --jarfile jars/flink-sql-connector-kafka-1.17.1.jar
After further digging I found the difference: there are two JARs in the TaskManager's blob storage when it works:
root@58c941f39b15:/opt/flink# ls -lah /tmp/tm_172.25.0.5\:33231-ecc45a/blobStorage/job_2de88f8af859fccdc956e62cc32c4a88/
total 37M
drwxr-xr-x 2 flink flink 4.0K Jul 12 14:24 .
drwxr-xr-x 40 flink flink 4.0K Jul 12 14:24 ..
-rw-r--r-- 1 flink flink 5.4M Jul 12 14:24 blob_p-18b059e5f10b72a375b507c7f72c8ab9931306f9-d01b99c7749267191b1d6da2dfe3a8bc
-rw-r--r-- 1 flink flink 32M Jul 12 14:24 blob_p-275820cb9b5e36c9f3e1e5483e93d0b808fe257e-5f41622e51ea7098f251bcfc3285b1bb
And just one JAR when it doesn't:
root@58c941f39b15:/opt/flink# ls -lah /tmp/tm_172.25.0.5\:33231-ecc45a/blobStorage/job_af184fc448ee510ac4ebbe92c7e7d893/
total 5.4M
drwxr-xr-x 2 flink flink 4.0K Jul 12 14:52 .
drwxr-xr-x 22 flink flink 4.0K Jul 12 14:52 ..
-rw-r--r-- 1 flink flink 5.4M Jul 12 14:52 blob_p-18b059e5f10b72a375b507c7f72c8ab9931306f9-1fc054bb0990f0378d86166b1edd63ea
I figured out that the 5.4M JAR is the Kafka connector from my project, while the 32M one is flink-python-1.17.1.jar, which I believe is supposed to get implicitly included as a dependency.
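In case it's useful to anyone debugging something similar: the blob files are ordinary JARs, so you can identify them by listing their contents (the blob file name below is copied from the listing above; substitute your own):

# list the contents of a blob to see which artifact it is; the package
# names of the classes inside give it away
unzip -l blob_p-275820cb9b5e36c9f3e1e5483e93d0b808fe257e-5f41622e51ea7098f251bcfc3285b1bb | head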
Workaround
One workaround is to copy the flink-python JAR into a subfolder of my project and specify both dependencies in the code, which works:
kafka_connector_path = os.path.abspath('jars/flink-sql-connector-kafka-1.17.1.jar')
flink_python_path = os.path.abspath('jars/flink-python-1.17.1.jar')
t_env.get_config().set("pipeline.jars", f"file://{kafka_connector_path};file://{flink_python_path}")
I haven't seen this flink-python JAR mentioned in the docs, I think because it normally gets uploaded to the task managers implicitly when running flink run --python. However, it seems that adding other JAR dependencies through pipeline.jars stops it from being auto-uploaded.
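As a variation on the workaround, the JAR doesn't have to be copied into the project by hand: a pip install of PyFlink usually ships it inside the package itself. Here's a sketch that searches the package directory for it; the glob search and the assumption about where the JAR lives are mine, so verify against your own install:

import glob
import os

import pyflink
from pyflink.table import TableEnvironment, EnvironmentSettings

t_env = TableEnvironment.create(EnvironmentSettings.in_streaming_mode())

# Assumption: the pip-installed pyflink package ships flink-python-*.jar
# somewhere under its own directory tree; search for it instead of copying it
pyflink_dir = os.path.dirname(os.path.abspath(pyflink.__file__))
jars = glob.glob(os.path.join(pyflink_dir, '**', 'flink-python-*.jar'), recursive=True)
assert jars, "no bundled flink-python JAR found; copy it into the project instead"

kafka_connector_path = os.path.abspath('jars/flink-sql-connector-kafka-1.17.1.jar')
t_env.get_config().set("pipeline.jars", f"file://{kafka_connector_path};file://{jars[0]}")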
Feel free to answer/comment to add clarity on this behavior.
Solution
This might be the recommended solution: copy the flink-python JAR into the flink/lib folder of each JobManager and TaskManager node. Make sure to restart the Flink processes so they pick up the new JAR dependency:
cd flink
cp opt/flink-python-1.17.1.jar lib/
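Since my setup runs Flink in Docker, the same fix can be baked into a custom image. A minimal sketch, assuming the official flink:1.17.1 image, where the distribution lives under /opt/flink; the image tag is just an example:

# build an image with flink-python already on the classpath, for both
# jobmanager and taskmanager containers
docker build -t flink-with-python-jar - <<'EOF'
FROM flink:1.17.1
RUN cp /opt/flink/opt/flink-python-1.17.1.jar /opt/flink/lib/
EOF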