I ran into some pretty strange behavior today with Flink 1.17.1 and PyFlink, while setting up a job that uses the Kafka connector together with Python user-defined table functions (UDTFs). I found both a workaround and a solution in the end, and I'm posting here for others to find.
My code worked fine locally, but when I ran it in my Docker Flink setup I hit an error like:
Caused by: org.apache.flink.streaming.runtime.tasks.StreamTaskException: Cannot load user class: org.apache.flink.table.runtime.operators.python.table.PythonTableFunctionOperator
ClassLoader info: URL ClassLoader:
file: '/tmp/tm_172.25.0.5:44065-a2ef4a/blobStorage/job_c5da7b3563558ec66d5e773659c8abe1/blob_p-18b059e5f10b72a375b507c7f72c8ab9931306f9-ae41178318456ae52391027abd82d3de' (valid JAR)
Class not resolvable through given classloader.
After much debugging I narrowed it down to a simple reproducible example (the connector JAR is included in the project folder):
import json
import os

from pyflink.table import (DataTypes, TableEnvironment, EnvironmentSettings)
from pyflink.table.udf import udtf

t_env = TableEnvironment.create(EnvironmentSettings.in_streaming_mode())

# Works with these lines commented out and fails with them uncommented
# kafka_connector_path = os.path.abspath('jars/flink-sql-connector-kafka-1.17.1.jar')
# t_env.get_config().set("pipeline.jars", f"file://{kafka_connector_path}")

# define the source
table = t_env.from_elements(
    elements=[
        (1, '{"name": "Flink", "tel": 123, "addr": {"country": "Germany", "city": "Berlin"}}'),
        (2, '{"name": "hello", "tel": 135, "addr": {"country": "China", "city": "Shanghai"}}'),
        (3, '{"name": "world", "tel": 124, "addr": {"country": "USA", "city": "NewYork"}}'),
        (4, '{"name": "PyFlink", "tel": 32, "addr": {"country": "China", "city": "Hangzhou"}}')
    ],
    schema=['id', 'data'])
# a Python UDTF that parses the JSON payload
@udtf(result_types=[DataTypes.STRING(), DataTypes.INT(), DataTypes.STRING()])
def parse_data(data: str):
    json_data = json.loads(data)
    yield json_data['name'], json_data['tel'], json_data['addr']['country']

t_env.create_temporary_function('parse_data', parse_data)

# execute sql statement
t_env.execute_sql(
    """
    SELECT *
    FROM %s, LATERAL TABLE(parse_data(`data`)) t(name, tel, country)
    """ % table
).print()
If I uncomment the pipeline.jars lines, the job fails when I run it like this:
flink run -py basic.py
However, I discovered that it works if I include the connector from the CLI instead of from the code:
flink run -py basic.py --jarfile jars/flink-sql-connector-kafka-1.17.1.jar
After further digging I found the difference: there are two JARs in the TaskManager's blob storage when it works:
root@58c941f39b15:/opt/flink# ls -lah /tmp/tm_172.25.0.5\:33231-ecc45a/blobStorage/job_2de88f8af859fccdc956e62cc32c4a88/
total 37M
drwxr-xr-x 2 flink flink 4.0K Jul 12 14:24 .
drwxr-xr-x 40 flink flink 4.0K Jul 12 14:24 ..
-rw-r--r-- 1 flink flink 5.4M Jul 12 14:24 blob_p-18b059e5f10b72a375b507c7f72c8ab9931306f9-d01b99c7749267191b1d6da2dfe3a8bc
-rw-r--r-- 1 flink flink 32M Jul 12 14:24 blob_p-275820cb9b5e36c9f3e1e5483e93d0b808fe257e-5f41622e51ea7098f251bcfc3285b1bb
And just one JAR when it doesn't:
root@58c941f39b15:/opt/flink# ls -lah /tmp/tm_172.25.0.5\:33231-ecc45a/blobStorage/job_af184fc448ee510ac4ebbe92c7e7d893/
total 5.4M
drwxr-xr-x 2 flink flink 4.0K Jul 12 14:52 .
drwxr-xr-x 22 flink flink 4.0K Jul 12 14:52 ..
-rw-r--r-- 1 flink flink 5.4M Jul 12 14:52 blob_p-18b059e5f10b72a375b507c7f72c8ab9931306f9-1fc054bb0990f0378d86166b1edd63ea
I figured out that the 5.4M JAR is the Kafka connector from my project, while the 32M one is flink-python-1.17.1.jar, which I believe is supposed to get implicitly included as a dependency.
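In case it's useful to anyone debugging something similar: the blob files are ordinary JARs, so you can identify them by listing their contents (the blob file name below is copied from the listing above; substitute your own):

# list the contents of a blob to see which artifact it is; the package
# names of the classes inside give it away
unzip -l blob_p-275820cb9b5e36c9f3e1e5483e93d0b808fe257e-5f41622e51ea7098f251bcfc3285b1bb | head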
Workaround
One workaround is to copy the flink-python JAR into a subfolder of my project and specify both dependencies in the code, which works:
kafka_connector_path = os.path.abspath('jars/flink-sql-connector-kafka-1.17.1.jar')
flink_python_path = os.path.abspath('jars/flink-python-1.17.1.jar')
t_env.get_config().set("pipeline.jars", f"file://{kafka_connector_path};file://{flink_python_path}")
I haven't seen this flink-python JAR mentioned in the docs, I think because it normally gets uploaded to the task managers implicitly when running flink run --python. However, it seems that adding other JAR dependencies through pipeline.jars stops it from being auto-uploaded.
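As a variation on the workaround, the JAR doesn't have to be copied into the project by hand: a pip install of PyFlink usually ships it inside the package itself. Here's a sketch that searches the package directory for it; the glob search and the assumption about where the JAR lives are mine, so verify against your own install:

import glob
import os

import pyflink
from pyflink.table import TableEnvironment, EnvironmentSettings

t_env = TableEnvironment.create(EnvironmentSettings.in_streaming_mode())

# Assumption: the pip-installed pyflink package ships flink-python-*.jar
# somewhere under its own directory tree; search for it instead of copying it
pyflink_dir = os.path.dirname(os.path.abspath(pyflink.__file__))
jars = glob.glob(os.path.join(pyflink_dir, '**', 'flink-python-*.jar'), recursive=True)
assert jars, "no bundled flink-python JAR found; copy it into the project instead"

kafka_connector_path = os.path.abspath('jars/flink-sql-connector-kafka-1.17.1.jar')
t_env.get_config().set("pipeline.jars", f"file://{kafka_connector_path};file://{jars[0]}")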
Feel free to answer/comment to add clarity on this behavior.
Solution
This might be the recommended solution: copy the flink-python JAR into the flink/lib folder of each JobManager and TaskManager node. Make sure to restart the Flink processes so they pick up the new JAR dependency:
cd flink
cp opt/flink-python-1.17.1.jar lib/
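Since my setup runs Flink in Docker, the same fix can be baked into a custom image. A minimal sketch, assuming the official flink:1.17.1 image, where the distribution lives under /opt/flink; the image tag is just an example:

# build an image with flink-python already on the classpath, for both
# jobmanager and taskmanager containers
docker build -t flink-with-python-jar - <<'EOF'
FROM flink:1.17.1
RUN cp /opt/flink/opt/flink-python-1.17.1.jar /opt/flink/lib/
EOF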