I am trying to pass a custom Python class object to a UDF in PySpark. I do not want a new instance created for every row processed, because instantiating the object makes an expensive API call to get a secret key. My thinking is to make the API call once when instantiating the object, and then pass that object to the tasks. Ideally all executors would use the same object, or a copy of it.
I also make use of an external library whose object is not serializable. It's less of a concern if this has to be instantiated multiple times.
The class looks like this:
    class MyClass(object):
        def __init__(self, arg1):
            self.secret = some_api_call(arg1)
            self.third_party_obj = None

        def set_3rd_party_obj(self):
            self.third_party_obj = third_party_lib(self.secret)

        def do_thing(self, val):
            return self.third_party_obj(val)
In PySpark, this is what I am attempting:
    from pyspark.sql.functions import col, lit, udf

    my_obj = MyClass("arg1")
    my_udf = udf(lambda a, b: b.do_thing(a))
    df = spark.read.parquet(inputUri)
    df = df.withColumn("col1", my_udf(col("col2"), lit(my_obj)))
However, I get AttributeError: 'MyClass' object has no attribute '_get_object_id'. If I try to broadcast my_obj, I get AttributeError: 'Broadcast' object has no attribute '_get_object_id' (trace below).
What does work is fetching the secret outside the class, then instantiating a new object inside the UDF and passing the secret in (with the class modified so that set_3rd_party_obj is called in __init__). However, I want to keep the secret abstracted away in this class. I split set_3rd_party_obj out (so it is not called in __init__) in the hope that I could check, inside the UDF, whether the third-party object has already been initialized and so avoid repeating the work. I haven't got that far yet, since just passing an object with a couple of standard-typed attributes throws an error.
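To make the "check whether it's been initialized" idea concrete, this is roughly the guard I have in mind. It is only a sketch that runs without Spark: some_api_call and third_party_lib here are placeholder stand-ins I wrote for illustration (the real ones are not shown), and init_calls exists purely to count how often the third-party object gets built.

```python
# Hypothetical stand-ins so the sketch runs without the real API or library.
init_calls = []

def some_api_call(arg1):
    # Placeholder for the expensive call that fetches the secret.
    return "secret-for-" + arg1

def third_party_lib(secret):
    # Placeholder for the third-party constructor; records each construction.
    init_calls.append(secret)
    return lambda val: "{}:{}".format(secret, val)

class MyClass(object):
    def __init__(self, arg1):
        self.secret = some_api_call(arg1)
        self.third_party_obj = None

    def set_3rd_party_obj(self):
        # Only build the third-party object if it has not been built yet.
        if self.third_party_obj is None:
            self.third_party_obj = third_party_lib(self.secret)

    def do_thing(self, val):
        # Ensure the third-party object exists before using it.
        self.set_3rd_party_obj()
        return self.third_party_obj(val)

my_obj = MyClass("arg1")
results = [my_obj.do_thing(v) for v in ["a", "b", "c"]]
```

With this guard, repeated do_thing calls on the same instance build the third-party object only once. Whether an instance like this can be shipped to the executors at all is the part I am stuck on.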
I'd be grateful for any pointers, either on how to pass the object to the UDF successfully or on a better way to accomplish this.
Stack trace:
my_udf(col("col2"), lit(my_obj))
File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/sql/functions.py", line 44, in _
File "/usr/lib/spark/python/lib/py4j-0.10.7-src.zip/py4j/java_gateway.py", line 1248, in __call__
File "/usr/lib/spark/python/lib/py4j-0.10.7-src.zip/py4j/java_gateway.py", line 1218, in _build_args
File "/usr/lib/spark/python/lib/py4j-0.10.7-src.zip/py4j/java_gateway.py", line 1218, in <listcomp>
File "/usr/lib/spark/python/lib/py4j-0.10.7-src.zip/py4j/protocol.py", line 298, in get_command_part
AttributeError: 'Broadcast' object has no attribute '_get_object_id'