I have a row-wise operation to perform on my DataFrame that takes some fixed variables as parameters. The only way I know to do this is with nested functions. I'm trying to compile part of my code with Cython and then call the Cython function from within mapPartitions, but it raises the error PicklingError: Can't pickle <cyfunction outer_function.<locals>._nested_function at 0xfffffff>.
When using pure Python, I do
    def outer_function(fixed_var_1, fixed_var_2):
        def _nested_function(partition):
            for row in partition:
                yield dosomething(row, fixed_var_1, fixed_var_2)
        return _nested_function

    output_df = input_df.repartition(some_col).rdd \
        .mapPartitions(outer_function(a, b))
Right now I have outer_function defined in a separate file, like this
    # outer_func.pyx
    def outer_function(fixed_var_1, fixed_var_2):
        def _nested_function(partition):
            for row in partition:
                yield dosomething(row, fixed_var_1, fixed_var_2)
        return _nested_function
and this
    # runner.py
    from outer_func import outer_function

    output_df = input_df.repartition(some_col).rdd \
        .mapPartitions(outer_function(a, b))
And this throws the pickling error above.
I've looked at https://docs.databricks.com/user-guide/faq/cython.html and tried the approach described there with outer_function, but the same error occurs. The problem seems to be that the nested function does not appear in the module's global namespace, so it cannot be found by name and serialized.
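To illustrate the by-name lookup, here is a minimal pure-Python sketch using the standard pickle module (not Spark's own serializer, and not Cython, so it only approximates the situation above). The body of dosomething and the helper can_pickle are hypothetical stand-ins added for the demonstration.

```python
import pickle


def dosomething(row, fixed_var_1, fixed_var_2):
    # Hypothetical stand-in for the real row-wise operation.
    return row + fixed_var_1 + fixed_var_2


def outer_function(fixed_var_1, fixed_var_2):
    def _nested_function(partition):
        for row in partition:
            yield dosomething(row, fixed_var_1, fixed_var_2)
    return _nested_function


def can_pickle(obj):
    """Return True if the standard pickle module can serialize obj."""
    try:
        pickle.dumps(obj)
        return True
    except (pickle.PicklingError, AttributeError):
        return False


# A module-level function is pickled by reference (module name + qualified name).
top_level_ok = can_pickle(dosomething)

# The nested function only exists as outer_function.<locals>._nested_function,
# which cannot be looked up by name from the module, so pickling it fails.
nested_ok = can_pickle(outer_function(1, 2))
```

With plain Python functions, Spark avoids this failure because its serializer pickles closures by value; the error above suggests that this does not happen for a compiled cyfunction, which falls back to the by-name lookup.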
I've also tried doing this
    def outer_function(fixed_var_1, fixed_var_2):
        global _nested_function
        def _nested_function(partition):
            for row in partition:
                yield dosomething(row, fixed_var_1, fixed_var_2)
        return _nested_function
This throws a different error: AttributeError: 'module' object has no attribute '_nested_function'.
Is there a way to avoid using a nested function in this case? Or is there another way to make the nested function "serializable"?
Thanks!
EDIT: I also tried doing
    # outer_func.pyx
    class PartitionFuncs:
        def __init__(self, fixed_var_1, fixed_var_2):
            self.fixed_var_1 = fixed_var_1
            self.fixed_var_2 = fixed_var_2

        def nested_func(self, partition):
            for row in partition:
                yield dosomething(row, self.fixed_var_1, self.fixed_var_2)

    # main.py
    from outer_func import PartitionFuncs

    p_funcs = PartitionFuncs(a, b)
    output_df = input_df.repartition(some_col).rdd \
        .mapPartitions(p_funcs.nested_func)
And I still get PicklingError: Can't pickle <cyfunction PartitionFuncs.nested_func at 0xfffffff>, so this idea didn't work either.
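One possible way to avoid both the nested function and the bound method is to keep the partition function at module level and bind the fixed variables with functools.partial. The sketch below is pure Python only: the name process_partition and the body of dosomething are hypothetical, and whether this survives Cython compilation and Spark's serializer is an assumption that is not verified here. What is certain is that the standard pickle module serializes a partial by pickling the wrapped function by name, so the function must be importable from its module.

```python
from functools import partial


def dosomething(row, fixed_var_1, fixed_var_2):
    # Hypothetical stand-in for the real row-wise operation.
    return row * fixed_var_1 + fixed_var_2


def process_partition(partition, fixed_var_1, fixed_var_2):
    # Module-level generator: no closure, so it can be referenced by name.
    for row in partition:
        yield dosomething(row, fixed_var_1, fixed_var_2)


# Bind the fixed variables up front; the result takes only the partition.
partition_func = partial(process_partition, fixed_var_1=10, fixed_var_2=1)

# Simulate what mapPartitions does: call the function with an iterator of rows.
result = list(partition_func(iter([1, 2, 3])))
```

In the Spark job this would be passed as .mapPartitions(partial(process_partition, fixed_var_1=a, fixed_var_2=b)), with process_partition imported from the compiled module.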