I have a problem with a large DataFrame (40 billion+ rows) where I create a new column by passing an array column to a UDF. A regular PySpark UDF works for a small dataset but fails once there are more than a few thousand records, so I am trying pandas_udf with Apache Arrow.
group_key col1 col2
123 a 5
123 a 6
123 b 6
123 cd 3
123 d 2
123 ab 9
456 d 4
456 ad 6
456 ce 7
456 a 4
456 s 3
Desired output
group_key output
123 9.2
456 7.3
This is sample code where I pass an array of arrays to the UDF. Inside func, arr_of_arr shows up as a numpy ndarray. The UDF returns StringType() here.
df_arr2 = df_new2.groupBy("group_key").agg(F.collect_list(F.array("col1", "col2")).alias("key_value_arr"))
|group_key|key_value_arr|
|123 |[[a, 5], [a, 6], [b, 6], [cd, 3], [d, 2], [ab, 9]]|
|456 |[[d, 4], [ad, 6], [ce, 7], [a, 4], [s, 3]]|
df_arr2.printSchema()
import numpy as np
import pandas as pd
from pyspark.sql.functions import pandas_udf
from pyspark.sql.types import StringType

@pandas_udf(StringType())
def func(arr_of_arr):
    res = ''
    # arr_of_arr.values is a numpy.ndarray
    # arr_of_arr has only 1 row of array objects: index is 0, value is the array of arrays
    for index, value in np.ndenumerate(arr_of_arr):
        for index1, value1 in value.tolist():
            res = str(value1)  # just returning the last value here
    return pd.Series(res)
df_arr2.withColumn('col3',func(df_arr2.key_value_arr)).show(truncate=False)
This code returns the correct value from the inner array elements (for example "5" or "7") for a few records. On the full dataset it fails with the error below, which I suspect is performance related.
An error occurred while calling o480.showString.
File "/hadoop/3/yarn/local/usercache/b_incdata_rw/appcache/application_1660704390900_2339796/container_e3797_1660704390900_2339796_01_000001/py4j-0.10.7-src.zip/py4j/protocol.py", line 328, in get_return_value
    format(target_id, ".", name), value)
Py4JJavaError: An error occurred while calling o480.showString.
: org.apache.spark.SparkException: Job aborted due to stage failure: ResultStage 128
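One possible cause worth ruling out, which the truncated trace above neither confirms nor excludes: a scalar pandas UDF is called with a batch of rows and has to return a Series of the same length, while func keeps a single res for the whole batch. A minimal sketch of per-row processing follows. It models the batch as a plain Python list so that it runs without Spark, and `last_value` and `process_batch` are hypothetical names, not part of any library.

```python
# Hypothetical helper: process ONE row's list of [key, value] pairs.
def last_value(pairs):
    """Return the value of the last [key, value] pair as a string, or '' if empty."""
    res = ""
    for key, value in pairs:
        res = str(value)
    return res


# Hypothetical stand-in for a batch: a pandas UDF receives many rows at once,
# so the result needs exactly one entry per input row.
def process_batch(batch):
    return [last_value(pairs) for pairs in batch]


# The two sample groups from the post, with both elements as strings.
batch = [
    [["a", "5"], ["a", "6"], ["b", "6"], ["cd", "3"], ["d", "2"], ["ab", "9"]],
    [["d", "4"], ["ad", "6"], ["ce", "7"], ["a", "4"], ["s", "3"]],
]
results = process_batch(batch)
print(results)  # ['9', '3']
```

Inside the real UDF, the equivalent would presumably be `return arr_of_arr.apply(last_value)`, which yields one string per input row instead of one string per batch.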
Sample output (schema from printSchema(), then the result of show()):
root
|-- group_key: long (nullable = true)
|-- key_value_arr: array (nullable = true)
| |-- element: array (containsNull = true)
| | |-- element: string (containsNull = true)
|group_key|key_value_arr|col3|
|123 |[[a, 5], [a, 6], [b, 6], [cd, 3], [d, 2], [ab, 9]]|9|
The real UDF logic is much more complex than this, but I need to read each inner array as a key and a value and loop through them. Performance does not look acceptable for this kind of logic. Any suggestions?
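As a starting point for the key/value loop, here is a minimal sketch in plain Python that turns one group's pairs into a dictionary. It assumes, based on the schema above, that both elements arrive as strings (F.array puts col1 and col2into one string array), so the value is converted back to a number. `group_values_by_key` is a hypothetical helper, and the real scoring logic is not shown in this post, so none is invented here.

```python
from collections import defaultdict


# Hypothetical helper: collect the numeric values seen for each key in one group.
def group_values_by_key(pairs):
    grouped = defaultdict(list)
    for key, value in pairs:
        # Values arrive as strings because the array holds a single element type.
        grouped[key].append(float(value))
    return dict(grouped)


# Sample pairs for group_key 123 from the post.
pairs_123 = [["a", "5"], ["a", "6"], ["b", "6"], ["cd", "3"], ["d", "2"], ["ab", "9"]]
by_key = group_values_by_key(pairs_123)
print(by_key)
# {'a': [5.0, 6.0], 'b': [6.0], 'cd': [3.0], 'd': [2.0], 'ab': [9.0]}
```

Keeping this step as an ordinary function makes it possible to test the per-group logic locally before wiring it into a UDF.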