
I have a problem with a large DataFrame (40+ billion rows) where I am creating a new column by passing an array column to a UDF. A plain PySpark UDF works for a smaller dataset but stops working beyond a few thousand records, so I am trying pandas_udf with Apache Arrow.

group_key col1 col2
123       a    5
123       a    6
123       b    6
123       cd   3 
123       d    2
123       ab   9
456       d    4  
456       ad   6 
456       ce   7 
456       a    4 
456       s    3 
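For reference, the sample above can be reproduced locally (column names taken from the table; the SparkSession line is commented out and assumed):

```python
# Sample data matching the table above.
data = [
    (123, "a", 5), (123, "a", 6), (123, "b", 6),
    (123, "cd", 3), (123, "d", 2), (123, "ab", 9),
    (456, "d", 4), (456, "ad", 6), (456, "ce", 7),
    (456, "a", 4), (456, "s", 3),
]
columns = ["group_key", "col1", "col2"]

# With an active SparkSession `spark` (assumed):
# df_new2 = spark.createDataFrame(data, columns)
```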

Desired output

group_key output
123       9.2
456       7.3

This is a sample of the code, where I pass an array of arrays to the UDF. Inside func, arr_of_arr shows up as a numpy ndarray. I return StringType() here.

df_arr2 = df_new2.groupBy("group_key").agg(F.collect_list(F.array("col1", "col2")).alias("key_value_arr"))
    

|group_key|key_value_arr                                     |
|123      |[[a, 5], [a, 6], [b, 6], [cd, 3], [d, 2], [ab, 9]]|
|456      |[[d, 4], [ad, 6], [ce, 7], [a, 4], [s, 3]]        |



df_arr2.printSchema()

import numpy as np
import pandas as pd
from pyspark.sql import functions as F
from pyspark.sql.functions import pandas_udf
from pyspark.sql.types import StringType

@pandas_udf(StringType())
def func(arr_of_arr: pd.Series) -> pd.Series:
    out = []
    for pairs in arr_of_arr:            # one ndarray of [key, value] pairs per group
        res = ''
        for key, value in pairs.tolist():
            res = str(value)            # placeholder logic: just keeps the last value
        out.append(res)
    return pd.Series(out)               # one output per input row in the batch

df_arr2.withColumn('col3',func(df_arr2.key_value_arr)).show(truncate=False)

This code returns the correct value from the inner array elements (like "5", "7", etc.) for a few records. For the full dataset it errors out, probably for performance-related reasons:

An error occurred while calling o480.showString.
  File "/hadoop/3/yarn/local/usercache/b_incdata_rw/appcache/application_1660704390900_2339796/container_e3797_1660704390900_2339796_01_000001/py4j-0.10.7-src.zip/py4j/protocol.py", line 328, in get_return_value
    format(target_id, ".", name), value)
Py4JJavaError: An error occurred while calling o480.showString.
: org.apache.spark.SparkException: Job aborted due to stage failure: ResultStage 128
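One thing worth checking independently of data size: a scalar pandas_udf receives a pandas Series per Arrow batch and must return a Series of the same length, otherwise Spark aborts the stage. The per-batch body can be exercised in plain pandas without a cluster (names here are illustrative):

```python
import numpy as np
import pandas as pd

# A mock Arrow batch: a Series whose elements are ndarrays of
# [key, value] pairs, one element per group row.
batch = pd.Series([
    np.array([["a", "5"], ["a", "6"], ["ab", "9"]], dtype=object),
    np.array([["d", "4"], ["s", "3"]], dtype=object),
])

def func_body(arr_of_arr):
    out = []
    for pairs in arr_of_arr:              # one ndarray per input row
        res = ""
        for key, value in pairs.tolist():
            res = str(value)              # placeholder: keep the last value
        out.append(res)
    return pd.Series(out)                 # same length as the input batch

result = func_body(batch)
```

Running the body this way makes it easy to verify the output Series lines up row-for-row with the input before wrapping it in @pandas_udf.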

sample output:
root
 |-- group_key: long (nullable = true)
 |-- key_value_arr: array (nullable = true)
 |    |-- element: array (containsNull = true)
 |    |    |-- element: string (containsNull = true)

|group_key|key_value_arr                                     |col3|
|123      |[[a, 5], [a, 6], [b, 6], [cd, 3], [d, 2], [ab, 9]]|9   |

The real UDF logic is much more complex than this, but I need the key and the value from each array element and have to loop through them. Performance does not look acceptable for such logic. Any suggestions?
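If the per-group logic can be expressed over a plain pandas DataFrame, a grouped-map pandas UDF (`applyInPandas`, Spark 3.0+) may avoid building the collect_list arrays entirely: each group arrives as a pandas DataFrame with the original columns. A minimal sketch, using a mean of col2 as a stand-in for the real (more complex) logic:

```python
import pandas as pd

def compute(pdf: pd.DataFrame) -> pd.DataFrame:
    # Placeholder per-group logic: mean of col2.
    # The real logic would iterate over pdf["col1"] / pdf["col2"] directly.
    return pd.DataFrame({
        "group_key": [pdf["group_key"].iloc[0]],
        "output": [float(pdf["col2"].mean())],
    })

# With Spark (assumed session and DataFrame from the question):
# result = df_new2.groupBy("group_key").applyInPandas(
#     compute, schema="group_key long, output double")

# The same function can be exercised locally on one group:
pdf = pd.DataFrame({"group_key": [123] * 3,
                    "col1": ["a", "b", "cd"],
                    "col2": [5, 6, 7]})
out = compute(pdf)
```

This keeps the key/value looping in pandas, but skips the collect_list step and the nested-array unpacking.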

  • Does it have to be a UDF, or would you have the option to convert it to the "Spark way"? Many people have said their logic is "complicated" as an excuse for their laziness to do the conversion. – pltc Sep 19 '22 at 09:08
