
I have a PySpark DataFrame with 4 columns:

1) Country
2) col1 [numeric]
3) col2 [numeric]
4) col3 [numeric]

I have a UDF which takes a number and formats it to xx.xx (2 decimal places). Using the "withColumn" function, I can call the UDF and format the numbers.

Example:

df=df.withColumn("col1", num_udf(df.col1))
df=df.withColumn("col2", num_udf(df.col2))
df=df.withColumn("col3", num_udf(df.col3)) 

What I am looking for: can we run these UDFs on the columns in parallel, instead of running them in sequence?

  • The data is already distributed to the nodes, so the processing is already parallel, which also means there is no use for a threaded approach here. – s510 Sep 13 '22 at 10:56
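
A note on this comment: withColumn is a lazy transformation, so the three calls in the question only build up a query plan; nothing executes until an action runs. A minimal sketch (assuming num_udf is the asker's UDF) to confirm this from the physical plan:

# withColumn is lazy: each call only extends the query plan
df = df.withColumn("col1", num_udf(df.col1)) \
       .withColumn("col2", num_udf(df.col2)) \
       .withColumn("col3", num_udf(df.col3))

# explain() prints the physical plan; independent Python UDFs like
# these are typically batched and evaluated in the same pass per row
df.explain()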

2 Answers


Not sure why you want to run it in parallel, but you can achieve it by using rdd and map:

# Toy DataFrame with a single row and three numeric columns
temp = spark.createDataFrame(
    [(1, 2, 3)],
    schema=['col1', 'col2', 'col3']
)

temp.show(3, False)
+----+----+----+
|col1|col2|col3|
+----+----+----+
|1   |2   |3   |
+----+----+----+

# Replace the +1 with your formatting logic; inside rdd.map you
# call a plain Python function rather than a Spark UDF
temp = temp.rdd.map(
    lambda row: (row[0]+ 1, row[1] + 1, row[2] + 1)
).toDF(['col1', 'col2', 'col3'])

temp.show(3, False)
+----+----+----+
|col1|col2|col3|
+----+----+----+
|2   |3   |4   |
+----+----+----+
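
For the asker's two-decimal formatting, a minimal sketch of that map, with fmt as a hypothetical plain-Python helper standing in for the UDF's logic:

# Hypothetical helper mirroring the asker's formatting
def fmt(x):
    return "%0.2f" % x if x is not None else None

temp = temp.rdd.map(
    lambda row: (fmt(row[0]), fmt(row[1]), fmt(row[2]))
).toDF(['col1', 'col2', 'col3'])

Bear in mind that dropping to the RDD API bypasses the DataFrame optimizer, so this route is rarely faster than the withColumn approach.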
– Jonathan Lam

You can also create a UDF from a plain Python function, like below:

from pyspark.sql.functions import udf

# Format a number to two decimal places; pass nulls through unchanged
def formatNumber(x):
    if x is not None:
        return "%0.2f" % x
    else:
        return None

# udf() defaults to a StringType return type, which suits the
# formatted "xx.xx" output here
formatNumberUdf = udf(formatNumber)

df=df.withColumn("col1", formatNumberUdf('col1'))
df=df.withColumn("col2", formatNumberUdf('col2'))
df=df.withColumn("col3", formatNumberUdf('col3'))