
I have a PySpark DataFrame with 4 columns:

1) Country
2) col1 [numeric]
3) col2 [numeric]
4) col3 [numeric]

I have a UDF which takes a number and formats it to xx.xx (2 decimal places). Using the "withColumn" function, I can call the UDF and format the numbers.

Example:

df=df.withColumn("col1", num_udf(df.col1))
df=df.withColumn("col2", num_udf(df.col2))
df=df.withColumn("col3", num_udf(df.col3)) 

What I am looking for: can we run these UDFs on the columns in parallel, instead of running them in sequence?

  • The data is already distributed to the nodes, so the processing is already parallel, which also means there is no use for a threaded approach here. – s510 Sep 13 '22 at 10:56
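
A note on this comment: withColumn is a lazy transformation, so the three calls in the question only build up a query plan; nothing executes until an action runs. A minimal sketch (assuming num_udf is the asker's UDF) to confirm this from the physical plan:

# withColumn is lazy: each call only extends the query plan
df = df.withColumn("col1", num_udf(df.col1)) \
       .withColumn("col2", num_udf(df.col2)) \
       .withColumn("col3", num_udf(df.col3))

# explain() prints the physical plan; independent Python UDFs like
# these are typically batched and evaluated in the same pass per row
df.explain()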

2 Answers


Not sure why you want to run it in parallel, but you can achieve it by using rdd and map:

# Toy DataFrame with a single row and three numeric columns
temp = spark.createDataFrame(
    [(1, 2, 3)],
    schema=['col1', 'col2', 'col3']
)

temp.show(3, False)
+----+----+----+
|col1|col2|col3|
+----+----+----+
|1   |2   |3   |
+----+----+----+

# Replace the +1 with your formatting logic; inside rdd.map you
# call a plain Python function rather than a Spark UDF
temp = temp.rdd.map(
    lambda row: (row[0]+ 1, row[1] + 1, row[2] + 1)
).toDF(['col1', 'col2', 'col3'])

temp.show(3, False)
+----+----+----+
|col1|col2|col3|
+----+----+----+
|2   |3   |4   |
+----+----+----+
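
For the asker's two-decimal formatting, a minimal sketch of that map, with fmt as a hypothetical plain-Python helper standing in for the UDF's logic:

# Hypothetical helper mirroring the asker's formatting
def fmt(x):
    return "%0.2f" % x if x is not None else None

temp = temp.rdd.map(
    lambda row: (fmt(row[0]), fmt(row[1]), fmt(row[2]))
).toDF(['col1', 'col2', 'col3'])

Bear in mind that dropping to the RDD API bypasses the DataFrame optimizer, so this route is rarely faster than the withColumn approach.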
– Jonathan Lam

You can also create a UDF from a plain Python function, like below:

from pyspark.sql.functions import udf

# Format a number to two decimal places; pass nulls through unchanged
def formatNumber(x):
    if x is not None:
        return "%0.2f" % x
    else:
        return None

# udf() defaults to a StringType return type, which suits the
# formatted "xx.xx" output here
formatNumberUdf = udf(formatNumber)

df=df.withColumn("col1", formatNumberUdf('col1'))
df=df.withColumn("col2", formatNumberUdf('col2'))
df=df.withColumn("col3", formatNumberUdf('col3'))