
I see so many examples that use a lambda with rdd.map. I'm just wondering if we can do something like the following:

df.withColumn('newcol',(lambda x: x['col1'] + x['col2'])).show()
mytabi
  • What operation do you need to perform? If you just want to sum two columns, you can do that directly without a lambda. – Nikunj Kakadiya Oct 26 '21 at 05:03
  • I would just like to know whether it can be done directly with a lambda over the DataFrame, instead of having to go through an RDD. – mytabi Oct 27 '21 at 01:29

1 Answer


You'll have to wrap it in a UDF and provide the columns you want your lambda to be applied to.

Example:

from pyspark.sql import SparkSession
import pyspark.sql.functions as F

if __name__ == "__main__":
    spark = SparkSession.builder.getOrCreate()
    data = [{"a": 1, "b": 2}]
    df = spark.createDataFrame(data)
    # Wrap the lambda in a UDF, then call it with the column names as arguments
    df.withColumn("c", F.udf(lambda x, y: x + y)("a", "b")).show()

Result:

+---+---+---+
|  a|  b|  c|
+---+---+---+
|  1|  2|  3|
+---+---+---+
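
As the comment on the question points out, a simple sum doesn't need a lambda at all: a native column expression is preferable because it avoids the Python UDF serialization overhead. Also note that F.udf defaults to a StringType return type, so declare an explicit type if you want to keep the result numeric. A minimal sketch of both variants, reusing df and the imports from the example above:

from pyspark.sql.types import LongType

# Native column expression: no Python UDF involved, column stays numeric
df.withColumn("c", F.col("a") + F.col("b")).show()

# If you really do need a lambda, declare the return type
# (hypothetical name add_cols, used only for illustration)
add_cols = F.udf(lambda x, y: x + y, LongType())
df.withColumn("c", add_cols("a", "b")).show()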
vladsiv