In Pyspark how do I compare two columns and use x whenever they are not the same

Question

How would I compare two columns and say that I want to use x column when they are not the same This is what I'm doing right now.

 SUMMARY = SUMMARY.withColumn("type_description", F.when((SUMMARY.type_description != SUMMARY.rename_description), F.lit("rename_description")

sophocles · Answer 1 · 2021-11-23T08:25:54.467

IIUC, you want to compare when two columns whether they are the same and return the value of y column if not, and value of x column if they are. If that, you can use when and col from pyspark.sql.functions:

from pyspark.sql.functions import when, col

df = df.withColumn('type_description_new',
                   when(col('type_description')!=col('rename_description'),
                                               col('rename_description')).otherwise(col('type_description')))

Result & Setup:

df = spark.createDataFrame(
    [(1,1), (2,2),(3,4)], 
    ['type_description', 'rename_description']
)

>>> df.show(truncate=False)
+----------------+------------------+--------------------+
|type_description|rename_description|type_description_new|
+----------------+------------------+--------------------+
|               1|                 1|                   1|
|               2|                 2|                   2|
|               3|                 4|                   4|
+----------------+------------------+--------------------+

score 2 · Answer 2 · answered Nov 22 '21 at 17:49

You're very close, you were just missing the .otherwise call.

from pyspark.sql import types as T, functions as F, SparkSession

spark = SparkSession.builder.getOrCreate()


# Synthesize DataFrames
schema = T.StructType([
  T.StructField("type_description", T.StringType(), False),
  T.StructField("rename_description", T.StringType(), False),
  T.StructField("col_3", T.StringType(), False),
  T.StructField("col_4", T.IntegerType(), False),
])
data = [
  {"type_description": "key_1", "rename_description": "key_2", "col_3": "CREATE", "col_4": 0},
  {"type_description": "key_2", "rename_description": "key_2", "col_3": "CREATE", "col_4": 0},
  {"type_description": "key_3", "rename_description": "OVERRIDE", "col_3": "CREATE", "col_4": 0},
]

df = spark.createDataFrame(data, schema)
df.show()
"""
+----------------+------------------+------+-----+
|type_description|rename_description| col_3|col_4|
+----------------+------------------+------+-----+
|           key_1|             key_2|CREATE|    0|
|           key_2|             key_2|CREATE|    0|
|           key_3|          OVERRIDE|CREATE|    0|
+----------------+------------------+------+-----+
"""

SUMMARY_DF = df.withColumn(
  "final_description",
  F.when(
    df.type_description == df.rename_description,
    df.type_description
  ).otherwise(
    df.rename_description
  )
)
SUMMARY_DF.show()
"""
+----------------+------------------+------+-----+-----------------+
|type_description|rename_description| col_3|col_4|final_description|
+----------------+------------------+------+-----+-----------------+
|           key_1|             key_2|CREATE|    0|            key_2|
|           key_2|             key_2|CREATE|    0|            key_2|
|           key_3|          OVERRIDE|CREATE|    0|         OVERRIDE|
+----------------+------------------+------+-----+-----------------+
"""

jackfischer · Accepted Answer · 2021-12-29T22:36:18.720

1

You can use a column expression in place of your literal. This would look very close to what you have,

SUMMARY = SUMMARY.withColumn(
    "type_description",
    F.when(SUMMARY.type_description != SUMMARY.rename_description, SUMMARY.x).otherwise(
        SUMMARY.type_description
    )
)

edited Dec 29 '21 at 22:36

answered Nov 22 '21 at 17:46

jackfischer

86
2
8

As it’s currently written, your answer is unclear. Please [edit] to add additional details that will help others understand how this addresses the question asked. You can find more information on how to write good answers [in the help center](/help/how-to-answer). – Community Nov 22 '21 at 18:59

In Pyspark how do I compare two columns and use x whenever they are not the same

3 Answers3