
I have a problem in PySpark: creating a column in a new dataframe based on values in another column. It feels tedious, and not like good practice, to use a lot of

CASE
  WHEN column_a = 'value_1' THEN 'value_x'
  WHEN column_a = 'value_2' THEN 'value_y'
  ...  
  WHEN column_a = 'value_289' THEN 'value_xwerwz'
END

In cases like this in Python, I'm used to using a dict or, even better, a configparser file, and avoiding the if/else chain: I just pass the key and Python returns the desired value. We also get a 'fallback' option for the ELSE clause.
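For example, in plain Python I would do something like this (the names here are just placeholders):

mapping = {"value_1": "value_x", "value_2": "value_y"}
new_value = mapping.get(column_a_value, "fallback_value")  # .get() plays the role of the ELSE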

The problem, it seems to me, is that we are not treating a single row but all of them in one command, so using a dict/map/configparser directly isn't an option. I thought about looping over a dict, but that seems too slow and a waste of computation, since we would repeat all the conditions.

I'm still looking for this kind of practice; if I find it, I'll post it here. But, you know, probably a lot of people already use it and I just don't know it yet. If there's no other way, fine, but using many WHEN/THEN conditions wouldn't be my choice.

Thank you

I tried to use a dict and searched for solutions like this

Ronak Jain
2 Answers


You could create a function which converts a dict into a Spark F.when expression, e.g.:

import pyspark.sql.functions as F

def create_spark_when(column, conditions, default):
    # Build a nested CASE expression: each new condition wraps the expression
    # built so far inside its otherwise() branch.
    when = None
    for key, value in conditions.items():
        current_when = F.when(F.col(column) == key, value)
        if when is None:
            # First condition: fall back to the default value
            when = current_when.otherwise(default)
        else:
            # Later conditions: fall back to the expression built so far
            when = current_when.otherwise(when)
    return when

df = spark.createDataFrame([(0,), (1,), (2,)])
df.show()
my_conditions = {1: "a", 2: "b"}
my_default = "c"
df.withColumn(
    "my_column",
    create_spark_when("_1", my_conditions, my_default),
).show()

Output:

+---+
| _1|
+---+
|  0|
|  1|
|  2|
+---+

+---+---------+
| _1|my_column|
+---+---------+
|  0|        c|
|  1|        a|
|  2|        b|
+---+---------+
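For what it's worth, the same dict-to-when idea can also be written with chained when calls instead of nesting them in otherwise. A sketch, functionally equivalent for a simple mapping like this (it assumes conditions is non-empty):

import pyspark.sql.functions as F

def create_spark_when_chained(column, conditions, default):
    items = list(conditions.items())
    first_key, first_value = items[0]
    # Start the CASE expression with the first condition ...
    when = F.when(F.col(column) == first_key, first_value)
    # ... chain the remaining conditions ...
    for key, value in items[1:]:
        when = when.when(F.col(column) == key, value)
    # ... and close it with the ELSE branch.
    return when.otherwise(default)
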
hayj

One option is to create a dataframe out of the dictionary and perform a join.

This would work:

Creating a Dataframe:

mapping = {"value_1": "value_x", "value_2": "value_y"}
dict_df = spark.createDataFrame([(k, v) for k, v in mapping.items()], ["key", "value"])

Performing the join:

df.alias("df1")\
.join(F.broadcast(dict_df.alias("df2")), F.col("column_a")==F.col("key"))\
.selectExpr("df1.*","df2.value as newColumn")\
.show()

We can broadcast dict_df since it is small, which avoids shuffling the large dataframe.
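Note that an inner join drops rows whose column_a has no match in dict_df. If you need the ELSE/fallback behaviour from the question, a left join plus coalesce should cover it; a sketch, with "fallback_value" standing in for whatever default you want:

df.alias("df1") \
    .join(F.broadcast(dict_df.alias("df2")), F.col("column_a") == F.col("key"), "left") \
    .selectExpr("df1.*", "coalesce(df2.value, 'fallback_value') as newColumn") \
    .show()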

(The input dataframe, dict_df, and the joined output were shown as screenshots in the original post.)

Alternatively, you can use a UDF, but that is not recommended: UDFs are opaque to Spark's optimizer and every row has to pass through Python, so they are generally slower than native column expressions or a join.
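For completeness, such a UDF might look like this (a sketch; the dict lookup with .get() gives you the fallback directly):

import pyspark.sql.functions as F
from pyspark.sql.types import StringType

mapping = {"value_1": "value_x", "value_2": "value_y"}

# Plain Python dict lookup wrapped as a UDF; .get() supplies the fallback value
lookup = F.udf(lambda v: mapping.get(v, "fallback_value"), StringType())

df.withColumn("newColumn", lookup(F.col("column_a"))).show()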

Ronak Jain