Pyspark: how to replace column values with dict when types of the keys and values are different

Question

I have a pySpark dataframe with a column of integers. I also have a mapping dict from integers to strings like

{1: 'A', 
 2: 'B', 
 3: 'C'}

I would like to get a new column from the original column using this mapping. How can I do this?

I tried to use the replace function, but it casts the new values into the same datatype as the original. I think I could first cast the integers into strings, but it would be nice to know a more general way to do this. https://spark.apache.org/docs/3.1.2/api/python/reference/api/pyspark.sql.DataFrame.replace.html

I am a newbie with pySpark and probably just missing something very simple. :) Thanks for the help in advance!

You can have a look here https://stackoverflow.com/questions/42980704/pyspark-create-new-column-with-mapping-from-a-dict — ScootCork, Jul 06 '22 at 12:01
[this](https://stackoverflow.com/q/72865766/8279585) might be helpful. it helps create a new column which will preserve the new data type. — samkart, Jul 06 '22 at 14:31
Thanks @ScootCork , a mapping expression created by `create_map` worked nicely! — RVa, Jul 07 '22 at 06:28

score 0 · Answer 1 · answered Jul 06 '22 at 15:03

A when().otherwise() expression can be used here, with the when() calls chained using reduce() (from functools).

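from functools import reduce
from pyspark.sql import functions as func

# `spark` is assumed to be an existing SparkSession
data_sdf = spark.range(3).toDF('num')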

# +---+
# |num|
# +---+
# |  0|
# |  1|
# |  2|
# +---+

replace_dict = {
    1: 'A', 
    2: 'B', 
    3: 'C'
}

We can use reduce() to chain the when() calls.

# Start from a null-check branch, then append one when() per dictionary key.
# isNull() is used because a `== None` comparison is never true in Spark SQL.
when_statement = reduce(
    lambda x, y: x.when(func.col('num') == y, replace_dict[y]),
    replace_dict.keys(),
    func.when(func.col('num').isNull(), func.lit(None))
).otherwise(func.lit(None))

print(when_statement)
# Column<'CASE WHEN (num IS NULL) THEN NULL WHEN (num = 1) THEN A WHEN (num = 2) THEN B WHEN (num = 3) THEN C ELSE NULL END'>

data_sdf. \
    withColumn('replaced_vals', when_statement). \
    show()

# +---+-------------+
# |num|replaced_vals|
# +---+-------------+
# |  0|         null|
# |  1|            A|
# |  2|            B|
# +---+-------------+

reduce() applies a function cumulatively to the items of an iterable. Its signature is reduce(function, iterable[, initializer]): the first argument is the function, here the lambda that appends one when() branch; the second is the iterable, here the dictionary keys, each of which is used to look up its replacement value in the dictionary. The third argument is optional but important in this case: it is the initial value at the start of the chain. Because we want func.when().when()....otherwise(), we pass the first func.when() as the initial value, and the function chains the remaining when() calls onto it one key at a time.
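To see the role of the initializer without a Spark session, here is a minimal pure-Python sketch of the same accumulation pattern. It builds a plain string instead of a Column, so the string format and the add_branch helper are illustrative assumptions and not something Spark itself produces.

```python
from functools import reduce

replace_dict = {1: 'A', 2: 'B', 3: 'C'}

def add_branch(chain_so_far, key):
    # Append one WHEN branch to whatever has been accumulated so far.
    return f"{chain_so_far} WHEN num = {key} THEN {replace_dict[key]}"

# "CASE" is the initializer: the accumulator's value before any key is processed.
branches = reduce(add_branch, replace_dict.keys(), "CASE")

# Mirrors the trailing .otherwise(...) that closes the chain.
case_expr = branches + " ELSE NULL END"
print(case_expr)
```

Each call receives the accumulated value and the next key, which is the same shape as the lambda that calls x.when(...) on the accumulated Column.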

Thanks for the answer @samkart ! I solved my case using `create_map`, but I will check this out and test to learn more! — RVa, Jul 07 '22 at 06:47
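For reference, the `create_map` approach mentioned in the comments might look roughly like the sketch below. The helper names flatten_mapping and add_mapped_column are hypothetical, and the code assumes a DataFrame with an integer column; it is one way to write the pattern, not necessarily what the asker ended up with.

```python
from itertools import chain

mapping = {1: 'A', 2: 'B', 3: 'C'}

def flatten_mapping(d):
    """Flatten {k1: v1, k2: v2} into [k1, v1, k2, v2], the order create_map expects."""
    return list(chain.from_iterable(d.items()))

def add_mapped_column(df, src_col, dst_col, d):
    """Return df with dst_col holding the value mapped from src_col (hypothetical helper)."""
    # Imported here so flatten_mapping can be tried without a Spark installation.
    from pyspark.sql import functions as F

    # Build a literal map column: create_map(lit(1), lit('A'), lit(2), lit('B'), ...)
    mapping_expr = F.create_map(*[F.lit(x) for x in flatten_mapping(d)])

    # Indexing the map column with another column looks up each row's key.
    return df.withColumn(dst_col, mapping_expr[F.col(src_col)])
```

It could then be called as add_mapped_column(data_sdf, 'num', 'replaced_vals', mapping).show(). The new column takes the type of the map's values (string here), independent of the key type. With default settings, keys that are absent from the map should come back as null, though this is worth checking on your Spark version.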
