
I am trying to use 2 PySpark dataframe columns as input to a nested dictionary to get the output as a new PySpark column. I would also like the solution to scale to a nested dictionary with 4-5 levels.

The dictionary is of the form: dict_prob = {"a": {"x1": "y1", "x2": "y2"}, "b": {"m1": "n1", "m2": "n2"}}

Input Columns are:

index col1 col2
0     a    x1
1     a    x2
2     b    m2

Output Column Needed

col3
y1
y2
n2

I tried the links below, but they seem to work for a single dictionary and not for a nested dictionary: "PySpark create new column with mapping from a dict" and "How to use a column value as key to a dictionary in PySpark?"

1 Answer

For the given example, you can use a simple udf:

from pyspark.sql.functions import udf

# Look up col1 at the first level of dict_prob, then col2 at the second
two_lvls = udf(lambda l1, l2: dict_prob[l1][l2])

df = df.withColumn("col3", two_lvls(df.col1, df.col2))

Output:

df.show()

+-----+----+----+----+
|index|col1|col2|col3|
+-----+----+----+----+
|    0|   a|  x1|  y1|
|    1|   a|  x2|  y2|
|    2|   b|  m2|  n2|
+-----+----+----+----+
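
Since you also want this to scale to 4-5 levels, a sketch of one possible generalization (nested_lookup and n_lvls are names introduced here, not part of any API) is a variadic udf that walks the dictionary one key per column and returns null when a key is missing:

from pyspark.sql.functions import udf
from pyspark.sql.types import StringType

def nested_lookup(*keys):
    # Descend one level of dict_prob per key; stop with None on a missing key
    node = dict_prob
    for k in keys:
        if not isinstance(node, dict):
            return None
        node = node.get(k)
    return node

n_lvls = udf(nested_lookup, StringType())

# Pass the columns in nesting order; add more columns for deeper dictionaries
df = df.withColumn("col3", n_lvls("col1", "col2"))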
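
If you would rather avoid a Python udf (the approach the linked questions use for flat dictionaries), a sketch of a pure-Spark alternative is to flatten the nested dictionary into a single-level map literal keyed by the joined path; flat, mapping, and the "|" separator are assumptions here, and this only works if "|" never occurs in the keys:

from itertools import chain
from pyspark.sql import functions as F

# Flatten {"a": {"x1": "y1", ...}} into {"a|x1": "y1", ...}
flat = {f"{k1}|{k2}": v
        for k1, inner in dict_prob.items()
        for k2, v in inner.items()}

# Build a map<string,string> literal from the flattened key-value pairs
mapping = F.create_map([F.lit(x) for x in chain(*flat.items())])

# Index the map with the key rebuilt from col1 and col2
df = df.withColumn("col3", mapping[F.concat_ws("|", "col1", "col2")])

This keeps the lookup on the JVM side, so there is no per-row Python serialization overhead; for deeper dictionaries you would flatten more levels and pass more columns to concat_ws.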