
I am trying to use 2 PySpark dataframe columns as input to a nested dictionary to get the output as a new PySpark column. I would also like the solution to scale to a nested dictionary with 4-5 levels.

The dictionary is of the form: dict_prob = {"a": {"x1": "y1", "x2": "y2"}, "b": {"m1": "n1", "m2": "n2"}}

Input Columns are:

index col1 col2
0     a    x1
1     a    x2
2     b    m2

Output Column Needed

col3
y1
y2
n2

I tried the links below, but they seem to work for a single dictionary and not for a nested dictionary: "PySpark create new column with mapping from a dict" and "How to use a column value as key to a dictionary in PySpark?"

1 Answer

For the given example, you can use a simple udf:

from pyspark.sql.functions import udf

# Look up col1 at the first level of dict_prob, then col2 at the second
two_lvls = udf(lambda l1, l2: dict_prob[l1][l2])

df = df.withColumn("col3", two_lvls(df.col1, df.col2))

Output:

df.show()

+-----+----+----+----+
|index|col1|col2|col3|
+-----+----+----+----+
|    0|   a|  x1|  y1|
|    1|   a|  x2|  y2|
|    2|   b|  m2|  n2|
+-----+----+----+----+
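
Since you also want this to scale to 4-5 levels, a sketch of one possible generalization (nested_lookup and n_lvls are names introduced here, not part of any API) is a variadic udf that walks the dictionary one key per column and returns null when a key is missing:

from pyspark.sql.functions import udf
from pyspark.sql.types import StringType

def nested_lookup(*keys):
    # Descend one level of dict_prob per key; stop with None on a missing key
    node = dict_prob
    for k in keys:
        if not isinstance(node, dict):
            return None
        node = node.get(k)
    return node

n_lvls = udf(nested_lookup, StringType())

# Pass the columns in nesting order; add more columns for deeper dictionaries
df = df.withColumn("col3", n_lvls("col1", "col2"))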
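
If you would rather avoid a Python udf (the approach the linked questions use for flat dictionaries), a sketch of a pure-Spark alternative is to flatten the nested dictionary into a single-level map literal keyed by the joined path; flat, mapping, and the "|" separator are assumptions here, and this only works if "|" never occurs in the keys:

from itertools import chain
from pyspark.sql import functions as F

# Flatten {"a": {"x1": "y1", ...}} into {"a|x1": "y1", ...}
flat = {f"{k1}|{k2}": v
        for k1, inner in dict_prob.items()
        for k2, v in inner.items()}

# Build a map<string,string> literal from the flattened key-value pairs
mapping = F.create_map([F.lit(x) for x in chain(*flat.items())])

# Index the map with the key rebuilt from col1 and col2
df = df.withColumn("col3", mapping[F.concat_ws("|", "col1", "col2")])

This keeps the lookup on the JVM side, so there is no per-row Python serialization overhead; for deeper dictionaries you would flatten more levels and pass more columns to concat_ws.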