
A simplified version of my problem is this:

I have a Spark DataFrame ("my_df") with one column ("col1") and values 'a','b','c','d'

and a dictionary ("my_dict") like this: {'a': 5, 'b': 7, 'c': 2, 'd': 4}

I would like to combine these to create a DataFrame with an additional column containing the corresponding values from my_dict.

At the moment I am using the following method, which works for a small dataset, but it's very inefficient and causes a StackOverflowError on my full dataset:

import pyspark.sql.functions as F

# start with an arbitrary df containing "col1"
# initialise new column with zeros
my_df = my_df.withColumn('dict_data', F.lit(0))

for k, v in my_dict.items():
    my_df = my_df.withColumn('dict_data',
                             F.when(my_df['col1'] == k, v)
                              .otherwise(my_df['dict_data']))

Is there a better way to do this? I've tried using Window functions, but I've had difficulty applying them in this context...

Dan

2 Answers


You just need to map your dictionary values into a new column based on the values of your first column. See:

pyspark create new column with mapping from a dict

pissall

You can do it with an intermediate dataframe and a join:

rows = [{'col1': key, 'dict_data': value} for key, value in my_dict.items()]
my_dict_df = spark.sparkContext.parallelize(rows).toDF()

result_df = my_df.join(my_dict_df, 'col1', 'left')
Pierre Gourseaud