
A simplified version of my problem is this:

I have a Spark DataFrame ("my_df") with one column ("col1") and values 'a','b','c','d'

and a dictionary ("my_dict") like this: {'a': 5, 'b': 7, 'c': 2, 'd': 4}

I would like to combine these to create a DataFrame with an additional column containing the corresponding values from my_dict.

At the moment I am using the following method, which works for a small dataset, but it's very inefficient and causes a StackOverflowError on my full dataset:

import pyspark.sql.functions as F

# start with an arbitrary df containing "col1"
# initialise new column with zeros
my_df = my_df.withColumn('dict_data', F.lit(0))

for k, v in my_dict.items():
    my_df = my_df.withColumn('dict_data',
                             F.when(my_df['col1'] == k, v)
                              .otherwise(my_df['dict_data']))

Is there a better way to do this? I've tried using Window functions, but I've had difficulty applying them in this context...

Dan

2 Answers


You just need to map your dictionary values into a new column based on the values of your first column. See:

pyspark create new column with mapping from a dict

pissall

You can do it with an intermediate dataframe and a join:

rows = [{'col1': key, 'dict_data': value} for key, value in my_dict.items()]
my_dict_df = spark.sparkContext.parallelize(rows).toDF()

result_df = my_df.join(my_dict_df, 'col1', 'left')
Pierre Gourseaud