A simplified version of my problem is this:
I have a Spark DataFrame ("my_df") with one column ("col1") and values 'a','b','c','d'
and a dictionary ("my_dict") like this: {'a': 5, 'b': 7, 'c': 2, 'd': 4}
I would like to combine these to create a DataFrame with an additional column containing the corresponding values from my_dict.
At the moment I am using the following method, which works for a small dataset, but it's very inefficient and causes a StackOverflowError on my full dataset:
import pyspark.sql.functions as F

# start with an arbitrary df containing "col1"
# initialise the new column with zeros
my_df = my_df.withColumn('dict_data', F.lit(0))

# overwrite the zero with the matching dict value, one withColumn per key
for k, v in my_dict.items():
    my_df = my_df.withColumn(
        'dict_data',
        F.when(my_df['col1'] == k, v).otherwise(my_df['dict_data'])
    )
Is there a better way to do this? I've tried using Window functions, but I've had difficulty applying them in this context.