I have a pyspark dataframe df that has columns product_id and image. the product_id column is a unique identifier for the product. the image columns is a list of dictionaries containing the url and default setting for links to images associated with each product. the lists of dictionaries have the form:
[{"url":"path","default":true},{"url":"path2","default":false}]
some of the product_id in the dataframe df have multiple records with different lists, example:
+----------+----------------------------------------------------------------+
|product_id|image |
+----------+----------------------------------------------------------------+
|3 |[{"url":"path3","default":true},{"url":"path5","default":false}]|
+----------+----------------------------------------------------------------+
|3 |[{"url":"path7","default":true},{"url":"path8","default":false}]|
+----------+----------------------------------------------------------------+
I would like to create pyspark or spark sql to combine these multiple lists of dictionaries into one list for product_id in the dataframe that have more than one record. An example of the desired output dataframe is below:
+----------+----------------------------------------------------------------+
|product_id|image |
+----------+----------------------------------------------------------------+
|3 |[{"url":"path3","default":true},{"url":"path5","default":false},
+----------+-----------------------------------------------------------------
{"url":"path7","default":true},{"url":"path8","default":false}]|
---------------------------------------------------------------+
I've tried solutions like the one below:
# Explode the nested list of dictionaries
exploded_df = df.select(col("product_id"), explode(col("image")).alias("image"))
# Collect unique elements using collect_set
papi_image_merged_df2 = exploded_df.groupBy("product_id").agg(collect_set("image").alias("image"))
but it creates a nested list of lists as output when combining the records that have the same product_id:
+----------+----------------------------------------------------------------+
|product_id|image |
+----------+----------------------------------------------------------------+
|3 |[[{"url":"path3","default":true},{"url":"path5","default":false}],
+----------+-----------------------------------------------------------------
[{"url":"path7","default":true},{"url":"path8","default":false}]]|
-----------------------------------------------------------------+
can anyone suggest how to get an output like the desired output example I showed above, where it creates a list of just the unique dictionaries?