I have a PySpark DataFrame df with the columns product_id and image. The product_id column is a unique identifier for the product. The image column is a list of dictionaries, each containing the url and default setting for an image link associated with the product. The lists of dictionaries have the form:

[{"url":"path","default":true},{"url":"path2","default":false}]

Some product_id values in df have multiple records with different lists, for example:

+----------+----------------------------------------------------------------+
|product_id|image                                                           |
+----------+----------------------------------------------------------------+
|3         |[{"url":"path3","default":true},{"url":"path5","default":false}]|
+----------+----------------------------------------------------------------+
|3         |[{"url":"path7","default":true},{"url":"path8","default":false}]|
+----------+----------------------------------------------------------------+

I would like to write PySpark or Spark SQL that combines these lists of dictionaries into a single list for every product_id that has more than one record. An example of the desired output DataFrame is below:

+----------+-------------------------------------------------------------------------------------------------------------------------------+
|product_id|image                                                                                                                          |
+----------+-------------------------------------------------------------------------------------------------------------------------------+
|3         |[{"url":"path3","default":true},{"url":"path5","default":false},{"url":"path7","default":true},{"url":"path8","default":false}]|
+----------+-------------------------------------------------------------------------------------------------------------------------------+

I've tried solutions like the one below:

from pyspark.sql.functions import col, collect_set, explode

# Explode the list of dictionaries so each dictionary gets its own row
exploded_df = df.select(col("product_id"), explode(col("image")).alias("image"))

# Collect the unique elements per product_id using collect_set
papi_image_merged_df2 = exploded_df.groupBy("product_id").agg(collect_set("image").alias("image"))

However, this produces a nested list of lists when it combines the records that share a product_id:

+----------+-----------------------------------------------------------------------------------------------------------------------------------+
|product_id|image                                                                                                                              |
+----------+-----------------------------------------------------------------------------------------------------------------------------------+
|3         |[[{"url":"path3","default":true},{"url":"path5","default":false}],[{"url":"path7","default":true},{"url":"path8","default":false}]]|
+----------+-----------------------------------------------------------------------------------------------------------------------------------+

Can anyone suggest how to get the desired output shown above, that is, a single flat list containing only the unique dictionaries?

user3476463
  • Does this answer your question? [GroupBy and concat array columns pyspark](https://stackoverflow.com/questions/48406304/groupby-and-concat-array-columns-pyspark) – Emma Aug 28 '23 at 21:30

1 Answer

Try this:

import pyspark.sql.functions as f
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

df = spark.createDataFrame([
  (3, [{'url': 'path1', 'default': True}, {'url': 'path2', 'default': False}]),
  (3, [{'url': 'path3', 'default': True}, {'url': 'path4', 'default': False}])
], ['product_id', 'image'])

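# Gather every image list per product into an array of arrays,
# then flatten that into a single array
df = (
  df
  .groupBy('product_id')
  .agg(f.collect_list(f.col('image')).alias('images'))
  .select('product_id', f.flatten(f.col('images')).alias('images'))
)

df.show(truncate=False)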

And the output is:

+----------+--------------------------------------------------------------------------------------------------------------------------------------+
|product_id|images                                                                                                                                |
+----------+--------------------------------------------------------------------------------------------------------------------------------------+
|3         |[{default -> true, url -> path1}, {default -> false, url -> path2}, {default -> true, url -> path3}, {default -> false, url -> path4}]|
+----------+--------------------------------------------------------------------------------------------------------------------------------------+
ARCrow
  • Thank you for the suggestion. I tried it, but I get the error: "cannot resolve 'flatten(image)' due to data type mismatch: The argument should be an array of arrays, but 'image' is of array type.;" Do you have a suggestion for how to fix this, or how to cast to a different data type? – user3476463 Aug 29 '23 at 15:16
  • @user3476463, maybe when you copied the code you wrote `flatten('image')` instead of `flatten('images')`? – ARCrow Aug 29 '23 at 16:17