In Databricks I have N Delta tables, one per store, each listing that store's products with this schema:
store_1:
store | product | sku |
---|---|---|
1 | prod 1 | abc |
1 | prod 2 | def |
1 | prod 3 | ghi |
store_2:
store | product | sku |
---|---|---|
2 | prod 1 | abc |
2 | prod 10 | xyz |
2 | prod 23 | ghi |
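For reference, here is a minimal reproducible version of the two tables (a sketch; in Databricks `spark` is predefined, and the temp views stand in for the actual Delta tables):

```python
# minimal sample data standing in for the real Delta tables
df_store_1 = spark.createDataFrame(
    [(1, "prod 1", "abc"), (1, "prod 2", "def"), (1, "prod 3", "ghi")],
    ["store", "product", "sku"],
)
df_store_2 = spark.createDataFrame(
    [(2, "prod 1", "abc"), (2, "prod 10", "xyz"), (2, "prod 23", "ghi")],
    ["store", "product", "sku"],
)
# register as temp views so they can be queried with spark.sql
df_store_1.createOrReplaceTempView("df_store_1")
df_store_2.createOrReplaceTempView("df_store_2")
```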
I need to find every product from store 1 that also appears in store 2, matching on the sku column, so I'm running a query like this:
```sql
select st1.product, st2.product, st1.sku
from df_store_1 st1
join df_store_2 st2 on st1.sku = st2.sku
```
That should return these matching rows (note that prod 3 and prod 23 share the sku ghi):

product_1 | product_2 | sku |
---|---|---|
prod 1 | prod 1 | abc |
prod 3 | prod 23 | ghi |
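For what it's worth, my understanding is that the DataFrame API equivalent would be something like this (a sketch, assuming the tables are readable via `spark.table`):

```python
from pyspark.sql import functions as F

st1 = spark.table("df_store_1").alias("st1")
st2 = spark.table("df_store_2").alias("st2")

# equi-join on sku; alias the two product columns so they don't collide
matches = st1.join(st2, on="sku").select(
    F.col("st1.product").alias("product_1"),
    F.col("st2.product").alias("product_2"),
    "sku",
)
```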
I need to do this for all the stores and products, and since I'm a beginner with PySpark, my plan was to build a list of all possible pairs of stores and loop over them like this:
```python
import pandas as pd

list_dfs = []
for store1, store2 in list_all_pairs_stores:
    # the table names must be interpolated; alias the product columns to avoid a clash
    query = (f"select st1.product as product_1, st2.product as product_2, st1.sku "
             f"from {store1} st1 join {store2} st2 on st1.sku = st2.sku")
    list_dfs.append(spark.sql(query).toPandas())

all_equal_products = pd.concat(list_dfs, ignore_index=True)  # stack rows, not axis=1
```
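Here `list_all_pairs_stores` is meant to be every unordered pair of store table names; `store_tables` below is a hypothetical list of the N table names:

```python
from itertools import combinations

# hypothetical list of the N store table names
store_tables = [f"store_{i}" for i in range(1, 4)]  # e.g. N = 3
list_all_pairs_stores = list(combinations(store_tables, 2))
```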
What would be an efficient way to parallelize these queries?
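One direction I've been wondering about (a rough sketch, not a tested solution, reusing the hypothetical `store_tables` list from above): union all the store tables into one DataFrame and self-join on sku, so Spark can run a single distributed job instead of N*(N-1)/2 sequential queries:

```python
from functools import reduce
from pyspark.sql import functions as F

# stack all store tables into one DataFrame (they share the same schema)
dfs = [spark.table(name) for name in store_tables]
all_stores = reduce(lambda a, b: a.unionByName(b), dfs)

a = all_stores.alias("a")
b = all_stores.alias("b")
matches = (
    a.join(b, on="sku")
    .where(F.col("a.store") < F.col("b.store"))  # keep each store pair only once
    .select(
        F.col("a.store").alias("store_1"),
        F.col("a.product").alias("product_1"),
        F.col("b.store").alias("store_2"),
        F.col("b.product").alias("product_2"),
        "sku",
    )
)
```

Would a single self-join like this be preferable to running the pairwise queries concurrently?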