I have a Dataframe with rows which will be saved to different target tables. Right now, I'm finding the unique combination of parameters to determine the target table, iterating over the Dataframe and filtering, then writing.
Something similar to this:
df = spark.load.json(directory).repartition('client', 'region')
unique_clients_regions = [(group.client, group.region) for group in df.select('client', 'region').distinct().collect()]
for client, region in unique_clients_regions:
(df
.filter(f"client = '{client}' and region = '{region}'")
.select(
...
)
.write.mode("append")
.saveAsTable(f"{client}_{region}_data")
)
Is there a way to map the write operation to different groupBy
groups instead of having to iterate over the distinct set? I made sure to repartition by client
and region
to try and speed up performance of the filter.