
I have a DataFrame whose rows need to be saved to different target tables. Right now, I find the unique combinations of the parameters that determine the target table, then iterate over those combinations, filter the DataFrame, and write each subset.

Something similar to this:

df = spark.read.json(directory).repartition('client', 'region')

unique_clients_regions = [(group.client, group.region) for group in df.select('client', 'region').distinct().collect()]

for client, region in unique_clients_regions:
  (df
   .filter(f"client = '{client}' and region = '{region}'")
   .select(
     ...
   )
   .write.mode("append")
   .saveAsTable(f"{client}_{region}_data") 
  )

Is there a way to map the write operation to the different groupBy groups instead of having to iterate over the distinct set? I made sure to repartition by client and region to try to speed up the filters.

TomNash

1 Answer


I cannot, in good conscience, advise anything built on this solution. That is really bad data architecture.
You should have only one table, partitioned by client and region. That will create a separate folder for each client/region pair, and you only need a single write at the end, with no loop and no collect.

spark.read.json(directory).write.saveAsTable(
    "data",
    mode="append",
    partitionBy=['client', 'region']
)
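With that layout, downstream reads can still target a single client/region pair and Spark prunes the other partition folders. A minimal sketch of such a read, assuming the table name "data" from above and placeholder client/region values:

# Read back only one client/region pair; partition pruning skips the other folders.
subset = spark.table("data").where("client = 'acme' AND region = 'us-east'")
subset.show()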
Steven
  • Thanks @Steven, I'm in the process of learning Spark and this kind of architecture. The issue is that there are different columns/schemas upstream of this per client/region, which can change over time, so a separate table for each made sense. – TomNash Aug 19 '21 at 14:06
  • @TomNash if the input is a single JSON file, it means the schema is the same for every client/region pair, so there is no need for multiple tables. – Steven Aug 19 '21 at 14:07
  • It's not. That's the issue. Multiple JSON records of differing schemas in one file. Trying to handle each individual one. – TomNash Aug 19 '21 at 14:09
  • @TomNash That's not how you handle them. From the moment you do `df = spark.read.json(directory)`, df will have a global schema unifying all the JSON records you have, which means the final table will also have this unified schema. You can filter as much as you want, it won't change the schema. – Steven Aug 19 '21 at 14:10
  • Yeah, that's the issue we've run into. I can `spark.read.text` the files and then use `get_json_object` on the filtered sets to get the localized schema for that group, without a global one being defined (roughly as sketched after this thread). The issue still remains of how to write rows of one DataFrame to different destinations without looping. The alternative is to load one file at a time into its own DataFrame and write to the table one row at a time. – TomNash Aug 19 '21 at 14:18
  • @TomNash so basically, you are showing a problem that does not fit your current situation. – Steven Aug 19 '21 at 14:48
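
For reference, the per-group parsing TomNash describes in the comments might look roughly like the sketch below. It is only a sketch under assumptions: each line of the input is one JSON record, get_json_object is used just to tag lines with their client and region, and a per-group schema is applied with from_json (one possible way to do the localized parsing); group_schemas is a hypothetical list of (client, region, schema) tuples supplied from elsewhere. It still loops over the groups, so it does not remove the iteration the question asks about.

from pyspark.sql import functions as F

# Load the raw text: one column named "value" holding each JSON record as a string.
raw = spark.read.text(directory)

# Tag every line with its client/region without committing to a global schema.
tagged = raw.select(
    F.col("value"),
    F.get_json_object("value", "$.client").alias("client"),
    F.get_json_object("value", "$.region").alias("region"),
)

# group_schemas is hypothetical: [(client, region, schema), ...] from the upstream contract.
for client, region, schema in group_schemas:
    (tagged
     .filter((F.col("client") == client) & (F.col("region") == region))
     .select(F.from_json("value", schema).alias("parsed"))
     .select("parsed.*")
     .write.mode("append")
     .saveAsTable(f"{client}_{region}_data"))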