
I have a mapping/lookup table (a DataFrame) that tells me which values to extract from a highly nested JSON/dictionary. These values have to be inserted as column values into a Delta table. How do I do this while leveraging PySpark's parallelism?
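For concreteness, a minimal sketch of the shapes involved (the column names, JSON layout, and values here are placeholders I made up for illustration):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Hypothetical nested JSON, standing in for the real (much deeper) one.
nested_json = {
    "customer": {
        "profile": {"name": "Alice", "age": 30},
    }
}

# Mapping DataFrame: which Delta column comes from which JSON path.
mapping_df = spark.createDataFrame(
    [("name", "customer.profile.name"),
     ("age", "customer.profile.age")],
    ["target_col", "json_path"],
)
```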

I know I can collect() the mapping DataFrame, open the JSON file, update each column of a row of a temp DataFrame, and append that to the Delta table, but that will not run in parallel.
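Continuing the sketch above, this is roughly what that working-but-serial version looks like (the file path, the table name, and the get_by_path helper are all my own placeholders):

```python
import json
from functools import reduce
from pyspark.sql import Row

def get_by_path(obj, dotted_path):
    # Walk a path like "customer.profile.name" down through nested dicts.
    return reduce(lambda d, k: d[k], dotted_path.split("."), obj)

with open("/dbfs/tmp/nested.json") as f:  # hypothetical path
    nested = json.load(f)

# collect() pulls the whole mapping onto the driver,
# so this extraction loop runs serially there.
row = {m["target_col"]: get_by_path(nested, m["json_path"])
       for m in mapping_df.collect()}

(spark.createDataFrame([Row(**row)])
      .write.format("delta").mode("append")
      .saveAsTable("my_delta_table"))  # hypothetical table name
```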

Alternatively, I can broadcast the dict/JSON, iterate over the mapping DataFrame using foreach(), and upsert my Delta table according to a when() condition. But Column.when() does not let me update a Delta table, nor does DeltaTable.merge() let me compare a DataFrame with a dict.
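This is the shape of the dead end in the second attempt, reusing the broadcast JSON and helper from above (a sketch of where it breaks down, not working code):

```python
from delta.tables import DeltaTable

# Broadcast the nested JSON so every executor can read it cheaply.
bc_json = spark.sparkContext.broadcast(nested)

target = DeltaTable.forName(spark, "my_delta_table")  # hypothetical name

def upsert_one(mapping_row):
    value = get_by_path(bc_json.value, mapping_row["json_path"])
    # Dead end: when() only builds a Column expression and cannot write
    # to a table, and target.merge() expects a source *DataFrame*,
    # while all I have here is a plain Python value.
    ...

# foreach() runs on the executors, where I cannot call merge() anyway.
mapping_df.foreach(upsert_one)
```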
