I have a mapping/lookup table (a DataFrame) according to which I have to extract values from a highly nested JSON/dictionary. These values have to be inserted as column values into a Delta table. How do I do this while leveraging PySpark's parallelism?
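For concreteness, here is a minimal sketch of the shapes involved (the field names, dotted paths, and column names are made up for illustration):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Hypothetical nested JSON/dict I need to extract values from.
nested = {
    "customer": {
        "profile": {"name": "Alice", "age": 30},
        "orders": {"last": {"total": 99.5}},
    }
}

# Hypothetical mapping: target Delta column -> dotted path into the dict.
mapping_df = spark.createDataFrame(
    [
        ("name", "customer.profile.name"),
        ("age", "customer.profile.age"),
        ("last_total", "customer.orders.last.total"),
    ],
    ["target_column", "json_path"],
)
```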
I know I can `collect()` the mapping DataFrame, open the JSON file, update each column of a row of a temporary DataFrame, and append it to the Delta table, but that will not run in parallel.
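A rough sketch of that approach, reusing the hypothetical `nested` and `mapping_df` from above (in my real case `nested` comes from `json.load()` on the file; the path-walking helper and the table name are mine):

```python
from functools import reduce
from pyspark.sql import Row

def extract(d, dotted_path):
    # Walk the nested dict along a dotted path such as "customer.profile.name".
    return reduce(lambda acc, key: acc[key], dotted_path.split("."), d)

# Everything below happens on the driver, one mapping row at a time.
row = {
    m["target_column"]: extract(nested, m["json_path"])
    for m in mapping_df.collect()
}

temp_df = spark.createDataFrame([Row(**row)])
temp_df.write.format("delta").mode("append").saveAsTable("my_delta_table")  # hypothetical table
```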
Alternatively, I can `broadcast` the dict/JSON, iterate over the mapping DataFrame using `foreach()`, and upsert my Delta table according to a `when` condition. But `Column.when()` does not allow me to update a Delta table, nor does `DeltaTable.merge()` (from `delta.tables`) allow me to compare a DataFrame against a dict.
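A sketch of this second attempt, again with the hypothetical names from above and the `extract` helper from the previous sketch, with comments marking where it falls apart:

```python
# Ship the dict to the executors once.
bc = spark.sparkContext.broadcast(nested)

def handle(mapping_row):
    # This runs on an executor. Looking the value up works fine...
    value = extract(bc.value, mapping_row["json_path"])
    # ...but no SparkSession or DeltaTable handle is available here,
    # so I cannot upsert the Delta table from inside foreach().

mapping_df.foreach(handle)
```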