0

I have a flat dataframe df_flat (no nested structure) which I need to save in a specific structure. The resulting schema should look like this:

|-- risk: struct (nullable = true)
|    |-- last_month: struct (nullable = true)
|    |    |-- activity: struct (nullable = true)
|    |    |    |-- payments: struct (nullable = true)
|    |    |    |    |-- summary: struct (nullable = true)
|    |    |    |    |    |-- amount: struct (nullable = true)
|    |    |    |    |    |    |-- total: long (nullable = true)
|    |    |    |    |    |-- total: long (nullable = true)

I have a JSON file json_reference_df with only one record used as a reference with the schema above. I tried to leverage that the structure exists there already by appending my flat dataframe and the JSON file by:

df_1 = df_flat.unionByName(json_reference_df, allowMissingColumns=True)
df_1.printSchema()

#|-- last_month_symmary_total: long (nullable = true)
#|-- risk: struct (nullable = true)
#|    |-- last_month: struct (nullable = true)
#|    |    |-- activity: struct (nullable = true)
#|    |    |    |-- payments: struct (nullable = true)
#|    |    |    |    |-- summary: struct (nullable = true)
#|    |    |    |    |    |-- amount: struct (nullable = true)
#|    |    |    |    |    |    |-- total: long (nullable = true)
#|    |    |    |    |    |-- total: long (nullable = true)

Since I'm using Spark 3.1+ I tried to use withFields to replace the null values of the appended nested structure with the value of the column in the flat dataframe last_month_symmary_total, but the "replaced" values are still null:

df_1.withColumn('risk', 
            F.col('risk').withField('last_month.activity.payments.summary', F.struct(
                F.col('last_month_symmary_total').alias('total')
            ) ) )\
.select('last_month_symmary_total', 'risk.last_month.activity.payments.summary.total')\
.show()

#+------------------------+-----+
#|last_month_symmary_total|total|
#+------------------------+-----+
#|                      15| null|
#|                      29| null|
#|                       0| null|
#|                       5| null|
#|                       8| null|
#+------------------------+-----+
  1. Would using the pre existing structure in the json_reference_df dataframe be a good way to tackle this?
  2. Why aren't the values being replaced and still null?
  3. What would be another approach to create the nested structure and and save the current values in the flat dataframe?
jcbslmn
  • 1
  • 1
  • Does this answer your question? [update nested struct with null values](https://stackoverflow.com/questions/75490013/update-nested-struct-with-null-values) – SamJ Feb 28 '23 at 15:06

0 Answers0