I have a flat dataframe df_flat
(no nested structure) which I need to save in a specific structure. The resulting schema should look like this:
|-- risk: struct (nullable = true)
| |-- last_month: struct (nullable = true)
| | |-- activity: struct (nullable = true)
| | | |-- payments: struct (nullable = true)
| | | | |-- summary: struct (nullable = true)
| | | | | |-- amount: struct (nullable = true)
| | | | | | |-- total: long (nullable = true)
| | | | | |-- total: long (nullable = true)
I have a JSON file json_reference_df
with only one record used as a reference with the schema above. I tried to leverage that the structure exists there already by appending my flat dataframe and the JSON file by:
df_1 = df_flat.unionByName(json_reference_df, allowMissingColumns=True)
df_1.printSchema()
#|-- last_month_symmary_total: long (nullable = true)
#|-- risk: struct (nullable = true)
#| |-- last_month: struct (nullable = true)
#| | |-- activity: struct (nullable = true)
#| | | |-- payments: struct (nullable = true)
#| | | | |-- summary: struct (nullable = true)
#| | | | | |-- amount: struct (nullable = true)
#| | | | | | |-- total: long (nullable = true)
#| | | | | |-- total: long (nullable = true)
Since I'm using Spark 3.1+ I tried to use withFields to replace the null values of the appended nested structure with the value of the column in the flat dataframe last_month_symmary_total
, but the "replaced" values are still null:
df_1.withColumn('risk',
F.col('risk').withField('last_month.activity.payments.summary', F.struct(
F.col('last_month_symmary_total').alias('total')
) ) )\
.select('last_month_symmary_total', 'risk.last_month.activity.payments.summary.total')\
.show()
#+------------------------+-----+
#|last_month_symmary_total|total|
#+------------------------+-----+
#| 15| null|
#| 29| null|
#| 0| null|
#| 5| null|
#| 8| null|
#+------------------------+-----+
- Would using the pre existing structure in the
json_reference_df
dataframe be a good way to tackle this? - Why aren't the values being replaced and still
null
? - What would be another approach to create the nested structure and and save the current values in the flat dataframe?