PySpark problem flattening array with nested JSON and other elements

Question

I'm struggling with the correct syntax to flatten some data.

I have a dlt table with a column (named lorem for the sake of the example) where each row looks like this:

[{"field1": {"field1_1": null, "field1_2": null}, 
  "field2": "blabla", "field3": 13209914, 
  "field4": {"field4_1": null, "field4_2": null}, "field5": 4}, ...
]

I want my output to create a new table based on the first that basically creates a row per each element in the array I shared above.

Table should look like:

|field1_1|field1_2|field2|field3|field4_1|field4_2|field_5|
|:-------|:-------|:-----|:-----|:-------|:-------|:------|
|null|null|blabla|13209914|null|null|4|

However when I explode like: select(explode("lorem")) I do not get the wanted output, instead I get only field 0 and exploded and the other fields except everything inside field4.

My question is, in what other way should I be flattening this data? I can provide a clearer example if needed.

That is not valid JSON, eg all the quotes are missing. Does your data actually look like that? I think if you can spend a bit more effort into making your sample data accurate and provide your expected results (what do you want it to look like after?) I am sure someone will be able to help you. — wBob, Oct 16 '21 at 10:53
Edited and added a much better JSON representation of what I'm dealing with — António Mendes, Oct 18 '21 at 13:54
That's useful, thank you. I am still not 100% clear on what your expected results should be? Do you want one row per single property in the JSON, irrespective of whether it's nested or not? Please add a table which describes your desired output. — wBob, Oct 18 '21 at 14:34
Done! The table is parsed correctly in the preview but not when I post it though — António Mendes, Oct 18 '21 at 14:39

score 1 · Accepted Answer · answered Oct 18 '21 at 16:26

Use withColumn to add the additional columns you need. A simple example:

%%pyspark
from pyspark.sql.functions import col

df = spark.read.json("abfss://somelake@somestorage.dfs.core.windows.net/raw/flattenJson.json")

df2 = df \
    .withColumn("field4_1", col("field4.field4_1")) \
    .withColumn("field4_2", col("field4.field4_2"))

df2.show()

My results:

PySpark problem flattening array with nested JSON and other elements

1 Answers1