0

I'm struggling with the correct syntax to flatten some data.

I have a dlt table with a column (named lorem for the sake of the example) where each row looks like this:

[{"field1": {"field1_1": null, "field1_2": null}, 
  "field2": "blabla", "field3": 13209914, 
  "field4": {"field4_1": null, "field4_2": null}, "field5": 4}, ...
]

I want my output to create a new table based on the first that basically creates a row per each element in the array I shared above.

Table should look like:

|field1_1|field1_2|field2|field3|field4_1|field4_2|field_5|
|:-------|:-------|:-----|:-----|:-------|:-------|:------|
|null|null|blabla|13209914|null|null|4|

However when I explode like: select(explode("lorem")) I do not get the wanted output, instead I get only field 0 and exploded and the other fields except everything inside field4.

My question is, in what other way should I be flattening this data? I can provide a clearer example if needed.

Alex Ott
  • 80,552
  • 8
  • 87
  • 132
António Mendes
  • 173
  • 1
  • 10
  • That is not valid JSON, eg all the quotes are missing. Does your data actually look like that? I think if you can spend a bit more effort into making your sample data accurate and provide your expected results (what do you want it to look like after?) I am sure someone will be able to help you. – wBob Oct 16 '21 at 10:53
  • Edited and added a much better JSON representation of what I'm dealing with – António Mendes Oct 18 '21 at 13:54
  • That's useful, thank you. I am still not 100% clear on what your expected results should be? Do you want one row per single property in the JSON, irrespective of whether it's nested or not? Please add a table which describes your desired output. – wBob Oct 18 '21 at 14:34
  • Done! The table is parsed correctly in the preview but not when I post it though – António Mendes Oct 18 '21 at 14:39

1 Answers1

1

Use withColumn to add the additional columns you need. A simple example:

%%pyspark
from pyspark.sql.functions import col

df = spark.read.json("abfss://somelake@somestorage.dfs.core.windows.net/raw/flattenJson.json")

df2 = df \
    .withColumn("field4_1", col("field4.field4_1")) \
    .withColumn("field4_2", col("field4.field4_2"))

df2.show()

My results:

enter image description here

wBob
  • 13,710
  • 3
  • 20
  • 37