
I have, for example, three JSON documents stored as strings in a dataframe column:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

df = spark.createDataFrame(
    [
        ["""{"header":{"title":"ABC123","name":"test"},"results":{"data_A":[{"key":"value"},{"key":"value"}]}, "data_B" : {"a":"1"}}"""],
        ["""{"header":{"title":"ABC123","name":"test"},"results":{"data_A":{"key":"value"}}, "data_B" : null}"""],
        ["""{"header":{"title":"ABC123","name":"test"},"results":{"data_A":[{"key":"value"},{"key":"value"}]}, "data_B" : [{"a":"1"}, {"a":"2"}]}"""],
    ],
    ['payload'],
)

As you can see, data_A and data_B can hold multiple values, but sometimes only a single value or nothing (null). How can I get this data into one dataframe with a generic schema?
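
One direction I'm considering (just a sketch; the normalize helper below is my own invention, not from any library): pre-normalize the raw strings in plain Python so that data_A and data_B are always arrays, then let spark.read.json infer a single schema over all rows.

import json

def normalize(payload):
    # Hypothetical helper: wrap single objects in a one-element list so that
    # data_A and data_B always have the same shape; nulls are left untouched.
    doc = json.loads(payload)
    results = doc.get("results") or {}
    if results.get("data_A") is not None and not isinstance(results["data_A"], list):
        results["data_A"] = [results["data_A"]]
    if doc.get("data_B") is not None and not isinstance(doc["data_B"], list):
        doc["data_B"] = [doc["data_B"]]
    return json.dumps(doc)

# Normalize the raw strings, then let Spark infer one schema over all rows.
normalized = df.rdd.map(lambda row: normalize(row.payload))
parsed = spark.read.json(normalized)
parsed.printSchema()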

When I try the code from this link, the result is null because a single value cannot be inserted into an ArrayType column: https://i.stack.imgur.com/O3Tlu.png
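
For reference, this is roughly what that approach looks like on my data (the fixed schema below is my own reconstruction of the linked code): parsing with a hard-coded ArrayType schema gives null for the rows where data_A or data_B is a single object instead of a list.

from pyspark.sql import functions as F
from pyspark.sql.types import ArrayType, StringType, StructField, StructType

# Reconstructed fixed schema where data_A and data_B are declared as arrays of structs.
fixed_schema = StructType([
    StructField("header", StructType([
        StructField("title", StringType()),
        StructField("name", StringType()),
    ])),
    StructField("results", StructType([
        StructField("data_A", ArrayType(StructType([StructField("key", StringType())]))),
    ])),
    StructField("data_B", ArrayType(StructType([StructField("a", StringType())]))),
])

df.withColumn("parsed", F.from_json("payload", fixed_schema)).show(truncate=False)
# Rows where data_A or data_B is a single object (not a list) come back as null,
# which is the problem described above.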

I also tried the solution from this question, but it's not working: Unify schema across multiple rows of json strings in Spark Dataframe

The desired result is to have the 'payload' column as a struct field so we can navigate through the data. The target schema is shown below; a short usage sketch follows it.

payload:struct
---data_B:array
------element:struct
---------a:string
---header:struct
------name:string
------title:string
---results:struct
------data_A:array
---------element:struct
------------key:string
------elt:array
---------element:struct
------------key:string
---------------test:struct
------------------test2:struct
---------------------elt:array
------------------------element:struct
---------------------------A:long
---------------------------B:long
------test:struct
---------a:struct
------------elt:array
---------------element:array
------------------element:struct
---------------------ab:long
---------b:string
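
To make concrete what I mean by "navigate through the data", this is the kind of usage I'm after, building on the (hypothetical) normalization sketch above:

from pyspark.sql import functions as F

# Re-wrap the top-level columns into a single 'payload' struct and drill into it.
result = parsed.select(F.struct("data_B", "header", "results").alias("payload"))

result.select(
    "payload.header.title",
    F.explode_outer("payload.results.data_A").alias("data_A_element"),
).show(truncate=False)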