
I am working in Azure Synapse PySpark on flattening nested JSON data. The JSON file contains objects with nested data as shown below; here cords is of type struct for the 1st and 3rd records and a scalar for the 2nd record. When I print the schema using df.printSchema() it shows the cords type as string, but if I remove the 2nd JSON object it shows the type as struct. I want to filter the JSON objects based on the cords data type so that I can flatten the cords struct data, whereas for the 2nd record no flattening is required. Can anyone help me with this?

{"dateTime":"2020-11-29T13:51:16.168659Z","cords":{"x_al":0.0191342489,"y_al":-0.1200904993}}

{"dateTime":"2020-12-29T13:51:21.457739Z","cords":51.0}

{"dateTime":"2021-10-29T13:51:26.634289Z","cords":{"x_al":0.01600042489,"y_al":-0.1200900993}}

  • What are your expected results? What have you tried so far? It would be better if you provided a working example: create a dataframe and add the data to make it easier for people to work on. – wBob Nov 19 '21 at 18:53
  • @wBob Thank you for your reply. The goal is to extract the JSON data from the source file and flatten all of its nested JSON data. The source file has huge, complex data and the JSON above is only a small sample of it. In the source file a few JSON objects have values like "cords":51.0, hence df.printSchema() treats the cords data type as string, and because of this I am unable to flatten the cords data. Just for testing purposes I copied the 1st and 3rd JSON objects into a separate valid.json file; flattening the nested data from valid.json works like a charm because df.printSchema() treats the cords data type as struct. – Palaksha K S Nov 22 '21 at 03:07

1 Answer


You can import pandas into your code and then load the data with it as below:

df = pd.DataFrame([flatten_json(data)])

In the above line, we assume that "data" is a variable holding the JSON-structured data (a Python dict).

There are also multiple scenarios for flattening the data, depending on which of your three JSON shapes you have (see the sketch after this list):

  • If you have just a dict, then you can use flatten_json(data)
  • If you have multiple dicts like [{},{},{}], then you can use [flatten_json(x) for x in data]
  • If you have multiple values like {1: {}, 2: {}, 3: {}}, then you should use [flatten_json(data[key]) for key in data.keys()]
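A minimal sketch of the pandas route for your three records, assuming flatten_json is a small helper like the one in the linked blog (the helper below is a simplified version written here for illustration, not the exact code from the blog):

import json
import pandas as pd

def flatten_json(y, sep="_"):
    # Recursively flatten nested dicts into a single-level dict,
    # joining keys with the separator, e.g. {"cords": {"x_al": 1}} -> {"cords_x_al": 1}.
    out = {}
    def _flatten(node, prefix=""):
        if isinstance(node, dict):
            for key, value in node.items():
                _flatten(value, prefix + key + sep)
        else:
            out[prefix[:-len(sep)]] = node
    _flatten(y)
    return out

# One JSON object per line in the source file (hypothetical local copy of the data).
with open("mixed.json") as f:
    records = [json.loads(line) for line in f if line.strip()]

df = pd.DataFrame([flatten_json(r) for r in records])
# columns: dateTime, cords_x_al, cords_y_al, cords (NaN where a shape doesn't apply)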

For a better understanding in PySpark, refer to this blog; thanks to towardsdatascience for the clear explanation.
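If you want to stay in PySpark, here is a rough sketch of one way to split the records by the shape of cords. It assumes that, because Spark inferred the column as string, the struct records keep their raw JSON text in that column; the column names come from your sample, the rest is an untested assumption:

from pyspark.sql import functions as F
from pyspark.sql.types import StructType, StructField, DoubleType

cords_schema = StructType([
    StructField("x_al", DoubleType()),
    StructField("y_al", DoubleType()),
])

# `spark` is the Synapse notebook session; cords is inferred as string because of the mixed types.
df = spark.read.json("mixed.json")

# Records whose cords value is a JSON object start with "{" in the string column.
struct_df = df.filter(F.col("cords").startswith("{")) \
              .withColumn("cords", F.from_json("cords", cords_schema)) \
              .select("dateTime", "cords.x_al", "cords.y_al")

# Scalar records such as "cords":51.0 need no flattening.
scalar_df = df.filter(~F.col("cords").startswith("{"))

struct_df.show(truncate=False)
scalar_df.show(truncate=False)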

SaiKarri-MT
  • Thank you for your reply; as mentioned in my comments above, using normal PySpark code I am able to flatten the nested JSON data successfully as long as cords doesn't have any scalar values like "cords":51.0. I hope you see my problem. – Palaksha K S Nov 22 '21 at 03:29