
I have JSON stored as a string in the format below:

{
  "aaa": "",
  "bbb": "",
  "ccc": {
    "ccc": [{dict of values}]   // list of dictionaries
  },
  "ddd": "",
  "eee": {
    "eee": [{dict of values}, {dict of values}, {dict of values}]   // list of dictionaries
  }
}

I have nearly 70 million JSON strings in this format. I considered using json_normalize from pandas, but because of the record count I am thinking of using PySpark instead. Could someone guide me on the best way to process these JSON strings and store them in a Glue table? I need an output with all the JSON keys as columns and their data as rows, stored as parquet files.

Also, in some cases not all the keys will be present; in that case, I need to store null/None for that key for that JSON string.

Sample input: {"aaa":"123","bbb":"asdncj","ccc":{"ccc":[{"ccc1":true,"ccc2":"abcd","ccc3":"abcd"},{"ccc1":true,"ccc2":"abcde","ccc3":"abcdee"},{"ccc1":true,"ccc2":"abcdef","ccc3":"abcdefe"}]},"ddd":"aabcd","eee":{"eee":[{"eee1":"123","eee2":"1","eee3":"hcudh"},{"eee1":"2234","eee2":"1","eee3":"hhcb"}]}}

Output: I want to have 3 tables: one for the keys aaa, bbb, and ddd; a second for the keys nested in ccc; and a third for the keys nested in eee.
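For the sample input above, the three tables would be:

Table 1 (scalar keys):

    aaa | bbb    | ddd
    123 | asdncj | aabcd

Table 2 (ccc):

    ccc1 | ccc2   | ccc3
    true | abcd   | abcd
    true | abcde  | abcdee
    true | abcdef | abcdefe

Table 3 (eee):

    eee1 | eee2 | eee3
    123  | 1    | hcudh
    2234 | 1    | hhcb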

  • Can you add some sample input/output? Then we can try to flatten the JSON using PySpark. – Anjaneya Tripathi Jun 07 '22 at 08:37
  • Hello, I have added the sample input and output in the question now – Amaravathi Satya Jun 07 '22 at 15:58
  • @AmaravathiSatya is the schema of the JSON going to be fixed? See if this helps (it might help you handle complex JSON): https://docs.databricks.com/delta/data-transformation/complex-types.html You might have to leverage the explode function [here](https://spark.apache.org/docs/latest/api/python/reference/api/pyspark.sql.functions.explode.html) (beware of identical column names; renaming columns might be needed). After this you can write to parquet using [this](https://spark.apache.org/docs/latest/sql-data-sources-parquet.html) (see the sketch after these comments). Not sure about the Glue table part. – teedak8s Jun 10 '22 at 21:24
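A minimal sketch of the approach described in the comment above, assuming the raw JSON strings sit one per line in text files on S3 (the paths below are placeholders, and the schema is built from the sample input in the question):

    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F
    from pyspark.sql.types import (
        ArrayType, BooleanType, StringType, StructField, StructType,
    )

    spark = SparkSession.builder.appName("flatten-json").getOrCreate()

    # Explicit schema built from the sample input; keys missing from a
    # record are parsed as null, which covers the "not all keys present" case.
    schema = StructType([
        StructField("aaa", StringType()),
        StructField("bbb", StringType()),
        StructField("ccc", StructType([
            StructField("ccc", ArrayType(StructType([
                StructField("ccc1", BooleanType()),
                StructField("ccc2", StringType()),
                StructField("ccc3", StringType()),
            ]))),
        ])),
        StructField("ddd", StringType()),
        StructField("eee", StructType([
            StructField("eee", ArrayType(StructType([
                StructField("eee1", StringType()),
                StructField("eee2", StringType()),
                StructField("eee3", StringType()),
            ]))),
        ])),
    ])

    # One JSON string per line; spark.read.text exposes it as column "value".
    raw = spark.read.text("s3://your-bucket/raw/")  # placeholder path
    parsed = raw.select(F.from_json("value", schema).alias("j"))

    # Table 1: the scalar keys.
    scalars = parsed.select("j.aaa", "j.bbb", "j.ddd")

    # Tables 2 and 3: explode_outer keeps records where the array is
    # missing, emitting a single all-null row instead of dropping them.
    ccc = parsed.select(F.explode_outer("j.ccc.ccc").alias("c")).select("c.*")
    eee = parsed.select(F.explode_outer("j.eee.eee").alias("e")).select("e.*")

    scalars.write.mode("overwrite").parquet("s3://your-bucket/out/scalars/")
    ccc.write.mode("overwrite").parquet("s3://your-bucket/out/ccc/")
    eee.write.mode("overwrite").parquet("s3://your-bucket/out/eee/")

Once the parquet files are on S3, a Glue crawler pointed at the output paths can register them as tables in the Glue Data Catalog.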

1 Answer


If you don't necessarily need to use PySpark (e.g. you just need to read the data), then I would recommend using the built-in json module. Below is an example of its use:

import json

# Parse the file's contents into Python objects (dicts, lists, strings, ...).
with open("your_file.json", "r") as f:
    raw_json = json.load(f)
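If the goal is the three tables from the question, a sketch of flattening one record with plain Python could look like this (field names are taken from the sample input in the question; this is an illustration, not a full 70-million-record pipeline):

import json

def flatten(record: str):
    """Split one JSON string into rows for the three target tables."""
    d = json.loads(record)
    # dict.get returns None for missing keys, matching the null requirement.
    scalars = {k: d.get(k) for k in ("aaa", "bbb", "ddd")}
    # Nested lists of dicts; default to an empty list when the key is absent.
    ccc_rows = (d.get("ccc") or {}).get("ccc") or []
    eee_rows = (d.get("eee") or {}).get("eee") or []
    return scalars, ccc_rows, eee_rows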

Could you also elaborate on how the JSON data needs to be formatted?
