I have JSON stored as a string in the following format:
{
'aaa':'',
'bbb':'',
'ccc':{
'ccc':[{dict of values}] //list of dictionaries
},
'ddd':'',
'eee':{
'eee':[{dict of values},{dict of values},{dict of values}] //list of dictionaries
}
}
I have nearly 70 million JSON strings in this format. I thought of using json_normalize from pandas, but because of the record count I am considering PySpark instead. Could someone guide me on the best way to process these JSON strings and store them in a Glue table? I need an output with all the JSON keys as columns and their values as rows, and I would store the result as Parquet files.
Also, in some cases not all of the keys will be present; in that case I need to store null/None for that key for that JSON string.
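To make the missing-key requirement concrete, in plain Python it is just `dict.get` defaulting to None (a tiny illustration with a made-up record, not my real data):

```python
import json

# made-up record where "bbb" is absent
record = json.loads('{"aaa":"1","ddd":"x"}')

# every expected key should appear in the output, absent ones as None
row = {k: record.get(k) for k in ("aaa", "bbb", "ddd")}
# row["bbb"] is None
```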
Sample input: {"aaa":"123","bbb":"asdncj","ccc":{"ccc":[{"ccc1":true,"ccc2":"abcd","ccc3":"abcd"},{"ccc1":true,"ccc2":"abcde","ccc3":"abcdee"},{"ccc1":true,"ccc2":"abcdef","ccc3":"abcdefe"}]},"ddd":"aabcd","eee":{"eee":[{"eee1":"123","eee2":"1","eee3":"hcudh"},{"eee1":"2234","eee2":"1","eee3":"hhcb"}]}}
As output, I want three tables: one for the keys aaa, bbb, and ddd; a second for the keys inside ccc; and a third for the keys inside eee.