
I have JSON stored as a string in the format below:

{
  "aaa": "",
  "bbb": "",
  "ccc": {
    "ccc": [{dict of values}]   // list of dictionaries
  },
  "ddd": "",
  "eee": {
    "eee": [{dict of values}, {dict of values}, {dict of values}]   // list of dictionaries
  }
}

I have nearly 70 million JSON strings in this format. I considered using json_normalize from pandas, but because of the record count I am thinking of using PySpark instead. Could someone guide me on the best way to process these JSON strings and store them in a Glue table? I need an output with all the JSON keys as columns and their data as rows, stored as parquet files.

Also, in some cases not all the keys will be present; in that case, I need to store null/None for that key for that JSON string.

Sample input: {"aaa":"123","bbb":"asdncj","ccc":{"ccc":[{"ccc1":true,"ccc2":"abcd","ccc3":"abcd"},{"ccc1":true,"ccc2":"abcde","ccc3":"abcdee"},{"ccc1":true,"ccc2":"abcdef","ccc3":"abcdefe"}]},"ddd":"aabcd","eee":{"eee":[{"eee1":"123","eee2":"1","eee3":"hcudh"},{"eee1":"2234","eee2":"1","eee3":"hhcb"}]}}

Output: I want to have 3 tables: one for the keys aaa, bbb, and ddd; a second for the keys nested in ccc; and a third for the keys nested in eee.
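For the sample input above, the three tables would be:

Table 1 (scalar keys):

    aaa | bbb    | ddd
    123 | asdncj | aabcd

Table 2 (ccc):

    ccc1 | ccc2   | ccc3
    true | abcd   | abcd
    true | abcde  | abcdee
    true | abcdef | abcdefe

Table 3 (eee):

    eee1 | eee2 | eee3
    123  | 1    | hcudh
    2234 | 1    | hhcb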

  • Can you add some sample input/output? Then we can try to flatten the JSON using PySpark. – Anjaneya Tripathi Jun 07 '22 at 08:37
  • Hello, I have added the sample input and output in the question now – Amaravathi Satya Jun 07 '22 at 15:58
  • @AmaravathiSatya is the schema of the JSON going to be fixed? See if this helps (it might help you handle complex JSON): https://docs.databricks.com/delta/data-transformation/complex-types.html You might have to leverage the explode function [here](https://spark.apache.org/docs/latest/api/python/reference/api/pyspark.sql.functions.explode.html) (beware of identical column names; renaming columns might be needed). After this you can write to parquet using [this](https://spark.apache.org/docs/latest/sql-data-sources-parquet.html) (see the sketch after these comments). Not sure about the Glue table part. – teedak8s Jun 10 '22 at 21:24
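A minimal sketch of the approach described in the comment above, assuming the raw JSON strings sit one per line in text files on S3 (the paths below are placeholders, and the schema is built from the sample input in the question):

    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F
    from pyspark.sql.types import (
        ArrayType, BooleanType, StringType, StructField, StructType,
    )

    spark = SparkSession.builder.appName("flatten-json").getOrCreate()

    # Explicit schema built from the sample input; keys missing from a
    # record are parsed as null, which covers the "not all keys present" case.
    schema = StructType([
        StructField("aaa", StringType()),
        StructField("bbb", StringType()),
        StructField("ccc", StructType([
            StructField("ccc", ArrayType(StructType([
                StructField("ccc1", BooleanType()),
                StructField("ccc2", StringType()),
                StructField("ccc3", StringType()),
            ]))),
        ])),
        StructField("ddd", StringType()),
        StructField("eee", StructType([
            StructField("eee", ArrayType(StructType([
                StructField("eee1", StringType()),
                StructField("eee2", StringType()),
                StructField("eee3", StringType()),
            ]))),
        ])),
    ])

    # One JSON string per line; spark.read.text exposes it as column "value".
    raw = spark.read.text("s3://your-bucket/raw/")  # placeholder path
    parsed = raw.select(F.from_json("value", schema).alias("j"))

    # Table 1: the scalar keys.
    scalars = parsed.select("j.aaa", "j.bbb", "j.ddd")

    # Tables 2 and 3: explode_outer keeps records where the array is
    # missing, emitting a single all-null row instead of dropping them.
    ccc = parsed.select(F.explode_outer("j.ccc.ccc").alias("c")).select("c.*")
    eee = parsed.select(F.explode_outer("j.eee.eee").alias("e")).select("e.*")

    scalars.write.mode("overwrite").parquet("s3://your-bucket/out/scalars/")
    ccc.write.mode("overwrite").parquet("s3://your-bucket/out/ccc/")
    eee.write.mode("overwrite").parquet("s3://your-bucket/out/eee/")

Once the parquet files are on S3, a Glue crawler pointed at the output paths can register them as tables in the Glue Data Catalog.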

1 Answer


If you don't necessarily need to use PySpark (e.g. you just need to read the data), then I would recommend using the built-in json module. Below is an example of its use:

import json

# Parse the file's contents into Python objects (dicts, lists, strings, ...).
with open("your_file.json", "r") as f:
    raw_json = json.load(f)
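If the goal is the three tables from the question, a sketch of flattening one record with plain Python could look like this (field names are taken from the sample input in the question; this is an illustration, not a full 70-million-record pipeline):

import json

def flatten(record: str):
    """Split one JSON string into rows for the three target tables."""
    d = json.loads(record)
    # dict.get returns None for missing keys, matching the null requirement.
    scalars = {k: d.get(k) for k in ("aaa", "bbb", "ddd")}
    # Nested lists of dicts; default to an empty list when the key is absent.
    ccc_rows = (d.get("ccc") or {}).get("ccc") or []
    eee_rows = (d.get("eee") or {}).get("eee") or []
    return scalars, ccc_rows, eee_rows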

Could you also elaborate on how the JSON data needs to be formatted?
