overall aim
I have data landing in blob storage from an Azure service in the form of JSON files, where each line in a file is a nested JSON object. I want to process this with Spark and finally store it as a Delta table with nested struct/map type columns that can later be queried downstream using the dot notation columnName.key.
data nesting visualized
{
  key1: value1,
  nestedType1: {
    key1: value1,
    keyN: valueN
  },
  nestedType2: {
    key1: value1,
    nestedKey: {
      key1: value1,
      keyN: valueN
    }
  },
  keyN: valueN
}
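For reference, this is the access pattern I want the final Delta table to support (key names taken from the sketch above, paths are placeholders):

# df stands for the final Delta table read back downstream
df = spark.read.format("delta").load("output_path")  # placeholder path
df.select("nestedType1.key1", "nestedType2.nestedKey.keyN")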
current approach and problem
I am not using the default Spark JSON reader, as it results in some incorrect parsing of the files. Instead, I load the files as text and parse them with UDFs using Python's json module (example below), after which I use explode and pivot to get the first level of keys into columns, as sketched after the UDF.
import json
from pyspark.sql.functions import udf

@udf('MAP<STRING,STRING>')
def get_key_val(x):
    # parse one raw JSON line into a map of top-level keys to string values
    try:
        return json.loads(x)
    except (TypeError, ValueError):
        return None
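For context, the explode and pivot step looks roughly like this; spark is the session and input_path is a placeholder for the blob storage location:

from pyspark.sql import functions as F

raw = spark.read.text("input_path")  # placeholder path; one JSON object per line
# attach a row id so the pivot can reassemble the exploded keys back into rows
with_id = raw.withColumn("row_id", F.monotonically_increasing_id())
kv = with_id.select("row_id", F.explode(get_key_val("value")).alias("key", "val"))
first_level = kv.groupBy("row_id").pivot("key").agg(F.first("val"))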
After this initial transformation I now need to convert the nestedType columns to valid map types as well. Since the initial UDF returns map<string,string>, the values in the nestedType columns are no longer valid JSON, so I cannot use json.loads again; instead I rely on regex-based string operations:
import re
from pyspark.sql.functions import udf

@udf('MAP<STRING,STRING>')
def convert_map(s):
    # pull key=value pairs out of the stringified nested map
    try:
        regex = re.compile(r"""\w+=.*?(?:(?=,(?!"))|(?=}))""")
        return {a.split('=', 1)[0].strip(): a.split('=', 1)[1]
                for a in regex.findall(s)}
    except Exception:
        return None
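This second UDF then gets applied column by column to the stringified nested values, along these lines (column names from the sketch above, first_level from the earlier snippet):

fixed = (first_level
         .withColumn("nestedType1", convert_map("nestedType1"))
         .withColumn("nestedType2", convert_map("nestedType2")))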
this is fine for the second level of nesting, but going any deeper would require yet another UDF and further complications.
question
How can I use a Spark UDF or native Spark functions to parse the nested JSON data such that it is queryable in columnName.key format?
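For what it's worth, the kind of native approach I have in mind is something like from_json with an explicit schema, if that can be made robust against the parsing issues I hit with the default reader; the schema below is purely illustrative and assumes the nesting sketch above:

from pyspark.sql import functions as F
from pyspark.sql.types import StructType, StructField, StringType, MapType

# illustrative schema mirroring the nesting sketch; real key names vary per file
schema = StructType([
    StructField("key1", StringType()),
    StructField("nestedType1", MapType(StringType(), StringType())),
    StructField("nestedType2", StructType([
        StructField("key1", StringType()),
        StructField("nestedKey", MapType(StringType(), StringType())),
    ])),
    StructField("keyN", StringType()),
])

parsed = raw.select(F.from_json("value", schema).alias("j")).select("j.*")
parsed.select("nestedType2.nestedKey.key1")  # dot notation works on the result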
Also, there is no restriction on the Spark version. Hopefully I was able to explain this properly; do let me know if you want me to add some sample data and code for ease. Any help is appreciated.