
overall aim

I have data landing in blob storage from an Azure service in the form of JSON files, where each line in a file is a nested JSON object. I want to process this with Spark and finally store it as a Delta table with nested struct/map type columns that can later be queried downstream using the dot notation columnName.key.

data nesting visualized

{
  key1: value1
  nestedType1: {
                key1: value1
                keyN: valueN
              }
  nestedType2: {
                key1: value1
                nestedKey: {
                             key1: value1
                             keyN: valueN
                           }
              }
  keyN: valueN
} 
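
To make the target concrete, here is a minimal sketch of the kind of nested schema and dot-notation query I am aiming for (key and table names are hypothetical, mirroring the structure above):

from pyspark.sql.types import StructType, StructField, StringType, MapType

# hypothetical target schema mirroring the nesting sketched above
target_schema = StructType([
    StructField('key1', StringType()),
    StructField('nestedType1', MapType(StringType(), StringType())),
    StructField('nestedType2', StructType([
        StructField('key1', StringType()),
        StructField('nestedKey', MapType(StringType(), StringType()))
    ])),
    StructField('keyN', StringType())
])

# desired downstream query style (table name is illustrative)
# spark.table('my_delta_table').select('nestedType1.key1', 'nestedType2.nestedKey.keyN')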

current approach and problem

I am not using the default Spark JSON reader, as it results in some incorrect parsing of the files. Instead, I load the files as text files and then parse them with UDFs using Python's json module (example below), after which I use explode and pivot to get the first level of keys into columns (roughly sketched after the UDF).

import json
from pyspark.sql.functions import udf

@udf('MAP<STRING,STRING>')
def get_key_val(x):
  # parse one raw text line into a flat map of first-level keys
  try:
    return json.loads(x)
  except (TypeError, ValueError):
    return None
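
The explode/pivot step mentioned above looks roughly like this (the path and the row-id trick are illustrative; my actual code differs in the details):

from pyspark.sql import functions as F

raw = spark.read.text('/mnt/landing/*.json')                    # illustrative path
raw = raw.withColumn('rid', F.monotonically_increasing_id())    # row id so the pivot can regroup rows

first_level = (
    raw.withColumn('kv', get_key_val(F.col('value')))
       .select('rid', F.explode('kv').alias('key', 'val'))
       .groupBy('rid')
       .pivot('key')
       .agg(F.first('val'))
)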

After this initial transformation I now need to convert the nestedType columns to valid map types as well. Since the first UDF returns map<string,string>, the values in the nestedType columns are no longer valid JSON (they come through as strings along the lines of {key1=value1, keyN=valueN}), so I cannot use json.loads; instead I fall back on regex-based string operations.

import re

@udf('MAP<STRING,STRING>')
def convert_map(s):
  # pull key=value pairs out of the stringified nested map, e.g. "{key1=value1, keyN=valueN}"
  try:
    regex = re.compile(r"""\w+=.*?(?:(?=,(?!"))|(?=}))""")
    # split only on the first '=' so values that themselves contain '=' stay intact
    return dict((a.split('=', 1)[0].strip(), a.split('=', 1)[1]) for a in regex.findall(s))
  except Exception:
    return None
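
I then apply this UDF to each first-level column holding nested data, along these lines (column names are the illustrative ones from the structure above, continuing from the earlier sketch):

df2 = (first_level
       .withColumn('nestedType1', convert_map(F.col('nestedType1')))
       .withColumn('nestedType2', convert_map(F.col('nestedType2'))))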

This is fine for the second level of nesting, but going any deeper would require yet another UDF and further complications.

question

How can I use a Spark UDF or native Spark functions to parse the nested JSON data so that it is queryable in the columnName.key format?

Also, there is no restriction on the Spark version. Hopefully I was able to explain this properly; do let me know if you want me to add some sample data and code. Any help is appreciated.
