1

I need to flat/normalize very large hierarchical JSON records. I've tried the examples with json_normalize but I don't know exactly the name of the columns I must pass in the record_path param exactly the other columns for meta.

This is a sample of how complicated the JSON record could be:

d2 = {'key2':None, 
'a':1 , 'b':{'v':[1,2,3]}, 
'c': {'c1':'c1', 'c2':'c2'}, 
'd': {'d1': 1, 'd2':{'d3': 'd3', 'd4': 'd4', 'd5':[3,3,3,4]}}, 
'key':
{'seqOf': [{'seqOf': 
{'dedicatedAccountID': '191', 'campaignIdentifier': None, 'transactionAmount': b'106.670000',
 'adjustmentAmount': None, 'accountBalance': b'122.000000', 'accountExpiryDateBefore': None, 
'accountExpiryDateAfter': None, 'accountStartDateBefore': None, 'accountStartDateAfter': None,
 'mainDedicatedAccountID': None, 'offerIdentifier': None, 'dedicatedAccountUnit': '1', 
'transactionUnits': None, 'adjustmentUnits': None, 'unitBalance': None, 'realMoneyFlag': None}
}]}}

I also tried to do for every key and then join the columns, but as I said, the files are really complex. Performance is not an issue, I'm running this offline, I only need a few of those JSON records flattened to CSV.

Is there any automatic tool that would do it?

Thadeu Melo
  • 947
  • 14
  • 40

2 Answers2

0

When there is a list as the value of a key, json_normalize method will not go any further as it might ruin the data. Lists can be multiple rows, or extension of the data frames. Here is the extension solution for your example:

df = pd.json_normalize(d2)

df_extension = pd.json_normalize(df['key.seqOf'].str[0].to_list())
df = pd.concat([df,df_extension], axis=1).drop('key.seqOf', axis=1)

output:

key2 a b.v c.c1 c.c2 d.d1 d.d2.d3 d.d2.d4 d.d2.d5 seqOf.dedicatedAccountID seqOf.campaignIdentifier seqOf.transactionAmount seqOf.adjustmentAmount seqOf.accountBalance seqOf.accountExpiryDateBefore seqOf.accountExpiryDateAfter seqOf.accountStartDateBefore seqOf.accountStartDateAfter seqOf.mainDedicatedAccountID seqOf.offerIdentifier seqOf.dedicatedAccountUnit seqOf.transactionUnits seqOf.adjustmentUnits seqOf.unitBalance seqOf.realMoneyFlag
0 1 [1, 2, 3] c1 c2 1 d3 d4 [3, 3, 3, 4] 191 106.67 122 1

Or if the list is counted as rows by exploding the d.d2.d5 column:

df = df.explode('d.d2.d5')

result:

key2 a b.v c.c1 c.c2 d.d1 d.d2.d3 d.d2.d4 d.d2.d5 seqOf.dedicatedAccountID seqOf.campaignIdentifier seqOf.transactionAmount seqOf.adjustmentAmount seqOf.accountBalance seqOf.accountExpiryDateBefore seqOf.accountExpiryDateAfter seqOf.accountStartDateBefore seqOf.accountStartDateAfter seqOf.mainDedicatedAccountID seqOf.offerIdentifier seqOf.dedicatedAccountUnit seqOf.transactionUnits seqOf.adjustmentUnits seqOf.unitBalance seqOf.realMoneyFlag
0 1 [1, 2, 3] c1 c2 1 d3 d4 3 191 106.67 122 1
0 1 [1, 2, 3] c1 c2 1 d3 d4 3 191 106.67 122 1
0 1 [1, 2, 3] c1 c2 1 d3 d4 3 191 106.67 122 1
0 1 [1, 2, 3] c1 c2 1 d3 d4 4 191 106.67 122 1

The main point is that how you want your data? In cases such as yours, the data might not be in form of a relational database such as generic SQL, but it could be in form of documents like MongoDB instances. You need to decide how to clean your data depending on the further use of it.

Babak Fi Foo
  • 926
  • 7
  • 17
0

One approach is to serialize the dictionary to a string with json.dumps, and deserialize it back to a dictionary with json.loads with an object hook (see example my_obj_hook below) that appends non-container values of the parent and all children dictionaries to FLAT_MAP.

from json import dumps, loads


FLAT_MAP = dict()


def my_obj_hook(obj):
    """append k-v pairs to flat map"""
    for k,v in obj.items():
        if not isinstance(v, (dict, list)):
            FLAT_MAP[k] = v
    return obj


d2 = dumps(d2)
d2 = loads(d2, object_hook=my_obj_hook)

Here is the documentation for the object_hook parameter:

object_hook is an optional function that will be called with the result of any object literal decoded (a dict). The return value of object_hook will be used instead of the dict. This feature can be used to implement custom decoders (e.g. JSON-RPC class hinting).

thebadgateway
  • 433
  • 1
  • 4
  • 7