Flattening a very large Json with nested lists and nested dict list

Question

I need to flat/normalize very large hierarchical JSON records. I've tried the examples with json_normalize but I don't know exactly the name of the columns I must pass in the record_path param exactly the other columns for meta.

This is a sample of how complicated the JSON record could be:

d2 = {'key2':None, 
'a':1 , 'b':{'v':[1,2,3]}, 
'c': {'c1':'c1', 'c2':'c2'}, 
'd': {'d1': 1, 'd2':{'d3': 'd3', 'd4': 'd4', 'd5':[3,3,3,4]}}, 
'key':
{'seqOf': [{'seqOf': 
{'dedicatedAccountID': '191', 'campaignIdentifier': None, 'transactionAmount': b'106.670000',
 'adjustmentAmount': None, 'accountBalance': b'122.000000', 'accountExpiryDateBefore': None, 
'accountExpiryDateAfter': None, 'accountStartDateBefore': None, 'accountStartDateAfter': None,
 'mainDedicatedAccountID': None, 'offerIdentifier': None, 'dedicatedAccountUnit': '1', 
'transactionUnits': None, 'adjustmentUnits': None, 'unitBalance': None, 'realMoneyFlag': None}
}]}}

I also tried to do for every key and then join the columns, but as I said, the files are really complex. Performance is not an issue, I'm running this offline, I only need a few of those JSON records flattened to CSV.

Is there any automatic tool that would do it?

@galaxyan A flat CSV file without any nested list or dict. It is a requirement from the data team. I believe they could use Azure Data Factory for it, but it wasn't decided yet. — Thadeu Melo, Sep 08 '21 at 15:17
@galaxyan pure lists could be exploded. but if a list has nested dict we also need it to be flatenned, — Thadeu Melo, Sep 08 '21 at 15:40

score 0 · Answer 1 · answered Sep 08 '21 at 18:14

When there is a list as the value of a key, json_normalize method will not go any further as it might ruin the data. Lists can be multiple rows, or extension of the data frames. Here is the extension solution for your example:

df = pd.json_normalize(d2)

df_extension = pd.json_normalize(df['key.seqOf'].str[0].to_list())
df = pd.concat([df,df_extension], axis=1).drop('key.seqOf', axis=1)

output:

	key2	a	b.v	c.c1	c.c2	d.d1	d.d2.d3	d.d2.d4	d.d2.d5	seqOf.dedicatedAccountID	seqOf.campaignIdentifier	seqOf.transactionAmount	seqOf.adjustmentAmount	seqOf.accountBalance	seqOf.accountExpiryDateBefore	seqOf.accountExpiryDateAfter	seqOf.accountStartDateBefore	seqOf.accountStartDateAfter	seqOf.mainDedicatedAccountID	seqOf.offerIdentifier	seqOf.dedicatedAccountUnit	seqOf.transactionUnits	seqOf.adjustmentUnits	seqOf.unitBalance	seqOf.realMoneyFlag
0		1	[1, 2, 3]	c1	c2	1	d3	d4	[3, 3, 3, 4]	191		106.67		122							1

Or if the list is counted as rows by exploding the d.d2.d5 column:

df = df.explode('d.d2.d5')

result:

a	b.v	c.c1	c.c2	d.d1	d.d2.d3	d.d2.d4	d.d2.d5	seqOf.dedicatedAccountID	seqOf.transactionAmount	seqOf.accountBalance	seqOf.dedicatedAccountUnit
1	[1, 2, 3]	c1	c2	1	d3	d4	3	191	106.67	122	1
1	[1, 2, 3]	c1	c2	1	d3	d4	3	191	106.67	122	1
1	[1, 2, 3]	c1	c2	1	d3	d4	3	191	106.67	122	1
1	[1, 2, 3]	c1	c2	1	d3	d4	4	191	106.67	122	1

The main point is that how you want your data? In cases such as yours, the data might not be in form of a relational database such as generic SQL, but it could be in form of documents like MongoDB instances. You need to decide how to clean your data depending on the further use of it.

score 0 · Answer 2 · answered Sep 08 '21 at 18:34

One approach is to serialize the dictionary to a string with json.dumps, and deserialize it back to a dictionary with json.loads with an object hook (see example my_obj_hook below) that appends non-container values of the parent and all children dictionaries to FLAT_MAP.

from json import dumps, loads


FLAT_MAP = dict()


def my_obj_hook(obj):
    """append k-v pairs to flat map"""
    for k,v in obj.items():
        if not isinstance(v, (dict, list)):
            FLAT_MAP[k] = v
    return obj


d2 = dumps(d2)
d2 = loads(d2, object_hook=my_obj_hook)

Here is the documentation for the object_hook parameter:

object_hook is an optional function that will be called with the result of any object literal decoded (a dict). The return value of object_hook will be used instead of the dict. This feature can be used to implement custom decoders (e.g. JSON-RPC class hinting).

Flattening a very large Json with nested lists and nested dict list

2 Answers2