
I am trying to build a DataFrame using pandas, but I am not able to handle the case where the JSON chunks I receive have a variable size.

eg: 1st chunk:

{'ad': 0,
 'country': 'US',
 'ver': '1.0',
 'adIdType': 2,
 'adValue': '5',
 'data': {'eventId': 99,
  'clickId': '',
  'eventType': 'PURCHASEMADE',
  'tms': '2019-12-25T09:57:04+0000',
  'productDetails': {'currency': 'DLR',
   'productList': [
    {'segment': 'Girls',
     'vertical': 'Fashion Jewellery',
     'brickname': 'Traditional Jewellery',
     'price': 8,
     'quantity': 10}]},
  'transactionId': '1254'},
 'appName': 'xer.tt',
 'appId': 'XR',
 'sdkVer': '1.0.0',
 'language': 'en',
 'tms': '2022-04-25T09:57:04+0000',
 'tid': '124'}

2nd chunk:

{'ad': 0,
 'country': 'US',
 'ver': '1.0',
 'adIdType': 2,
 'adValue': '78',
 'data': {'eventId': 7,
  'clickId': '',
  'eventType': 'PURCHASEMADE',
  'tms': '2019-02-25T09:57:04+0000',
  'productDetails': {'currency': 'DLR',
   'productList': [{'segment': 'Boys',
     'vertical': 'Fashion',
     'brickname': 'Casuals',
     'price': 10,
     'quantity': 5},
    {'segment': 'Girls',
     'vertical': 'Fashion Jewellery',
     'brickname': 'Traditional Jewellery',
     'price': 8,
     'quantity': 10}]},
  'transactionId': '3258'},
 'appName': 'xer.tt',
 'appId': 'XR',
 'sdkVer': '1.0.0',
 'language': 'en',
 'tms': '2029-02-25T09:57:04+0000',
 'tid': '124'}

Now, in productDetails the number of products changes: the first chunk lists only 1 product and its details, while the 2nd chunk lists 2 products and their details, and further chunks can have ANY number of products (i.e. chunks ~ records).

I tried writing some Python scripts to do this, but could not come up with a good solution.

PS: If any further detail is required, please let me know in the comments.

Thanks!

codex
  • How does your desired dataframe look? – Joe Apr 29 '22 at 06:16
  • @Joe I am still thinking about it, because if people have this type of data, how do they manage it and do further preprocessing? I am trying to think of a layout like: [other_cols] [product_1] [product_2] [product_3] and so on, but this may be a bad approach because most of the values would likely end up NaN. Joe, are there other libraries or utilities in Python that can handle this type of case another way? I am totally unaware of this type of data. PS: I need to build a recommendation engine using ML, and for that I need to parse this. – codex Apr 29 '22 at 06:28

1 Answer


What you can do is use pd.json_normalize, with the most "inner" dictionary as your record_path and all other data you are interested in as your meta. Here is an in-depth example of how you could construct that: pandas.io.json.json_normalize with very nested json

In your case, that would for example be (for a single object):

import pandas as pd

df = pd.json_normalize(
    obj,
    record_path=["data", "productDetails", "productList"],
    meta=[
        ["data", "productDetails", "currency"],
        ["data", "transactionId"],
        ["data", "clickId"],
        ["data", "eventType"],
        ["data", "tms"],
        "ad",
        "country",
    ],
)
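Since chunks keep arriving, you can normalize each one the same way and stitch the per-chunk frames together with pd.concat. A minimal, self-contained sketch (the two chunks below are trimmed to the fields used, based on the question's examples):

```python
import pandas as pd

# Sample chunks, trimmed from the question; productList can hold any
# number of products per chunk.
chunk1 = {
    "ad": 0, "country": "US",
    "data": {
        "transactionId": "1254", "clickId": "", "eventType": "PURCHASEMADE",
        "tms": "2019-12-25T09:57:04+0000",
        "productDetails": {
            "currency": "DLR",
            "productList": [
                {"segment": "Girls", "vertical": "Fashion Jewellery",
                 "brickname": "Traditional Jewellery", "price": 8, "quantity": 10},
            ],
        },
    },
}
chunk2 = {
    "ad": 0, "country": "US",
    "data": {
        "transactionId": "3258", "clickId": "", "eventType": "PURCHASEMADE",
        "tms": "2019-02-25T09:57:04+0000",
        "productDetails": {
            "currency": "DLR",
            "productList": [
                {"segment": "Boys", "vertical": "Fashion",
                 "brickname": "Casuals", "price": 10, "quantity": 5},
                {"segment": "Girls", "vertical": "Fashion Jewellery",
                 "brickname": "Traditional Jewellery", "price": 8, "quantity": 10},
            ],
        },
    },
}

meta = [
    ["data", "productDetails", "currency"],
    ["data", "transactionId"],
    ["data", "clickId"],
    ["data", "eventType"],
    ["data", "tms"],
    "ad",
    "country",
]

# One row per product; chunk-level fields repeat on every row of that chunk.
frames = [
    pd.json_normalize(c, record_path=["data", "productDetails", "productList"],
                      meta=meta)
    for c in (chunk1, chunk2)
]
df = pd.concat(frames, ignore_index=True)
print(df.shape)  # (3, 12): 1 + 2 product rows, 5 product + 7 meta columns
```

This yields a long (tidy) format, one row per product, which avoids the sparse [product_1] [product_2] … wide layout discussed in the comments: chunk-level columns like data.transactionId simply repeat for each product in that chunk.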
braml1