Convert PANDAS dataframe to nested JSON + add array name

Question

I've been wresting with this for many days now and would appreciate any help.

I'm importing an Excel file to a Pandas data frame resulting in the following dataframe [record]:

account_id	name	timestamp	value
A0001C	Fund_1	1588618800000000000	1
B0001B	Dev_2	1601578800000000000	1

I'm looking to produce a nested JSON output (will be used to submit data to an API), include adding a records and metric labels for the arrays.

Here is the output i'm looking for:

{
  "records": [
    {
      "name": "Fund_1", 
      "account_id": "A0001C", 
      "metrics": [
        {
          "timestamp": 1588618800000000000, 
          "value": 1
        }
      ]
    }
    {
      "name": "Dev_2", 
      "account_id": "B0001B", 
      "metrics": [
        {
          "timestamp": 1601578800000000000, 
          "value": 1
        } 
      ]
    }
  ]
}

I've gotten an output of a none nested JSON data set, but not able split out the timestamp and value to add the metrics part.

for record in df.to_dict(orient='records'):

        record_data = {'records': [record]}
        payload_json = json.dumps(record_data)
        print(payload_json)

I get the following output:

{"records": [{"account_id": "A0001C", "name": "Fund_1", "Date Completed": 1588618800000000000, "Count": "1"}]}
{"records": [{"account_id": "B0001B", "name": "Dev_2", "Date Completed": 1601578800000000000, "Count": "1"}]}

Any help on how i can modify my code to add the metrics label and nest the data.

Thanks in advance.

is the data particularly large or are there any constraints which make adding an auxiliary column/using `map` infeasible? — sim, Jan 26 '21 at 19:39
the data is 4000 row in the Excel sheet. I'm a real novice at coding so not sure on feasibilities of using map. I can add columns to the dataframe, but I'm still ot able to nest the JSON output — Graham McD, Jan 26 '21 at 20:22

sim · Answer 1 · 2021-01-27T13:48:35.290

One approach is through the use of pd.apply. This allows you to apply a function to series (either column- or row-wise) in your dataframe.

In your particular case, you want to apply the function row-by-row, so you have to use apply with axis=1:

records = list(df.apply(lambda row: {"name": row["name"],
                                     "account_id": row["account_id"],
                                     "metrics": [{
                                         "timestamp": row["timestamp"],
                                         "value": row["value"]}]
                                    },
                        axis=1).values)
payload = {"records": records}

Alternatively, you could introduce an auxiliary column "metrics" in which you store your metrics (subsequently applying pd.to_json):

df["metrics"] = df.apply(lambda e: [{"timestamp": e.timestamp,
                                     "value": e.value}], 
                         axis=1)
records = df[["account_id", "name", "metrics"]].to_dict(orient="records")
payload = {"records": records}

Here's a full example applying option 2:

import io
import json

import pandas as pd

data = io.StringIO("""account_id    name    timestamp   value
A0001C  Fund_1  1588618800000000000 1
B0001B  Dev_2   1601578800000000000 1""")

df = pd.read_csv(data, sep="\t")

df["metrics"] = df.apply(lambda e: [{"timestamp": e.timestamp,
                                     "value": e.value}], 
                         axis=1)
records = df[["account_id", "name", "metrics"]].to_dict(orient="records")
payload = {"records": records}
print(json.dumps(payload, indent=4))

Output:

{
    "records": [
        {
            "account_id": "A0001C",
            "name": "Fund_1",
            "metrics": [
                {
                    "timestamp": 1588618800000000000,
                    "value": 1
                }
            ]
        },
        {
            "account_id": "B0001B",
            "name": "Dev_2",
            "metrics": [
                {
                    "timestamp": 1601578800000000000,
                    "value": 1
                }
            ]
        }
    ]
}

Edit: The second approach also makes grouping by accounts (in case you want to do that) rather easy. Below is a small example and output:

import io
import json

import pandas as pd

data = io.StringIO("""account_id    name    timestamp   value
A0001C  Fund_1  1588618800000000000 1
A0001C  Fund_1  1588618900000000000 2
B0001B  Dev_2   1601578800000000000 1""")

df = pd.read_csv(data, sep="\t")

# adding the metrics column as above
df["metrics"] = df.apply(lambda e: {"timestamp": e.timestamp,
                                    "value": e.value}, 
                         axis=1)

# group metrics by account
df_grouped = df.groupby(by=["name", "account_id"]).metrics.agg(list).reset_index()
records = df_grouped[["account_id", "name", "metrics"]].to_dict(orient="records")
payload = {"records": records}
print(json.dumps(payload, indent=4))

Output:

{
    "records": [
        {
            "account_id": "B0001B",
            "name": "Dev_2",
            "metrics": [
                {
                    "timestamp": 1601578800000000000,
                    "value": 1
                }
            ]
        },
        {
            "account_id": "A0001C",
            "name": "Fund_1",
            "metrics": [
                {
                    "timestamp": 1588618800000000000,
                    "value": 1
                },
                {
                    "timestamp": 1588618900000000000,
                    "value": 2
                }
            ]
        }
    ]
}

Sim, thank you for your help and solution. one slight issue still is that the second account entry has both timestamp / value included and the value seems to have increased. I only need to have the timestamp / value data relating to the account_id. Suggestions would be appreciated. — Graham McD, Jan 27 '21 at 08:29
@GrahamMcD: If you use the code before the edit, that will just produce single (timestamp, value)-entries for each record. — sim, Jan 27 '21 at 08:32
thank you again i needed to add a '[' to the metrics line "metrics": [ - Now works great. Thank you again for your help, much appreciated. — Graham McD, Jan 27 '21 at 09:32
it would be great to know how option 2 above could be made to work, so as to only include the 1 instance of timestamp. Just for knowledge, if you have any thoughts. — Graham McD, Jan 27 '21 at 13:44

Convert PANDAS dataframe to nested JSON + add array name

1 Answers1

Linked