1

I've been wresting with this for many days now and would appreciate any help.

I'm importing an Excel file to a Pandas data frame resulting in the following dataframe [record]:

account_id name timestamp value
A0001C Fund_1 1588618800000000000 1
B0001B Dev_2 1601578800000000000 1

I'm looking to produce a nested JSON output (will be used to submit data to an API), include adding a records and metric labels for the arrays.

Here is the output i'm looking for:

{
  "records": [
    {
      "name": "Fund_1", 
      "account_id": "A0001C", 
      "metrics": [
        {
          "timestamp": 1588618800000000000, 
          "value": 1
        }
      ]
    }
    {
      "name": "Dev_2", 
      "account_id": "B0001B", 
      "metrics": [
        {
          "timestamp": 1601578800000000000, 
          "value": 1
        } 
      ]
    }
  ]
}

I've gotten an output of a none nested JSON data set, but not able split out the timestamp and value to add the metrics part.

for record in df.to_dict(orient='records'):

        record_data = {'records': [record]}
        payload_json = json.dumps(record_data)
        print(payload_json)

I get the following output:

{"records": [{"account_id": "A0001C", "name": "Fund_1", "Date Completed": 1588618800000000000, "Count": "1"}]}
{"records": [{"account_id": "B0001B", "name": "Dev_2", "Date Completed": 1601578800000000000, "Count": "1"}]}

Any help on how i can modify my code to add the metrics label and nest the data.

Thanks in advance.

Graham McD
  • 11
  • 2
  • is the data particularly large or are there any constraints which make adding an auxiliary column/using `map` infeasible? – sim Jan 26 '21 at 19:39
  • the data is 4000 row in the Excel sheet. I'm a real novice at coding so not sure on feasibilities of using map. I can add columns to the dataframe, but I'm still ot able to nest the JSON output – Graham McD Jan 26 '21 at 20:22

1 Answers1

3

One approach is through the use of pd.apply. This allows you to apply a function to series (either column- or row-wise) in your dataframe.

In your particular case, you want to apply the function row-by-row, so you have to use apply with axis=1:

records = list(df.apply(lambda row: {"name": row["name"],
                                     "account_id": row["account_id"],
                                     "metrics": [{
                                         "timestamp": row["timestamp"],
                                         "value": row["value"]}]
                                    },
                        axis=1).values)
payload = {"records": records}

Alternatively, you could introduce an auxiliary column "metrics" in which you store your metrics (subsequently applying pd.to_json):

df["metrics"] = df.apply(lambda e: [{"timestamp": e.timestamp,
                                     "value": e.value}], 
                         axis=1)
records = df[["account_id", "name", "metrics"]].to_dict(orient="records")
payload = {"records": records}

Here's a full example applying option 2:

import io
import json

import pandas as pd

data = io.StringIO("""account_id    name    timestamp   value
A0001C  Fund_1  1588618800000000000 1
B0001B  Dev_2   1601578800000000000 1""")

df = pd.read_csv(data, sep="\t")

df["metrics"] = df.apply(lambda e: [{"timestamp": e.timestamp,
                                     "value": e.value}], 
                         axis=1)
records = df[["account_id", "name", "metrics"]].to_dict(orient="records")
payload = {"records": records}
print(json.dumps(payload, indent=4))

Output:

{
    "records": [
        {
            "account_id": "A0001C",
            "name": "Fund_1",
            "metrics": [
                {
                    "timestamp": 1588618800000000000,
                    "value": 1
                }
            ]
        },
        {
            "account_id": "B0001B",
            "name": "Dev_2",
            "metrics": [
                {
                    "timestamp": 1601578800000000000,
                    "value": 1
                }
            ]
        }
    ]
}


Edit: The second approach also makes grouping by accounts (in case you want to do that) rather easy. Below is a small example and output:

import io
import json

import pandas as pd

data = io.StringIO("""account_id    name    timestamp   value
A0001C  Fund_1  1588618800000000000 1
A0001C  Fund_1  1588618900000000000 2
B0001B  Dev_2   1601578800000000000 1""")

df = pd.read_csv(data, sep="\t")

# adding the metrics column as above
df["metrics"] = df.apply(lambda e: {"timestamp": e.timestamp,
                                    "value": e.value}, 
                         axis=1)

# group metrics by account
df_grouped = df.groupby(by=["name", "account_id"]).metrics.agg(list).reset_index()
records = df_grouped[["account_id", "name", "metrics"]].to_dict(orient="records")
payload = {"records": records}
print(json.dumps(payload, indent=4))

Output:

{
    "records": [
        {
            "account_id": "B0001B",
            "name": "Dev_2",
            "metrics": [
                {
                    "timestamp": 1601578800000000000,
                    "value": 1
                }
            ]
        },
        {
            "account_id": "A0001C",
            "name": "Fund_1",
            "metrics": [
                {
                    "timestamp": 1588618800000000000,
                    "value": 1
                },
                {
                    "timestamp": 1588618900000000000,
                    "value": 2
                }
            ]
        }
    ]
}
sim
  • 1,227
  • 14
  • 20
  • Sim, thank you for your help and solution. one slight issue still is that the second account entry has both timestamp / value included and the value seems to have increased. I only need to have the timestamp / value data relating to the account_id. Suggestions would be appreciated. – Graham McD Jan 27 '21 at 08:29
  • @GrahamMcD: If you use the code before the edit, that will just produce single (timestamp, value)-entries for each record. – sim Jan 27 '21 at 08:32
  • thank you again i needed to add a '[' to the metrics line "metrics": [ - Now works great. Thank you again for your help, much appreciated. – Graham McD Jan 27 '21 at 09:32
  • @GrahamMcD: You are right, sorry, didn't catch that. – sim Jan 27 '21 at 11:13
  • it would be great to know how option 2 above could be made to work, so as to only include the 1 instance of timestamp. Just for knowledge, if you have any thoughts. – Graham McD Jan 27 '21 at 13:44
  • @GrahamMcD: Added the example. – sim Jan 27 '21 at 13:49