Generate hierarchical data from pandas df to list

Question

I have data in this form:

data = [
    [2019, "July", 8, '1.2.0', 7.0, None, None, None],
    [2019, "July", 10, '1.2.0', 52.0, "Breaking", 6.0, 'Path Removed w/o Deprecation'],
    [2019, "July", 15, "0.1.0", 210.0, "Breaking", 57.0, 'Request Parameter Removed'],
    [2019, 'August', 20, '2.0.0', 100.0, "Breaking", None, None],
    [2019, 'August', 25, '2.0.0', 200.0, 'Non-breaking', None, None],
]

The list goes in this hierarchy: Year, Month, Day, info_version, API_changes, type1, count, content

I want to generate this hierarchical tree structure for the data:

{
  "name": "2019", # this is year
  "children": [
    {
      "name": "July", # this is month
      "children": [
        {
          "name": "10",   #this is day
          "children": [
            {
              "name": "1.2.0",   # this is info_version
              "value": 52,        # this is value of API_changes(always a number)
              "children": [
                {
                  "name": "Breaking",   # this is type1 column( it is string, it is either Nan or Breaking)
                  "value": 6,                   # this is value of count
                  "children": [
                    {
                      "name": "Path Removed w/o Deprecation",      #this is content column
                      "value": 6        # this is value of count
                    }
                  ]
                }
              ]
            }
          ]
        }
      ]
    }
  ]
}

All other months continue in the same format. I do not wish to modify my data in any way; this is how it is supposed to be for my use case (graphical purposes). I am not sure how to achieve this, and any suggestions would be greatly appreciated.

This is in reference to this format for Sunburst graph in pyecharts

score 4 · Answer 1 · answered May 03 '23 at 17:55

First you need to build a nested dict keyed by every level of the hierarchy, then convert it into the name/children structure recursively:

from collections import defaultdict

def to_keys(values):
    if isinstance(values, tuple):
        return {"name": values[0], "value": values[1]}
    return {"name": values}    

def to_children(values):
    if isinstance(values, list):
        return [to_children(item) for item in values]
    if isinstance(values, tuple):
        return to_keys(values)
    if isinstance(values, dict):
        return [{**to_keys(key), "children": to_children(value)}
                for key, value in values.items()]
    raise Exception("invalid type")

gen = lambda: defaultdict(gen)
result = defaultdict(gen)

data = [
    [2019, "July", 10, '1.2.0', 52.0, 'Breaking', 6, None],
    [2019, "July", 10, '1.2.0', 52.0, "Breaking", 6.0, 'Path Removed w/o Deprecation'],
    [2019, "July", 15, "0.1.0", 210.0, "Breaking", 57.0, 'Request Parameter Removed'],
    [2019, 'August', 20, '2.0.0', 100.0, "Breaking", None, None],
    [2019, 'August', 25, '2.0.0', 200.0, 'Non-breaking', None, None],
]

for year, month, day, info_version, api_changes, type1, count, content in data:
    result[year][month][day][(info_version, api_changes)].setdefault((type1, count), []).append((content, count))

final_result = to_children(result)
print(final_result)

The code works perfectly, although there is a slight error in the generated hierarchy. I get this: `"children": [ {"name": None, "value": 6}`, which is added before the `Path Removed w/o Deprecation` node — Brie MerryWeather, May 03 '23 at 23:54
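
One possible way to avoid the `{"name": None, ...}` leaf mentioned in the comment is to guard the grouping loop. This is only a sketch of the loop from the answer above: the names `rows`, `nested`, `version_node` and `leaves` are introduced here for illustration, and skipping rows without a `type1` is an assumption about the desired output.

```python
from collections import defaultdict

# Two sample rows in the same column order as the answer's data
rows = [
    [2019, "July", 10, "1.2.0", 52.0, "Breaking", 6.0, None],
    [2019, "July", 10, "1.2.0", 52.0, "Breaking", 6.0, "Path Removed w/o Deprecation"],
]


def nested():
    """Return a dict that creates deeper levels on first access."""
    return defaultdict(nested)


tree = nested()
for year, month, day, info_version, api_changes, type1, count, content in rows:
    # Accessing the node creates the year/month/day/info_version levels
    version_node = tree[year][month][day][(info_version, api_changes)]
    if type1 is None:
        continue  # assumption: no type1 means nothing below info_version
    leaves = version_node.setdefault((type1, count), [])
    if content is not None:  # skip the leaf that would be named None
        leaves.append((content, count))

print(tree[2019]["July"][10][("1.2.0", 52.0)][("Breaking", 6.0)])
```

The resulting `tree` has the same shape as `result` in the answer, so it should be usable with `to_children` unchanged.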

ftorre · Accepted Answer · 2023-05-04T16:53:12.170

Assuming that the headers are known and listed in hierarchical order, describe each level by the columns that must be grouped together, plus an optional sort key, like so (see the datetime documentation for how strptime is used):

from datetime import datetime
hierarchical_description = [
    ([("name", "Year")], lambda d: int(d["name"])),
    ([("name", "Month")], lambda d: datetime.strptime(d["name"], "%B").month),
    ([("name", "Day")], None),
    ([("name", "info_version"), ("value", "API_changes")], None),
    (
        [
            ("name", "type1"),
            ("value", "count"),
        ],
        None,
    ),
    ([("name", "content"), ("value", "count")], None),
]

And assuming that the DataFrame is loaded as follows:

import pandas as pd

data = [
    [2019, "July", 8, "1.2.0", 7.0, None, None, None],
    [2019, "July", 10, "1.2.0", 52.0, "Breaking", 6.0, "Path Removed w/o Deprecation"],
    [2019, "July", 15, "0.1.0", 210.0, "Breaking", 57.0, "Request Parameter Removed"],
    [2019, "August", 20, "2.0.0", 100.0, "Breaking", None, None],
    [2019, "August", 25, "2.0.0", 200.0, "Non-breaking", None, None],
]

hierarchical_order = [
    "Year",
    "Month",
    "Day",
    "info_version",
    "API_changes",
    "type1",
    "count",
    "content",
]

df = pd.DataFrame(
    data,
    columns=hierarchical_order,
)

It is possible to write recursive functions that descend hierarchically into the DataFrame:

def logical_and_df(df, conditions):
    if len(conditions) == 0:
        return df
    colname, value = conditions[0]
    return logical_and_df(df[df[colname] == value], conditions[1:])


def get_hierarchical_data(df, description):
    if len(description) == 0:
        return []

    children = []
    parent_description, sorting_function_key = description[0]
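    group_columns = [colname for _, colname in parent_description]
    for colvalues, subdf in df.groupby(group_columns):
        # pandas >= 2.0 yields a tuple key even for a single column;
        # older versions yield a scalar, so normalize to a tuple here
        if not isinstance(colvalues, tuple):
            colvalues = (colvalues,)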
        attributes = {
            key: value for (key, _), value in zip(parent_description, colvalues)
        }
        grand_children = get_hierarchical_data(
            logical_and_df(
                subdf,
                [
                    (colname, value)
                    for (_, colname), value in zip(parent_description, colvalues)
                ],
            ),
            description[1:],
        )
        if len(grand_children) > 0:
            attributes["children"] = grand_children

        children.append(attributes)

    if sorting_function_key is None:
        return children
    return sorted(children, key=sorting_function_key)

The function logical_and_df takes a DataFrame and a list of conditions. A condition is a pair whose first member is a column name and whose second member is the value required in that column; only rows that satisfy every condition are kept.

The recursive function get_hierarchical_data takes the DataFrame and the hierarchical description as input. The description is a list of tuples. Each tuple consists of a list naming the columns used for the name and value attributes, and an optional sort key function that is used to order the list of children. The function returns the children whose name and value are taken from the first element of the description. If the description is empty, it returns an empty list of children. Otherwise, it uses the pandas groupby method to find the unique combinations of values (see this post). A dictionary of name and value attributes is built for each combination, and any children found by the recursive call are attached under the children key.
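
To make the groupby step concrete, here is a small sketch on a toy frame (the frame `toy` and the variables `pairs` and `type_names` are made up for illustration). It shows that grouping by a list of columns yields one key tuple per unique combination, and that with the default `dropna=True` rows whose key is missing form no group, which appears to be why rows without a type1 end up with no children.

```python
import pandas as pd

# Toy frame reusing three column names from the question
toy = pd.DataFrame(
    {
        "info_version": ["1.2.0", "1.2.0", "0.1.0"],
        "API_changes": [52.0, 52.0, 210.0],
        "type1": ["Breaking", None, "Breaking"],
    }
)

# Grouping by a list of columns yields (key tuple, sub-frame) pairs,
# one per unique combination of values
pairs = {key: len(sub) for key, sub in toy.groupby(["info_version", "API_changes"])}
print(pairs)

# With the default dropna=True, the row whose type1 is missing forms no group
type_names = [key for key, _ in toy.groupby("type1")]
print(type_names)
```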

The following lines print the resulting structure as JSON:

import json
print(json.dumps(get_hierarchical_data(df, hierarchical_description), indent=5))
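
Depending on the pandas version and column dtypes, some group keys may come back as NumPy scalars, and `json.dumps` rejects integer ones such as `numpy.int64`. If that happens, a `default` hook is one way around it. This is a sketch under that assumption; the helper name `to_builtin` and the sample `tree` are made up here.

```python
import json

import numpy as np


def to_builtin(obj):
    """Convert NumPy scalars to plain Python values for json.dumps."""
    if isinstance(obj, np.generic):
        return obj.item()
    raise TypeError(f"{type(obj).__name__} is not JSON serializable")


# Hand-built stand-in for one branch of the hierarchy
tree = [{"name": np.int64(10), "children": [{"name": "1.2.0", "value": np.float64(52.0)}]}]

# json.dumps calls the hook only for objects it cannot serialize itself
text = json.dumps(tree, default=to_builtin, indent=5)
print(text)
```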

First posted version

My first version did not handle grouped columns, which this problem requires. I edited the post to the version above, which should solve your issue.

This answer is perfect, with just one minor thing. I apologize: in the first line of my data I had forgotten an additional None, which I have now fixed. The only remaining issues are that it puts August before July, and that the `value` entries are wrapped in double quotes — Brie MerryWeather, May 04 '23 at 11:42
Indeed. To get rid of the double quotes, you can remove the str call when creating the attributes. For the ordering, you can add a sorting step and a sort key function to the description. I will update the answer accordingly. — ftorre, May 04 '23 at 16:48
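
As a small illustration of the sort key idea from the comment above, month names can be ordered by calendar position rather than alphabetically. The names `children`, `month_number` and `ordered` are hypothetical, and parsing with `%B` assumes an English (or C) locale.

```python
from datetime import datetime

# Children as they would come out of an alphabetical grouping
children = [{"name": "August"}, {"name": "July"}]


def month_number(node):
    """Map a full month name such as "July" to its number (7)."""
    return datetime.strptime(node["name"], "%B").month


ordered = sorted(children, key=month_number)
print([node["name"] for node in ordered])
```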
