
Given the following

policies = [
    {"feature_1": "A", "feature_2": "London", "feature_3": 1000, "feature_4": 10},
    {"feature_1": "A", "feature_2": "London", "feature_3": 2000, "feature_4": 20},
    {"feature_1": "B", "feature_2": "Dublin", "feature_3": 3000, "feature_4": 30},
    {"feature_1": "B", "feature_2": "Dublin", "feature_3": 4000, "feature_4": 40},
    {"feature_1": "A", "feature_2": "London", "feature_3": 5000, "feature_4": 50},
    {"feature_1": "C", "feature_2": "London", "feature_3": 6000, "feature_4": 60}
]

I want to pass the above list of dicts and two lists

group_fields = ["feature_1", "feature_2"]
sum_fields = ["feature_3", "feature_4"]

and get back

[{'feature_1': 'A', 'feature_2': 'London', 'feature_3': 8000, 'feature_4': 80},
 {'feature_1': 'B', 'feature_2': 'Dublin', 'feature_3': 7000, 'feature_4': 70},
 {'feature_1': 'C', 'feature_2': 'London', 'feature_3': 6000, 'feature_4': 60}]

So it has grouped over the group_fields and summed over the sum_fields (both of which are subject to change).

This is closely related to Group by multiple keys and summarize/average values of a list of dictionaries, but I had problems generalising that approach to my problem.

from itertools import groupby
from operator import itemgetter
from pprint import pprint    
grouper = itemgetter(*group_fields)

result = []
for key, grp in groupby(sorted(policies, key=grouper), grouper):   
    temp_dict = dict(zip(group_fields, key))
    
    group_tuple = [(item["feature_3"], item["feature_4"]) for item in grp]
    
    temp_dict["feature_3"] = sum([item[0] for item in group_tuple])
    temp_dict["feature_4"] = sum([item[1] for item in group_tuple])

    result.append(temp_dict)

pprint(result)   

This does work, but I have had to hardcode feature_3 and feature_4. I can't figure out how to abstract that out so that the only place I type those features is within the sum_fields variable. I also don't like that I have to sum multiple times over group_tuple to get my values out. Can someone please help?

Thanks

Almog

2 Answers


You can use some zip and map magic to sum all the fields in one line; you just need to make another itemgetter for the sum fields:

from itertools import groupby
from operator import itemgetter
from pprint import pprint

grouper = itemgetter(*group_fields)
field_grouper = itemgetter(*sum_fields)

result = []
for key, grp in groupby(sorted(policies, key=grouper), grouper):   
    temp_dict = dict(zip(group_fields, key))
    #line below has all the magic in it
    temp_dict.update(dict(zip(sum_fields, map(sum, zip(*map(field_grouper, grp))))))

    result.append(temp_dict)

pprint(result) 

All it's doing is transposing the per-record values with zip(*...) and summing each resulting column. It then zips the field names to their sums and turns the pairs into a dict, which is merged into temp_dict via update.
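To make the transpose-and-sum step concrete, here is a small standalone sketch of what the inner expression does for one group (the rows shown match the "A"/"London" group from the question's data):

```python
# Rows of per-record values for one group, i.e. what
# map(field_grouper, grp) yields for the "A"/"London" group.
rows = [(1000, 10), (2000, 20), (5000, 50)]

# zip(*rows) transposes rows into columns: (1000, 2000, 5000) and (10, 20, 50).
# Mapping sum over the columns gives one total per sum field.
sums = list(map(sum, zip(*rows)))
print(sums)  # [8000, 80]
```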

Result:

[{'feature_1': 'A', 'feature_2': 'London', 'feature_3': 8000, 'feature_4': 80},
 {'feature_1': 'B', 'feature_2': 'Dublin', 'feature_3': 7000, 'feature_4': 70},
 {'feature_1': 'C', 'feature_2': 'London', 'feature_3': 6000, 'feature_4': 60}]

Try it here: https://repl.it/repls/ImaginaryDarkCable

Jab
  • This is ok, but I notice that I get strange behaviour if either variable has length 1, so if I set group_fields = ["feature_2"] the grouping kind of works but the values for the feature 2 key are now "D" and "L" instead of "Dublin" and "London". If I set sum_fields=["feature_3"] it breaks. Do I need edge cases for where either variable has length 1? – Colin Menzies Aug 22 '20 at 11:56
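The length-1 behaviour described in the comment above comes from operator.itemgetter: with a single key it returns a bare value rather than a 1-tuple, so dict(zip(group_fields, key)) pairs the field name with the characters of a string (hence "D" and "L"), and zip(*...) over bare ints raises a TypeError. A minimal sketch of a wrapper that always yields tuples (group_and_sum and tuple_getter are my own, hypothetical names):

```python
from itertools import groupby
from operator import itemgetter


def group_and_sum(records, group_fields, sum_fields):
    def tuple_getter(fields):
        # itemgetter with a single key returns a scalar, not a 1-tuple,
        # so wrap that case to keep the zip-based logic uniform.
        getter = itemgetter(*fields)
        if len(fields) == 1:
            return lambda record: (getter(record),)
        return getter

    grouper = tuple_getter(group_fields)
    field_grouper = tuple_getter(sum_fields)

    result = []
    for key, grp in groupby(sorted(records, key=grouper), grouper):
        temp = dict(zip(group_fields, key))
        # Transpose per-record values and sum each column, as in the answer.
        temp.update(zip(sum_fields, map(sum, zip(*map(field_grouper, grp)))))
        result.append(temp)
    return result
```

With this wrapper, group_fields = ["feature_2"] and sum_fields = ["feature_3"] behave the same as the multi-field case.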

You can use pandas:

policies = [
    {"feature_1": "A", "feature_2": "London", "feature_3": 1000, "feature_4": 10},
    {"feature_1": "A", "feature_2": "London", "feature_3": 2000, "feature_4": 20},
    {"feature_1": "B", "feature_2": "Dublin", "feature_3": 3000, "feature_4": 30},
    {"feature_1": "B", "feature_2": "Dublin", "feature_3": 4000, "feature_4": 40},
    {"feature_1": "A", "feature_2": "London", "feature_3": 5000, "feature_4": 50},
    {"feature_1": "C", "feature_2": "London", "feature_3": 6000, "feature_4": 60}
]

import pandas as pd
df = pd.DataFrame.from_dict(policies)

group_fields = ["feature_1", "feature_2"]
sum_fields = ["feature_3", "feature_4"]


dict_sum_fields = {x: sum for x in sum_fields}
df_grouped = df.groupby(group_fields, as_index=False).aggregate(dict_sum_fields)

print(df_grouped.to_dict('records'))

[{'feature_1': 'A', 'feature_2': 'London', 'feature_3': 8000, 'feature_4': 80},
 {'feature_1': 'B', 'feature_2': 'Dublin', 'feature_3': 7000, 'feature_4': 70},
 {'feature_1': 'C', 'feature_2': 'London', 'feature_3': 6000, 'feature_4': 60}]
Andreas
  • This does work, and doesn't run into the problems I mention above if group_fields or sum_fields has length 1. I suspect there will be performance issues with this solution, however, as I am converting into pandas and the tables I will be running the solution on will be large. – Colin Menzies Aug 22 '20 at 12:00
  • @ColinMenzies have you tried it with your data? Pandas is usually really performant; if you still have problems, try Dask, which works very well with Pandas. Also, it really depends on where the dictionaries come from, because pandas has a lot of formats it can directly read, like .pkl, json etc., so maybe there's no need to temporarily store the data as a list of dictionaries but instead load the data directly into pandas. – Andreas Aug 22 '20 at 13:41
  • The list of dicts is non-negotiable! But I tested with a randomly generated list of 1m dicts and this takes approx 3.5 seconds versus the other answer of 4.5 seconds (which doesn't fully work), so I'm learning some other stuff here and this answer is a good answer. – Colin Menzies Aug 23 '20 at 11:35
  • Glad I could help, Happy Coding! – Andreas Aug 23 '20 at 13:54