Aggregating values from one dictionary to populate another in Python

Question

I have a list of ~12K dictionaries. Each dictionary has the same keys: year, code and category.

L = [{"year": "2015", "code": "VU", "category": "Vulnerable"}, {"year": "2008", "code": "VU", "category": "Vulnerable"}, {"year": "2004", "code": "LC", "category": "Least Concern"}]

I'm trying to create a new dictionary that will have, as key, each value of code and, as the value to that key, a list of unique years for each code (I don't necessarily need the category key-value pair):

{"VU": {2008, 2015}, "LC": {2004}}

I created a dictionary codes_dict with the correct codes as keys, and empty sets as values (since I don't want duplicates, and I really only need the earliest and latest years.)

codes = (e['code'] for e in L)
codes_dict = dict.fromkeys(codes, set())

for e in L:
    codes_dict[e['code']].add(e['year'])

However, when I try to populate the values, I get every year added to every code:

{'VU': {'2004', '2008', '2015'}, 'LC': {'2004', '2008', '2015'}}

What am I missing? I tried using a list instead of a set and got the same result (with duplicates). Also using = instead of add() means only the last value is added, whereas I want the whole range.

Performance isn't really an issue, as this is just supposed to be a quick diagnostic.

Bonus: if there is a better way to do this in pandas, I'd love to hear it.

Thanks!

@kopecs: Thanks, I just edited the question with that clarification. — panopticonopolis, Dec 27 '19 at 20:32
Does this answer your question [dict.fromkeys all point to same list](https://stackoverflow.com/questions/15516413/dict-fromkeys-all-point-to-same-list)? It seems to be the same root issue. — kopecs, Dec 27 '19 at 20:36
@kopecs It's useful for the analysis but I really like the defaultdict solution you gave below. — panopticonopolis, Dec 28 '19 at 00:44

Tim · Answer 1 · 2019-12-28T01:26:27.840

You can use dataframe using the following one-liner.

output = {k: list(filter(lambda x: isinstance(x, str) or not numpy.isnan(x), v)) for k, v in pandas.DataFrame(L).pivot(columns='code', values='year').to_dict('list').items()}

Breaking it down:

Load list into dataframe

df=pandas.DataFrame(L)

Create a pivot table where the different codes are columns and each cell contain the year corresponding to the code.

pivot_table = df.pivot(columns='code', values='year')

The result should look like:

       VU    LC
0     2008   NaN
1     2015   NaN
2     NaN    2004

Output dict of list from dataframe

output = pivot_table.to_dict('list')

The result should look like:

{‘VU’: [‘2008’, ‘2015’, NaN], ‘VL’: [NaN, NaN, ‘2004’]}

This is almost there. Just need to filter out the NaNs.

output = {k: list(filter(lambda x: isinstance(x, str) or not numpy.isnan(x), v)) for k, v in output.items()}

This uses filter function to filter out non string and NaN objects. Then uses dictionary comprehension to reconstruct the dictionary without the NaNs.

Thanks, @Tim. My pandas knowledge is still fairly beginner so this isn't easily parseable for me. If you can, a few words walking me through the expression would be very appreciated. — panopticonopolis, Dec 27 '19 at 23:19
Thanks, that's supremely helpful. Lots of great new methods to wrap my head around. — panopticonopolis, Dec 28 '19 at 20:54

score 1 · Accepted Answer · answered Dec 27 '19 at 20:36

In your code, all of your values point to the same set. Try instead (using a defaultdict; you could instead use get and set each element to be a new set if it doesn't yet exist)

from collections import defaultdict

L = [{"year": "2015", "code": "VU", "category": "Vulnerable"}, {"year": "2008", "code": "VU", "category": "Vulnerable"}, {"year": "2004", "code": "LC", "category": "Least Concern"}]


codes_dict = defaultdict(set)
for e in L:
    codes_dict[e['code']].add(e['year'])

print(dict(codes_dict))

Aggregating values from one dictionary to populate another in Python

2 Answers2