1

I have a list of ~12K dictionaries. Each dictionary has the same keys: year, code and category.

L = [{"year": "2015", "code": "VU", "category": "Vulnerable"}, {"year": "2008", "code": "VU", "category": "Vulnerable"}, {"year": "2004", "code": "LC", "category": "Least Concern"}]

I'm trying to create a new dictionary that will have, as key, each value of code and, as the value to that key, a list of unique years for each code (I don't necessarily need the category key-value pair):

{"VU": {2008, 2015}, "LC": {2004}}

I created a dictionary codes_dict with the correct codes as keys, and empty sets as values (since I don't want duplicates, and I really only need the earliest and latest years.)

codes = (e['code'] for e in L)
codes_dict = dict.fromkeys(codes, set())

for e in L:
    codes_dict[e['code']].add(e['year'])

However, when I try to populate the values, I get every year added to every code:

{'VU': {'2004', '2008', '2015'}, 'LC': {'2004', '2008', '2015'}}

What am I missing? I tried using a list instead of a set and got the same result (with duplicates). Also using = instead of add() means only the last value is added, whereas I want the whole range.

Performance isn't really an issue, as this is just supposed to be a quick diagnostic.

Bonus: if there is a better way to do this in pandas, I'd love to hear it.

Thanks!

2 Answers2

1

You can use dataframe using the following one-liner.

output = {k: list(filter(lambda x: isinstance(x, str) or not numpy.isnan(x), v)) for k, v in pandas.DataFrame(L).pivot(columns='code', values='year').to_dict('list').items()}

Breaking it down:

  1. Load list into dataframe
df=pandas.DataFrame(L)
  1. Create a pivot table where the different codes are columns and each cell contain the year corresponding to the code.
pivot_table = df.pivot(columns='code', values='year')

The result should look like:

       VU    LC
0     2008   NaN
1     2015   NaN
2     NaN    2004     
  1. Output dict of list from dataframe
output = pivot_table.to_dict('list')

The result should look like:

{‘VU’: [‘2008’, ‘2015’, NaN], ‘VL’: [NaN, NaN, ‘2004’]}
  1. This is almost there. Just need to filter out the NaNs.
output = {k: list(filter(lambda x: isinstance(x, str) or not numpy.isnan(x), v)) for k, v in output.items()}

This uses filter function to filter out non string and NaN objects. Then uses dictionary comprehension to reconstruct the dictionary without the NaNs.

Tim
  • 3,178
  • 1
  • 13
  • 26
  • Thanks, @Tim. My pandas knowledge is still fairly beginner so this isn't easily parseable for me. If you can, a few words walking me through the expression would be very appreciated. – panopticonopolis Dec 27 '19 at 23:19
  • Thanks, that's supremely helpful. Lots of great new methods to wrap my head around. – panopticonopolis Dec 28 '19 at 20:54
1

In your code, all of your values point to the same set. Try instead (using a defaultdict; you could instead use get and set each element to be a new set if it doesn't yet exist)

from collections import defaultdict

L = [{"year": "2015", "code": "VU", "category": "Vulnerable"}, {"year": "2008", "code": "VU", "category": "Vulnerable"}, {"year": "2004", "code": "LC", "category": "Least Concern"}]


codes_dict = defaultdict(set)
for e in L:
    codes_dict[e['code']].add(e['year'])

print(dict(codes_dict))
kopecs
  • 1,545
  • 10
  • 20