12

I have two lists of dictionaries (returned as Django querysets). Each dictionary has an ID value. I'd like to merge the two into a single list of dictionaries, based on the ID value.

For example:

list_a = [{'user__name': u'Joe', 'user__id': 1},
          {'user__name': u'Bob', 'user__id': 3}]
list_b = [{'hours_worked': 25, 'user__id': 3},
          {'hours_worked': 40, 'user__id': 1}]

and I want a function to yield:

list_c = [{'user__name': u'Joe', 'user__id': 1, 'hours_worked': 40},
          {'user__name': u'Bob', 'user__id': 3, 'hours_worked': 25}]

Additional points to note:

  • The IDs in the lists may not be in the same order (as with the example above).
  • The lists will probably have the same number of elements, but I want to account for the case where they don't, keeping all the values from list_a (essentially list_a LEFT OUTER JOIN list_b USING (user__id)).
  • I've tried doing this in SQL but it's not possible since some of the values are aggregates based on some exclusions.
  • It's safe to assume there will only be at most one dictionary with the same user__id in each list due to the database queries used.

Many thanks for your time.

edkay

2 Answers

19

I'd use itertools.groupby to group the elements:

import itertools

lst = sorted(itertools.chain(list_a, list_b), key=lambda x: x['user__id'])
list_c = []
for k, v in itertools.groupby(lst, key=lambda x: x['user__id']):
    d = {}
    for dct in v:        # merge every dict in this id's group into one
        d.update(dct)
    list_c.append(d)
    # could also do:
    # list_c.append(dict(itertools.chain.from_iterable(dct.items() for dct in v)))
    # although that might be a little harder to read.
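
For what it's worth, running this on the question's list_a and list_b should yield exactly the list_c the OP asked for, ordered by user__id because of the sort. (And since every ID from either list gets its own group, unmatched entries survive too, so this is effectively a full outer join rather than list_a-left only.)

>>> list_c
[{'user__name': u'Joe', 'user__id': 1, 'hours_worked': 40},
 {'user__name': u'Bob', 'user__id': 3, 'hours_worked': 25}]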

If you have an aversion to lambda functions, you can always use operator.itemgetter('user__id') instead. (It's probably slightly more efficient, too.)

To demystify lambda/itemgetter a little bit, note that:

def foo(x):
    return x['user__id']

is the same thing* as either of the following:

foo = operator.itemgetter('user__id')
foo = lambda x: x['user__id']

*There are a few differences, but they're not important for this problem
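
(One concrete difference, for the curious: itemgetter can fetch several keys at once and return them as a tuple, something the lambda version would have to spell out by hand. The sample dict below is made up purely for illustration.)

import operator

row = {'user__id': 1, 'hours_worked': 40}   # sample record, not from the question
get_pair = operator.itemgetter('user__id', 'hours_worked')
get_pair(row)  # -> (1, 40)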

mgilson
  • `operator.itemgetter()` (http://docs.python.org/3/library/operator.html#operator.itemgetter) might be a good call here. – Gareth Latty Dec 20 '12 at 15:11
  • one-liner `[dict(y for x in g for y in x.items()) for k,g in groupby(lis,key=lambda x:x['user__id'])]` – Ashwini Chaudhary Dec 20 '12 at 15:17
  • Great solution, but worth noting that this will trample all but the last value in the result set for the same `user__id` if there are multiple rows for that `user__id` that contain the same value key. Probably fine for this question, but could be a tricky problem if it is a concern. – Silas Ray Dec 20 '12 at 15:19
  • 1
    @sr2222 -- You're right, it will do that, but if that is a concern, then this isn't a well-posed problem (OP never said how that should be handled) :) – mgilson Dec 20 '12 at 15:20
  • Well, the specs he provides don't explicitly state that such a condition can't occur, but given the type of data he appears to be working with, it's probably a safe assumption that there are no duplicates. – Silas Ray Dec 20 '12 at 15:21
  • Wow, impressed by the number and speed of responses here. Very much appreciated. I've tried the code originally suggested by @mgilson and it works a charm. Now to do a bit more reading to fully understand how it works :) – edkay Dec 20 '12 at 15:29
  • @sr2222 -- Sure he doesn't specify that it can't occur (maybe it can). But he doesn't specify *how* it should be handled should the case arise. And that's not something that I think we could reasonably guess (as far as I can see it, keeping the last one is just as good of a way to handle it as anything else). – mgilson Dec 20 '12 at 15:29
  • @sr2222 Good shout. Thankfully for this situation there won't be any duplicate `user__id` value keys due to the db query used. – edkay Dec 20 '12 at 15:31
  • Sorting, grouping, and itemgetter all seem like unnecessary overhead for some dicts. – Marcin Dec 20 '12 at 15:44
  • @Marcin -- Maybe. `grouping` really doesn't introduce any more overhead than your simple for loop. `itemgetter` doesn't introduce much more overhead than is already present in `__getitem__`, so `sorting` is the only stage which is really "unnecessary" here. However, if OP wants to have a list at the end of the day, it's possible that having a sorted list is desirable in which case OP would need to sort your output as well. (that said, your output would be smaller, so it would be a faster sort). Anyway, yours is a nice answer. +1 to it. – mgilson Dec 20 '12 at 15:52
  • @mgilson Quite. It's not just computational overhead, but also simple code length and readability. – Marcin Dec 20 '12 at 15:59
6
from collections import defaultdict
from itertools import chain

list_a = [{'user__name': u'Joe', 'user__id': 1},
          {'user__name': u'Bob', 'user__id': 3}]
list_b = [{'hours_worked': 25, 'user__id': 3},
          {'hours_worked': 40, 'user__id': 1}]

collector = defaultdict(dict)

# one pass over both lists; each dict merges into the entry for its user__id
for collectible in chain(list_a, list_b):
    collector[collectible['user__id']].update(collectible)

list_c = list(collector.values())

As you can see, this just uses another dict to merge the existing dicts. The trick with defaultdict is that it takes out the drudgery of creating a dict for a new entry.

There is no need to group or sort these inputs. The dict takes care of all of that.
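
A minimal illustration of that trick (the key 42 and the name are made up):

from collections import defaultdict

collector = defaultdict(dict)
collector[42].update({'user__name': u'Ann'})  # the entry for 42 springs into existence
# a plain dict would blow up on the same line:
# {}[42].update({'user__name': u'Ann'})  # KeyError: 42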

A truly bulletproof solution would catch the potential KeyError in case an input dict does not have a 'user__id' key, or use a default value to collect up all of the dicts without such a key.
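
For instance, one possible sketch of the default-value variant, assuming a shared sentinel bucket is an acceptable policy for keyless records:

from collections import defaultdict
from itertools import chain

MISSING = object()  # hypothetical sentinel for dicts lacking 'user__id'
collector = defaultdict(dict)

for collectible in chain(list_a, list_b):
    # .get sidesteps the KeyError; all keyless dicts merge under MISSING
    collector[collectible.get('user__id', MISSING)].update(collectible)

list_c = list(collector.values())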

Marcin