
I have this data:

self.data = [(1, 1, 5.0),
             (1, 2, 3.0),
             (1, 3, 4.0),
             (2, 1, 4.0),
             (2, 2, 2.0)]

When I run this code:

for mid, group in itertools.groupby(self.data, key=operator.itemgetter(0)):

then `list(group)` for the first group gives me:

[(1, 1, 5.0),
 (1, 2, 3.0),
 (1, 3, 4.0)]

which is what I want.

But if I use 1 instead of 0

for mid, group in itertools.groupby(self.data, key=operator.itemgetter(1)):

to group by the second number in the tuples, I only get:

[(1, 1, 5.0)]

even though there are other tuples that have 1 in that second position.
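
Printing every group shows what is actually happening (a minimal sketch, with `data` standing in for `self.data`):

import itertools
import operator

data = [(1, 1, 5.0),
        (1, 2, 3.0),
        (1, 3, 4.0),
        (2, 1, 4.0),
        (2, 2, 2.0)]

# Grouping on the second element without sorting starts a new group
# every time that value changes, so nothing is merged across runs.
for mid, group in itertools.groupby(data, key=operator.itemgetter(1)):
    print(mid, list(group))
# 1 [(1, 1, 5.0)]
# 2 [(1, 2, 3.0)]
# 3 [(1, 3, 4.0)]
# 1 [(2, 1, 4.0)]
# 2 [(2, 2, 2.0)]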


3 Answers


`itertools.groupby` collects together contiguous items with the same key. If you want all items with the same key, you have to sort `self.data` first.

for mid, group in itertools.groupby(
    sorted(self.data, key=operator.itemgetter(1)), key=operator.itemgetter(1)):
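
With the data from the question, the full loop would then look roughly like this (a sketch; `data` stands in for `self.data`):

import itertools
import operator

data = [(1, 1, 5.0), (1, 2, 3.0), (1, 3, 4.0), (2, 1, 4.0), (2, 2, 2.0)]

# Sort on the same key that groupby uses, so equal keys become contiguous.
for mid, group in itertools.groupby(sorted(data, key=operator.itemgetter(1)),
                                    key=operator.itemgetter(1)):
    print(mid, list(group))
# 1 [(1, 1, 5.0), (2, 1, 4.0)]
# 2 [(1, 2, 3.0), (2, 2, 2.0)]
# 3 [(1, 3, 4.0)]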
unutbu
  • I had sorted previously on position zero. So I just sorted again before doing the groupby and it works: `self.data.sort(key=operator.itemgetter(1))` – user994165 Nov 14 '11 at 02:41
  • There is no need to sort; you want to use a *dictionary* instead: `grouped = {}` then `for v in self.data: grouped.setdefault(v[1], []).append(v)`. Sorting is an O(NlogN) operation, whereas using a dictionary to group the values lets you complete the task in O(N) time. – Martijn Pieters Sep 08 '19 at 21:15

A variant without sorting (using a dictionary instead). It should be better performance-wise.

from collections import defaultdict

def full_group_by(l, key=lambda x: x):
    # Collect every item under its key; no sorting needed, runs in O(N).
    d = defaultdict(list)
    for item in l:
        d[key(item)].append(item)
    return d.items()
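
For example, with the data from the question (a sketch, not part of the original answer), grouping on the second element:

import operator

data = [(1, 1, 5.0), (1, 2, 3.0), (1, 3, 4.0), (2, 1, 4.0), (2, 2, 2.0)]

# Keys come out in order of first appearance rather than sorted order.
for k, group in full_group_by(data, key=operator.itemgetter(1)):
    print(k, group)
# 1 [(1, 1, 5.0), (2, 1, 4.0)]
# 2 [(1, 2, 3.0), (2, 2, 2.0)]
# 3 [(1, 3, 4.0)]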
Konstantine Rybnikov
  • Came back to post the same thing, I hadn't read your answer! This is clearly the way to go :) – Andy Hayden Jul 31 '16 at 18:24
  • Unfortunately the keys all have to be hashable then, so it doesn't work if those are for example lists, unlike with `itertools.groupby`... – Jeronimo Aug 22 '18 at 08:11
  • @Jeronimo: you'd try to find a hashable reflection of the key; say `tuple()` for a list key or `frozenset(d.items())` for dictionaries, etc. If that's really not possible, only then would you have to fall back to the O(NlogN) price of sorting. Using a dictionary to group lets you complete the task in linear (O(N)) time. – Martijn Pieters Sep 08 '19 at 22:15
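
To illustrate the comment above, a minimal sketch of grouping records whose key is a list (the `records` data here is invented for illustration), using `tuple()` as the hashable reflection of the key:

from collections import defaultdict

records = [(['a', 'b'], 1), (['a', 'b'], 2), (['c'], 3)]

grouped = defaultdict(list)
for rec in records:
    # Lists are unhashable, so convert the key to a tuple before using it as a dict key.
    grouped[tuple(rec[0])].append(rec)

for k, items in grouped.items():
    print(k, items)
# ('a', 'b') [(['a', 'b'], 1), (['a', 'b'], 2)]
# ('c',) [(['c'], 3)]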

Below "fixes" several annoyances with Python's itertools.groupby.

def groupby2(l, key=lambda x: x, val=lambda x: x, agg=lambda x: x, sort=True):
    # Sort first (unless told not to) so equal keys are contiguous for groupby.
    if sort:
        l = sorted(l, key=key)
    # Yield (key, agg(values)); val picks what to aggregate from each item.
    return ((k, agg(val(x) for x in v))
            for k, v in itertools.groupby(l, key=key))

Specifically,

  1. It doesn't require that you sort your data first.
  2. It doesn't require that `key` be passed as a keyword argument.
  3. The output is a clean generator of `(key, aggregated_values)` tuples, where the values are selected by the third parameter.
  4. It makes it easy to apply aggregation functions such as `sum` or `mean`.

Example Usage

import itertools
from operator import itemgetter
from statistics import *

t = [('a',1), ('b',2), ('a',3)]
for k,v in groupby2(t, itemgetter(0), itemgetter(1), sum):
  print(k, v)

This prints:

a 4
b 2

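The `statistics` import pays off when a different aggregator is swapped in, for example `mean` (a sketch reusing `t` and `groupby2` from above):

from statistics import mean

for k, v in groupby2(t, itemgetter(0), itemgetter(1), mean):
    print(k, v)
# a 2
# b 2
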

Shital Shah
  • Why are those 'annoyances'? `groupby()` lets you combine *consecutive matching values* into groups; it was never intended to group across a whole series, which requires reading every value in the input iterable. The core use case of the `itertools` module is to avoid consuming all values of an iterator, where possible. – Martijn Pieters Sep 08 '19 at 22:18
  • Note that sorting has a cost: it takes O(NlogN) time to sort N items into a sorted sequence. Grouping *using a dictionary* on the other hand takes linear time (O(N)). Your 'utility function' removes the option of avoiding paying the sorting cost, and because you are not using keyword-only arguments, anyone reading your `groupby2()` calls will have to refer to the documentation each time to figure out what all the arguments do. – Martijn Pieters Sep 08 '19 at 22:21
  • Your `t` would be better processed using `from collections import defaultdict`, `summed = defaultdict(int)`, `for k, v in t: summed[k] += v`, `for k, v in summed.items(): print(k, v)`. That's *much more self-evident* as to what the code achieves, and does so in linear time, no sorting needed. – Martijn Pieters Sep 08 '19 at 22:24
  • @MartijnPieters The example is just for demo. There are certainly more efficient ways to do this. – Shital Shah Sep 09 '19 at 21:05
  • See also: [more_itertools.groupby_transform(iterable, keyfunc=None, valuefunc=None, reducefunc=None)](https://more-itertools.readthedocs.io/en/stable/api.html#more_itertools.groupby_transform). `keyfunc` is similar to your `key`, `valuefunc` is similar to your `val`, and `reducefunc` is similar to your `agg`. (A sketch follows after these comments.) – Stef Oct 28 '21 at 12:38
  • I'd argue these "annoyances" are quite legitimate. If you look at comparable utilities in other languages (say .NET LINQ's `GroupBy`), they do not have this requirement. I do not think it is a stretch to suggest that most users of this function are applying it to unordered collections, and thus sorting solely out of necessity. Given the default behavior is now essentially locked in, a keyword argument is a reasonable way to expose this. – Siddhartha Gandhi Jul 31 '22 at 00:38