There are a couple of interesting solutions that don't depend on groupby. The first is really simple:
def apply_to_bins(func, values, bins):
    return zip(*((bin, func(values[bins == bin])) for bin in set(bins)))
This uses "fancy indexing" instead of grouping, and performs reasonably well for small inputs; a list-comprehension-based variation does a bit better (see below for timings).
def apply_to_bins2(func, values, bins):
    bin_names = sorted(set(bins))
    return bin_names, [func(values[bins == bin]) for bin in bin_names]
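For concreteness, here's what these return on a toy input (a quick check I've added; the exact scalar and array reprs will vary with your numpy and Python versions):
>>> import numpy
>>> x = numpy.array([1, 2, 3, 4, 5, 6])
>>> b = numpy.array(['a', 'b', 'a', 'a', 'c', 'c'])
>>> apply_to_bins2(numpy.prod, x, b)
(['a', 'b', 'c'], [12, 2, 30])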
These have the advantage of being pretty readable. Both also fare better than groupby for small inputs, but they get much slower for large inputs, especially when there are many bins: each bin costs a full boolean-mask pass over values, so their performance is O(n_items * n_bins). A different numpy-based approach is slower for small inputs, but much faster for large inputs, and especially so for large inputs with lots of bins:
import numpy

def apply_to_bins3(func, values, bins):
    # Sort values by bin so that each bin's items are contiguous.
    bins_argsort = bins.argsort()
    values = values[bins_argsort]
    bins = bins[bins_argsort]
    # Find the index where each new bin starts, then split the sorted
    # values into one group per bin.
    group_indices = (bins[1:] != bins[:-1]).nonzero()[0] + 1
    groups = numpy.split(values, group_indices)
    # numpy.unique returns the labels in sorted order, matching the groups.
    return numpy.unique(bins), [func(g) for g in groups]
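As a quick sanity check (again added by me), it agrees with the simpler versions on the toy example above; the array repr depends on your numpy version:
>>> apply_to_bins3(numpy.prod, x, b)
(array(['a', 'b', 'c'], dtype='|S1'), [12, 2, 30])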
Some tests. First for small inputs:
>>> from itertools import groupby
>>> def apply_to_bins_groupby(func, x, b):
...     return zip(*[(k, func(x[list(v)]))
...                  for k, v in groupby(numpy.argsort(b), key=lambda i: b[i])])
...
>>> x = numpy.array([1, 2, 3, 4, 5, 6])
>>> b = numpy.array(['a', 'b', 'a', 'a', 'c', 'c'])
>>>
>>> %timeit apply_to_bins(numpy.prod, x, b)
10000 loops, best of 3: 31.9 us per loop
>>> %timeit apply_to_bins2(numpy.prod, x, b)
10000 loops, best of 3: 29.6 us per loop
>>> %timeit apply_to_bins3(numpy.prod, x, b)
10000 loops, best of 3: 122 us per loop
>>> %timeit apply_to_bins_groupby(numpy.prod, x, b)
10000 loops, best of 3: 67.9 us per loop
apply_to_bins3 doesn't fare too well here, but it's still less than an order of magnitude slower than the fastest. It does better when n_items gets larger:
>>> x = numpy.arange(1, 100000)
>>> b_names = numpy.array(['a', 'b', 'c', 'd'])
>>> b = b_names[numpy.random.random_integers(0, 3, 99999)]
>>>
>>> %timeit apply_to_bins(numpy.prod, x, b)
10 loops, best of 3: 27.8 ms per loop
>>> %timeit apply_to_bins2(numpy.prod, x, b)
10 loops, best of 3: 27 ms per loop
>>> %timeit apply_to_bins3(numpy.prod, x, b)
100 loops, best of 3: 13.7 ms per loop
>>> %timeit apply_to_bins_groupby(numpy.prod, x, b)
10 loops, best of 3: 124 ms per loop
And when n_bins goes up, the first two approaches take too long to bother showing here -- around five seconds each. apply_to_bins3 is the clear winner:
>>> x = numpy.arange(1, 100000)
>>> from itertools import product
>>> bn_product = product(['a', 'b', 'c', 'd', 'e'], repeat=5)
>>> b_names = numpy.array(list(''.join(s) for s in bn_product))
>>> b = b_names[numpy.random.random_integers(0, len(b_names) - 1, 99999)]
>>>
>>> %timeit apply_to_bins3(numpy.prod, x, b)
10 loops, best of 3: 109 ms per loop
>>> %timeit apply_to_bins_groupby(numpy.prod, x, b)
1 loops, best of 3: 205 ms per loop
Overall, groupby is probably fine in most cases, but is unlikely to scale well, as suggested by this thread. Using a pure(r) numpy approach is slower for small inputs, but only by a bit; the tradeoff is a good one.
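One further refinement, if func happens to be the reduction of a binary ufunc (numpy.prod is just numpy.multiply.reduce): the Python-level loop over groups can be dropped entirely with reduceat. This is only a sketch along the same sort-and-split lines, not benchmarked here, and apply_to_bins4 is a name I'm making up:
def apply_to_bins4(ufunc, values, bins):
    # Same sort-based idea as apply_to_bins3, but instead of splitting and
    # looping in Python, reduce each contiguous group with ufunc.reduceat.
    # `ufunc` must be a binary ufunc, e.g. numpy.multiply or numpy.add.
    order = bins.argsort()
    values = values[order]
    bins = bins[order]
    # Start index of every group, including the first group at 0.
    starts = numpy.concatenate(([0], (bins[1:] != bins[:-1]).nonzero()[0] + 1))
    return numpy.unique(bins), ufunc.reduceat(values, starts)
Here numpy.multiply plays the role numpy.prod played above, so apply_to_bins4(numpy.multiply, x, b) should give the same bins and products as apply_to_bins3(numpy.prod, x, b).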