10

I have a massive data array (500k rows) that looks like:

id  value  score
1   20     20
1   10     30
1   15     0
2   12     4
2   3      8
2   56     9
3   6      18
...

As you can see, there is a non-unique ID column to the left, and various scores in the 3rd column.

I'm looking to quickly add up all of the scores, grouped by IDs. In SQL this would look like SELECT sum(score) FROM table GROUP BY id

With NumPy I've tried iterating through each ID, truncating the table by each ID, and then summing the score up for that table.

table_trunc = table[(table == id).any(1)]
score       = sum(table_trunc[:,2])

Unfortunately I'm finding the first command to be dog-slow. Is there any more efficient way to do this?

Cleb
thegreatt
  • See http://stackoverflow.com/questions/4651683/numpy-grouping-using-itertools-groupby-performance for info about numpy grouping – agf Aug 17 '11 at 08:05

7 Answers

13

You can use np.bincount():

import numpy as np

ids = [1, 1, 1, 2, 2, 2, 3]
data = [20, 30, 0, 4, 8, 9, 18]

print(np.bincount(ids, weights=data))

The output is [ 0. 50. 21. 18.], which means the sum for id==0 is 0, the sum for id==1 is 50, for id==2 it is 21, and for id==3 it is 18.
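Applied to a table like the one in the question (assuming it is a NumPy array with the IDs in column 0 and the scores in column 2), this becomes:

```python
import numpy as np

# Table from the question: columns are id, value, score
table = np.array([
    [1, 20, 20],
    [1, 10, 30],
    [1, 15,  0],
    [2, 12,  4],
    [2,  3,  8],
    [2, 56,  9],
    [3,  6, 18],
])

# bincount needs non-negative integer bins; the id column qualifies here
sums = np.bincount(table[:, 0], weights=table[:, 2])
print(sums)  # index i holds the summed score for id == i
```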

HYRY
    But this is only applicable for 1d array. The poster wants a solution for 2d array. Is there any solution? – pfc Jun 27 '17 at 09:02
1

If you're looking only for sum, you probably want to go with bincount. If you also need other grouping operations such as product, mean, or std, have a look at https://github.com/ml31415/numpy-groupies. It's the fastest python/numpy grouping library around; see the speed comparison there.

Your sum operation there would look like:

from numpy_groupies import aggregate

res = aggregate(id, score)  # the default aggregation function is 'sum'
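If you'd rather avoid the extra dependency, the grouped-sum case of such an aggregate can be sketched in pure NumPy with np.add.at (an illustration of the idea only, not the numpy_groupies implementation):

```python
import numpy as np

ids = np.array([1, 1, 1, 2, 2, 2, 3])
scores = np.array([20, 30, 0, 4, 8, 9, 18])

# Accumulate each score into the slot of its group id
res = np.zeros(ids.max() + 1, dtype=scores.dtype)
np.add.at(res, ids, scores)
print(res)  # res[i] is the summed score for id == i
```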
Michael
1

I noticed the numpy tag, but in case you don't mind using pandas (or if you read in these data using that module), this task becomes a one-liner:

import pandas as pd

df = pd.DataFrame({'id': [1,1,1,2,2,2,3], 'score': [20,30,0,4,8,9,18]})

So your dataframe would look like this:

  id  score
0   1     20
1   1     30
2   1      0
3   2      4
4   2      8
5   2      9
6   3     18

Now you can use the functions groupby() and sum():

df.groupby(['id'], sort=False).sum()

which gives you the desired output:

    score
id       
1      50
2      21
3      18

By default, groupby sorts the group keys; I therefore pass sort=False, which might improve speed for huge dataframes.
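If the data is already in a NumPy array like in the question, it can be wrapped in a DataFrame first (the column names here are my own choice):

```python
import numpy as np
import pandas as pd

table = np.array([
    [1, 20, 20],
    [1, 10, 30],
    [1, 15,  0],
    [2, 12,  4],
    [2,  3,  8],
    [2, 56,  9],
    [3,  6, 18],
])

df = pd.DataFrame(table, columns=['id', 'value', 'score'])
result = df.groupby('id', sort=False)['score'].sum()
print(result)
```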

Cleb
1

You can use a for loop and numba:

import numpy as np
from numba import njit

@njit
def wbcnt(b, w, k):
    bins = np.arange(k)
    bins = bins * 0  # k zero-valued bins; bins[i] accumulates the weights for id == i
    for i in range(len(b)):
        bins[b[i]] += w[i]
    return bins

Using @HYRY's variables (as NumPy arrays, which numba compiles efficiently):

ids = np.array([1, 1, 1, 2, 2, 2, 3])
data = np.array([20, 30, 0, 4, 8, 9, 18])

Then:

wbcnt(ids, data, 4)

array([ 0, 50, 21, 18])

Timing

%timeit wbcnt(ids, data, 4)
%timeit np.bincount(ids, weights=data)

1000000 loops, best of 3: 1.99 µs per loop
100000 loops, best of 3: 2.57 µs per loop
piRSquared
0

The numpy_indexed package has vectorized functionality to perform this operation efficiently, in addition to many related operations of this kind:

import numpy_indexed as npi
npi.group_by(id).sum(score)  # id and score: the corresponding columns of the table
Eelco Hoogendoorn
0

You can try using boolean operations:

import numpy as np

ids = np.array([1, 1, 1, 2, 2, 2, 3])
data = np.array([20, 30, 0, 4, 8, 9, 18])

[((ids == i) * data).sum() for i in np.unique(ids)]

This may be a bit more efficient than using np.any, but it will clearly have trouble if you have a very large number of unique ids on top of a large overall data table.

dtlussier
-1

Using itertools.groupby, you can group on the ID and then iterate over the grouped data.

(The data must be sorted according to the group by func, in this case ID)

>>> import itertools
>>> data = [(1, 20, 20), (1, 10, 30), (1, 15, 0), (2, 12, 4), (2, 3, 0)]
>>> groups = itertools.groupby(data, lambda x: x[0])
>>> for i in groups:
        for y in i:
            if isinstance(y, int):
                print(y)
            else:
                for p in y:
                    print('-', p)

Output:

1
- (1, 20, 20)
- (1, 10, 30)
- (1, 15, 0)
2
- (2, 12, 4)
- (2, 3, 0)
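To get the grouped sums the question asks for, the same groupby can feed sum() (a sketch; note that the input must already be sorted by ID):

```python
import itertools

# Rows from the question, already sorted by the id in column 0
data = [(1, 20, 20), (1, 10, 30), (1, 15, 0),
        (2, 12, 4), (2, 3, 8), (2, 56, 9), (3, 6, 18)]

# Sum the score (column 2) within each group of equal ids
sums = {key: sum(row[2] for row in group)
        for key, group in itertools.groupby(data, lambda x: x[0])}
print(sums)
```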
roqvist
  • I think this is unlikely to be fast because it does the work in Python instead of in C like if you do it in `numpy`? – agf Aug 17 '11 at 11:16