10

I have a massive data array (500k rows) that looks like:

id  value  score
1   20     20
1   10     30
1   15     0
2   12     4
2   3      8
2   56     9
3   6      18
...

As you can see, there is a non-unique ID column to the left, and various scores in the 3rd column.

I'm looking to quickly add up all of the scores, grouped by IDs. In SQL this would look like SELECT sum(score) FROM table GROUP BY id

With NumPy I've tried iterating through each ID, truncating the table by each ID, and then summing the score up for that table.

table_trunc = table[(table == id).any(1)]
score       = sum(table_trunc[:,2])

Unfortunately I'm finding the first command to be dog-slow. Is there any more efficient way to do this?

Cleb
thegreatt
  • See http://stackoverflow.com/questions/4651683/numpy-grouping-using-itertools-groupby-performance for info about numpy grouping – agf Aug 17 '11 at 08:05

7 Answers

13

You can use np.bincount():

import numpy as np

ids = [1, 1, 1, 2, 2, 2, 3]
data = [20, 30, 0, 4, 8, 9, 18]

print(np.bincount(ids, weights=data))

The output is [ 0. 50. 21. 18.], which means the sum for id==0 is 0, the sum for id==1 is 50, for id==2 it is 21, and for id==3 it is 18.
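Applied to a table like the one in the question (assuming it is a NumPy array with the IDs in column 0 and the scores in column 2), this becomes:

```python
import numpy as np

# Table from the question: columns are id, value, score
table = np.array([
    [1, 20, 20],
    [1, 10, 30],
    [1, 15,  0],
    [2, 12,  4],
    [2,  3,  8],
    [2, 56,  9],
    [3,  6, 18],
])

# bincount needs non-negative integer bins; the id column qualifies here
sums = np.bincount(table[:, 0], weights=table[:, 2])
print(sums)  # index i holds the summed score for id == i
```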

HYRY
    But this is only applicable for 1d array. The poster wants a solution for 2d array. Is there any solution? – pfc Jun 27 '17 at 09:02
1

If you're looking only for sum, you probably want to go with bincount. If you also need other grouping operations such as product, mean, or std, have a look at https://github.com/ml31415/numpy-groupies. It's the fastest python/numpy grouping library around; see the speed comparison there.

Your sum operation there would look like:

from numpy_groupies import aggregate

res = aggregate(id, score)  # the default aggregation function is 'sum'
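If you'd rather avoid the extra dependency, the grouped-sum case of such an aggregate can be sketched in pure NumPy with np.add.at (an illustration of the idea only, not the numpy_groupies implementation):

```python
import numpy as np

ids = np.array([1, 1, 1, 2, 2, 2, 3])
scores = np.array([20, 30, 0, 4, 8, 9, 18])

# Accumulate each score into the slot of its group id
res = np.zeros(ids.max() + 1, dtype=scores.dtype)
np.add.at(res, ids, scores)
print(res)  # res[i] is the summed score for id == i
```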
Michael
1

I noticed the numpy tag, but in case you don't mind using pandas (or if you read in these data using that module), this task becomes a one-liner:

import pandas as pd

df = pd.DataFrame({'id': [1,1,1,2,2,2,3], 'score': [20,30,0,4,8,9,18]})

So your dataframe would look like this:

  id  score
0   1     20
1   1     30
2   1      0
3   2      4
4   2      8
5   2      9
6   3     18

Now you can use the functions groupby() and sum():

df.groupby(['id'], sort=False).sum()

which gives you the desired output:

    score
id       
1      50
2      21
3      18

By default, groupby sorts the group keys; I therefore pass sort=False, which might improve speed for huge dataframes.
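If the data is already in a NumPy array like in the question, it can be wrapped in a DataFrame first (the column names here are my own choice):

```python
import numpy as np
import pandas as pd

table = np.array([
    [1, 20, 20],
    [1, 10, 30],
    [1, 15,  0],
    [2, 12,  4],
    [2,  3,  8],
    [2, 56,  9],
    [3,  6, 18],
])

df = pd.DataFrame(table, columns=['id', 'value', 'score'])
result = df.groupby('id', sort=False)['score'].sum()
print(result)
```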

Cleb
1

You can use a for loop and numba:

import numpy as np
from numba import njit

@njit
def wbcnt(b, w, k):
    bins = np.arange(k)
    bins = bins * 0  # k zero-valued bins; bins[i] accumulates the weights for id == i
    for i in range(len(b)):
        bins[b[i]] += w[i]
    return bins

Using @HYRY's variables (as NumPy arrays, which numba compiles efficiently):

ids = np.array([1, 1, 1, 2, 2, 2, 3])
data = np.array([20, 30, 0, 4, 8, 9, 18])

Then:

wbcnt(ids, data, 4)

array([ 0, 50, 21, 18])

Timing

%timeit wbcnt(ids, data, 4)
%timeit np.bincount(ids, weights=data)

1000000 loops, best of 3: 1.99 µs per loop
100000 loops, best of 3: 2.57 µs per loop
piRSquared
0

The numpy_indexed package has vectorized functionality to perform this operation efficiently, in addition to many related operations of this kind:

import numpy_indexed as npi
npi.group_by(id).sum(score)  # id and score: the corresponding columns of the table
Eelco Hoogendoorn
0

You can try using boolean operations:

import numpy as np

ids = np.array([1, 1, 1, 2, 2, 2, 3])
data = np.array([20, 30, 0, 4, 8, 9, 18])

[((ids == i) * data).sum() for i in np.unique(ids)]

This may be a bit more efficient than using np.any, but it will clearly have trouble if you have a very large number of unique ids on top of a large overall data table.

dtlussier
-1

Using itertools.groupby, you can group on the ID and then iterate over the grouped data.

(The data must be sorted according to the group by func, in this case ID)

>>> import itertools
>>> data = [(1, 20, 20), (1, 10, 30), (1, 15, 0), (2, 12, 4), (2, 3, 0)]
>>> groups = itertools.groupby(data, lambda x: x[0])
>>> for i in groups:
        for y in i:
            if isinstance(y, int):
                print(y)
            else:
                for p in y:
                    print('-', p)

Output:

1
- (1, 20, 20)
- (1, 10, 30)
- (1, 15, 0)
2
- (2, 12, 4)
- (2, 3, 0)
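To get the grouped sums the question asks for, the same groupby can feed sum() (a sketch; note that the input must already be sorted by ID):

```python
import itertools

# Rows from the question, already sorted by the id in column 0
data = [(1, 20, 20), (1, 10, 30), (1, 15, 0),
        (2, 12, 4), (2, 3, 8), (2, 56, 9), (3, 6, 18)]

# Sum the score (column 2) within each group of equal ids
sums = {key: sum(row[2] for row in group)
        for key, group in itertools.groupby(data, lambda x: x[0])}
print(sums)
```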
roqvist
  • I think this is unlikely to be fast because it does the work in Python instead of in C like if you do it in `numpy`? – agf Aug 17 '11 at 11:16