How to aggregate NumPy record array (sum, min, max, etc.)?

Question

Consider a simple record array structure:

import numpy as np
ijv_dtype = [
    ('I', 'i'),
    ('J', 'i'),
    ('v', 'd'),
]
ijv = np.array([
    (0, 0, 3.3),
    (0, 1, 1.1),
    (0, 1, 4.4),
    (1, 1, 2.2),
    ], ijv_dtype)
print(ijv)  # [(0, 0, 3.3) (0, 1, 1.1) (0, 1, 4.4) (1, 1, 2.2)]

I'd like to aggregate certain statistics (sum, min, max, etc.) from v by grouping unique combinations of I and J. Thinking from SQL, the expected result is:

select i, j, sum(v) as v from ijv group by i, j;
 i | j |  v
---+---+-----
 0 | 0 | 3.3
 0 | 1 | 5.5
 1 | 1 | 2.2

(the order is not important)

The best I can think up for NumPy is ugly, and I'm not confident I've ordered the result correctly (although it seems to work here):

# Get unique groups, index and inverse
u_ij, idx_ij, inv_ij = np.unique(ijv[['I', 'J']], return_index=True, return_inverse=True)
# Assemble aggregate
a_ijv = np.zeros(len(u_ij), ijv_dtype)
a_ijv['I'] = u_ij['I']
a_ijv['J'] = u_ij['J']
a_ijv['v'] = [ijv['v'][inv_ij == i].sum() for i in range(len(u_ij))]
print(a_ijv)  # [(0, 0, 3.3) (0, 1, 5.5) (1, 1, 2.2)]

I'd like to think there is a better way to do this! I'm using NumPy 1.4.1.

My first try would be to collect the data in a `collections.defaultdict(list)`, using `(i,j)` tuples as keys. Then I could perform the needed statistics on each of the lists. — hpaulj, Oct 09 '15 at 02:09
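A minimal sketch of what that comment describes, assuming the `ijv` array from the question. The names `groups`, `sums`, `mins` and `maxs` are made up for illustration, and this uses only the standard library plus basic NumPy, so it should not depend on a recent NumPy version:

```python
from collections import defaultdict

import numpy as np

ijv = np.array(
    [(0, 0, 3.3), (0, 1, 1.1), (0, 1, 4.4), (1, 1, 2.2)],
    dtype=[('I', 'i'), ('J', 'i'), ('v', 'd')],
)

# Collect the v values of each (I, J) group in a list.
# tolist() turns every record into a plain Python tuple.
groups = defaultdict(list)
for i, j, v in ijv.tolist():
    groups[(i, j)].append(v)

# Any statistic can then be computed per group.
sums = {key: sum(vals) for key, vals in groups.items()}
mins = {key: min(vals) for key, vals in groups.items()}
maxs = {key: max(vals) for key, vals in groups.items()}
```

This loops in Python, so it will likely be slow for large arrays, but it is easy to follow and works for any statistic.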

score 1 · Answer 1 · answered Oct 09 '15 at 01:58

NumPy is a bit too low-level for tasks like this. I think your solution is fine if you have to use pure NumPy, but if you don't mind using something with a higher level of abstraction, try pandas:

import pandas as pd

df = pd.DataFrame({
    'I': (0, 0, 0, 1),
    'J': (0, 1, 1, 1),
    'v': (3.3, 1.1, 4.4, 2.2)})

print(df)
print(df.groupby(['I', 'J']).sum())

Output:

   I  J    v
0  0  0  3.3
1  0  1  1.1
2  0  1  4.4
3  1  1  2.2
       v
I J     
0 0  3.3
  1  5.5
1 1  2.2
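Since the question asks for several statistics, here is a hedged sketch that computes sum, min and max in one pass with `agg`. It assumes the record array `ijv` from the question and builds the DataFrame directly from it. The names `stats` and `result` are only for illustration:

```python
import numpy as np
import pandas as pd

ijv = np.array(
    [(0, 0, 3.3), (0, 1, 1.1), (0, 1, 4.4), (1, 1, 2.2)],
    dtype=[('I', 'i'), ('J', 'i'), ('v', 'd')],
)

# A structured array converts to a DataFrame with one column per field.
df = pd.DataFrame(ijv)

# One row per (I, J) group, one column per statistic of v.
stats = df.groupby(['I', 'J'])['v'].agg(['sum', 'min', 'max']).reset_index()

# Convert back to a NumPy record array if that is what the rest of the code expects.
result = stats.to_records(index=False)
```

`reset_index()` turns the group keys back into ordinary columns, which keeps the layout close to the SQL result in the question.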

With an early `numpy` version, `pandas` might not be an option. — hpaulj, Oct 09 '15 at 02:07

score 1 · Answer 2 · answered Nov 05 '19 at 16:03

It's not a huge step up from what you have already but it at least gets rid of the for loop.

# Starting with your original setup

# Get the unique ij values and the mapping from ungrouped to grouped.
u_ij, inv_ij = np.unique(ijv[['I', 'J']], return_inverse=True)

# Create a float totals array with one slot per group. You could do the fancy ijv_dtype thing if you wanted.
totals = np.zeros(len(u_ij))

# Here's the magic bit. You can think of it as
# totals[inv_ij] += ijv["v"]
# except that, sadly, the line above only applies one addition per repeated index.
np.add.at(totals, inv_ij, ijv["v"])

print(totals)

The fact that you are using NumPy's structured (multi-field) dtype is a bit of an indicator that you should be using pandas. It generally makes for less awkward code when trying to keep your I, J and v values together.
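The same idea should extend to the other statistics in the question, since `at` is available on any binary ufunc. The following is a sketch under that assumption, with `sums`, `mins`, `maxs` and `sums_bc` as illustrative names. Note that `ufunc.at` was, as far as I know, added in NumPy 1.8, so it would not be available on the 1.4.1 mentioned in the question:

```python
import numpy as np

ijv = np.array(
    [(0, 0, 3.3), (0, 1, 1.1), (0, 1, 4.4), (1, 1, 2.2)],
    dtype=[('I', 'i'), ('J', 'i'), ('v', 'd')],
)

# Unique (I, J) groups and, for each row, the index of its group.
u_ij, inv_ij = np.unique(ijv[['I', 'J']], return_inverse=True)
n = len(u_ij)

# Start each accumulator at the identity of its operation.
sums = np.zeros(n)
mins = np.full(n, np.inf)
maxs = np.full(n, -np.inf)

# Unbuffered in-place operations: repeated indices are all applied.
np.add.at(sums, inv_ij, ijv['v'])
np.minimum.at(mins, inv_ij, ijv['v'])
np.maximum.at(maxs, inv_ij, ijv['v'])

# For sums only, a weighted bincount gives the same result.
sums_bc = np.bincount(inv_ij, weights=ijv['v'], minlength=n)
```

Because `np.unique` returns the groups in sorted order, the entries of `sums`, `mins` and `maxs` line up with the rows of `u_ij`.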
