Consider a simple record array structure:

import numpy as np
ijv_dtype = [
    ('I', 'i'),
    ('J', 'i'),
    ('v', 'd'),
]
ijv = np.array([
    (0, 0, 3.3),
    (0, 1, 1.1),
    (0, 1, 4.4),
    (1, 1, 2.2),
    ], ijv_dtype)
print(ijv)  # [(0, 0, 3.3) (0, 1, 1.1) (0, 1, 4.4) (1, 1, 2.2)]

I'd like to aggregate certain statistics (sum, min, max, etc.) of v, grouped by the unique combinations of I and J. In SQL terms, the expected result is:

select i, j, sum(v) as v from ijv group by i, j;
 i | j |  v
---+---+-----
 0 | 0 | 3.3
 0 | 1 | 5.5
 1 | 1 | 2.2

(the order is not important)

The best I can come up with in NumPy is ugly, and I'm not confident I've ordered the result correctly (although it seems to work here):

# Get unique groups, index and inverse
u_ij, idx_ij, inv_ij = np.unique(ijv[['I', 'J']], return_index=True, return_inverse=True)
# Assemble aggregate
a_ijv = np.zeros(len(u_ij), ijv_dtype)
a_ijv['I'] = u_ij['I']
a_ijv['J'] = u_ij['J']
a_ijv['v'] = [ijv['v'][inv_ij == i].sum() for i in range(len(u_ij))]
print(a_ijv)  # [(0, 0, 3.3) (0, 1, 5.5) (1, 1, 2.2)]

I'd like to think there is a better way to do this! I'm using NumPy 1.4.1.

Mike T
My first try would be to collect the data in a `collections.defaultdict(list)`, using `(i, j)` tuples as keys. Then I could perform the needed statistics on each of the lists. – hpaulj Oct 09 '15 at 02:09
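
The dictionary idea from the comment could look roughly like the sketch below. It is only an illustration: the names `groups`, `sums`, `mins` and `maxs` are made up here, and a Python-level loop like this will be slow on large arrays.

```python
from collections import defaultdict

import numpy as np

ijv = np.array(
    [(0, 0, 3.3), (0, 1, 1.1), (0, 1, 4.4), (1, 1, 2.2)],
    dtype=[('I', 'i'), ('J', 'i'), ('v', 'd')],
)

# Collect the v values belonging to each (I, J) pair.
# tolist() converts every record to a plain Python tuple.
groups = defaultdict(list)
for i, j, v in ijv.tolist():
    groups[(i, j)].append(v)

# Apply whichever statistics are needed to each list of values.
sums = {key: sum(vals) for key, vals in groups.items()}
mins = {key: min(vals) for key, vals in groups.items()}
maxs = {key: max(vals) for key, vals in groups.items()}
```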

2 Answers

NumPy is a bit too low-level for tasks like this. I think your solution is fine if you have to use pure NumPy, but if you don't mind a higher level of abstraction, try pandas:

import pandas as pd

df = pd.DataFrame({
    'I': (0, 0, 0, 1),
    'J': (0, 1, 1, 1),
    'v': (3.3, 1.1, 4.4, 2.2)})

print(df)
print(df.groupby(['I', 'J']).sum())

Output:

   I  J    v
0  0  0  3.3
1  0  1  1.1
2  0  1  4.4
3  1  1  2.2
       v
I J     
0 0  3.3
  1  5.5
1 1  2.2
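
Since the question also asks about min and max, here is a possible extension of the same approach. It assumes a reasonably recent pandas; `stats` and `flat` are names chosen only for this sketch.

```python
import pandas as pd

df = pd.DataFrame({
    'I': (0, 0, 0, 1),
    'J': (0, 1, 1, 1),
    'v': (3.3, 1.1, 4.4, 2.2)})

# Several statistics of v per (I, J) group in a single call.
stats = df.groupby(['I', 'J'])['v'].agg(['sum', 'min', 'max'])

# reset_index() turns the (I, J) index levels back into ordinary columns,
# which is closer to the layout of the SQL result.
flat = stats.reset_index()
print(flat)
```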
fjarri

It's not a huge step up from what you already have, but it at least gets rid of the for loop.

# Starting with your original setup

# Get the unique ij values and the mapping from ungrouped to grouped.
u_ij, inv_ij = np.unique(ijv[['I', 'J']], return_inverse=True)

# Create a float totals array with one slot per unique (I, J) group.
# You could do the fancy ijv_dtype thing if you wanted.
totals = np.zeros(len(u_ij))

# Here's the magic bit. You can think of it as
#     totals[inv_ij] += ijv["v"]
# except that form applies only one update per repeated index,
# so values in the same group would not accumulate.
np.add.at(totals, inv_ij, ijv["v"])

print(totals)
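
The same trick should carry over to other statistics, because `at` is a method of every binary ufunc. A self-contained sketch follows; it assumes a NumPy version where `np.unique` accepts the multi-field selection, and `n_groups`, `sums`, `mins` and `maxs` are names invented for this example.

```python
import numpy as np

ijv = np.array(
    [(0, 0, 3.3), (0, 1, 1.1), (0, 1, 4.4), (1, 1, 2.2)],
    dtype=[('I', 'i'), ('J', 'i'), ('v', 'd')],
)

# Unique (I, J) pairs plus, for every row, the index of its group.
u_ij, inv_ij = np.unique(ijv[['I', 'J']], return_inverse=True)
inv_ij = inv_ij.ravel()  # make sure the group index is 1-D
n_groups = len(u_ij)

# Sum: start from 0 and add every value into its group's slot.
sums = np.zeros(n_groups)
np.add.at(sums, inv_ij, ijv['v'])

# Min: start from +inf so that any real value replaces it.
mins = np.full(n_groups, np.inf)
np.minimum.at(mins, inv_ij, ijv['v'])

# Max: start from -inf for the same reason.
maxs = np.full(n_groups, -np.inf)
np.maximum.at(maxs, inv_ij, ijv['v'])
```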

The fact that you are using NumPy's structured dtypes is a bit of an indicator that you should be using pandas. It generally makes for less awkward code when trying to keep your `I`s, `J`s and `v`s together.
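
If you need to stay in pure NumPy on an older release (`np.add.at` was only added in NumPy 1.8), one alternative is to sort by the keys and use `reduceat`. This is a different technique from the one above, shown only as a sketch; `order`, `s`, `is_start` and `starts` are names made up for it.

```python
import numpy as np

ijv = np.array(
    [(0, 0, 3.3), (0, 1, 1.1), (0, 1, 4.4), (1, 1, 2.2)],
    dtype=[('I', 'i'), ('J', 'i'), ('v', 'd')],
)

# Sort rows by (I, J). lexsort treats the LAST key as the primary one.
order = np.lexsort((ijv['J'], ijv['I']))
s = ijv[order]

# A new group starts wherever I or J differs from the previous row.
is_start = np.ones(len(s), dtype=bool)
is_start[1:] = (s['I'][1:] != s['I'][:-1]) | (s['J'][1:] != s['J'][:-1])
starts = np.flatnonzero(is_start)

# reduceat reduces each slice s['v'][starts[k]:starts[k + 1]].
sums = np.add.reduceat(s['v'], starts)
mins = np.minimum.reduceat(s['v'], starts)
maxs = np.maximum.reduceat(s['v'], starts)

# Assemble the aggregate in the original record layout.
a_ijv = np.zeros(len(starts), dtype=ijv.dtype)
a_ijv['I'] = s['I'][starts]
a_ijv['J'] = s['J'][starts]
a_ijv['v'] = sums
```

Because the rows are sorted first, the result comes out ordered by `I`, then `J`, which also settles the ordering question raised in the original post.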

pullmyteeth