
I have two arrays of indices, each of shape m. I need to take the mean of the values between those indices in an array of shape m x n. Can this be done without iterating through each row? What is the fastest way to do this?

idx0 = np.array([1, 3, 2, 5])
idx1 = np.array([5, 8, 6, 7])
d = np.array([[1,2,3,4,5,6,7,8,9], [1,2,3,4,5,6,7,8,9], [1,2,3,4,5,6,7,8,9], [1,2,3,4,5,6,7,8,9]])

Desired result (the per-row slice syntax below is pseudocode, not valid NumPy):

<< d[np.arange(d.shape[0]), idx0:idx1]
>> np.array([3.5, 6, 4.5, 6.5])

The best ways I've found to do this are list comprehension or using a for loop with numba, but this seems like it should be a common problem? Thanks.
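For reference, the explicit-loop version mentioned above looks something like this (a minimal sketch; shown without numba's @njit decorator so it runs with plain NumPy, but it is the kind of function numba can compile):

```python
import numpy as np

def range_means(d, idx0, idx1):
    # Explicit row loop: slice each row between its own pair of indices.
    out = np.empty(d.shape[0])
    for i in range(d.shape[0]):
        out[i] = d[i, idx0[i]:idx1[i]].mean()
    return out

idx0 = np.array([1, 3, 2, 5])
idx1 = np.array([5, 8, 6, 7])
d = np.tile(np.arange(1, 10), (4, 1))  # the 4 x 9 example array above
```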

Mad Physicist
dotto
  • I don't understand what you mean by "mean of the values in between the indices". Can you explain how the desired results derive from the inputs? – Barmar May 08 '23 at 21:04
  • For the first pair, (1, 5), I'm taking (2+3+4+5)/4 = 3.5 – dotto May 08 '23 at 21:25

3 Answers


In your case you'd still need to take a slice over each row's pairwise range of indices from idx0 and idx1.
zip + enumerate with a per-row slice is a straightforward way (note that the slice must be taken per row, not as d[:, slice(*a)] over all rows, which would only give the right answer here because every row of d happens to be identical):

[d[i, a:b].mean() for i, (a, b) in enumerate(zip(idx0, idx1))]

[3.5, 6.0, 4.5, 6.5]
RomanPerekhrest

If you work with d.ravel() instead of d, things will be easier, because now you can find the sums of elements using np.add.reduceat and compute averages directly from that.

The linear indices in d.ravel() corresponding to idx0 and idx1 are

idx0r = idx0 + d.shape[1] * np.arange(d.shape[0])
idx1r = idx1 + d.shape[1] * np.arange(d.shape[0])
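
With the example arrays above (d is 4 x 9), these come out to (a quick check of the arithmetic):

```python
import numpy as np

idx0 = np.array([1, 3, 2, 5])
idx1 = np.array([5, 8, 6, 7])
n_rows, n_cols = 4, 9  # shape of d in the question

offsets = n_cols * np.arange(n_rows)  # [0, 9, 18, 27]
idx0r = idx0 + offsets                # [ 1, 12, 20, 32]
idx1r = idx1 + offsets                # [ 5, 17, 24, 34]
```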

To obtain a linear index of ranges to sum, you can do

idx = (np.stack((idx0, idx1), axis=-1) + d.shape[1] * np.arange(d.shape[0])[:, None]).ravel()

The reason that you want to have a 1D array for this is that you need to be able to remove the last element if it is equal to d.size, since np.add.reduceat does not accept indices past the end of the array:

if idx[-1] == d.size:
    idx = idx[:-1]

This last operation does not copy the data: it just creates a view that is one element short.

You can now sum directly:

sums = np.add.reduceat(d.ravel(), idx)[::2]

Taking every other element is required because reduceat also produces the sums between the n-th element of idx1 and the (n+1)-st element of idx0; the [::2] stride discards those.
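On the sample data, idx interleaves the start and stop positions of every row's range, so reduceat returns both the wanted sums and the gap sums between them:

```python
import numpy as np

idx0 = np.array([1, 3, 2, 5])
idx1 = np.array([5, 8, 6, 7])
d = np.tile(np.arange(1, 10), (4, 1))

idx = (np.stack((idx0, idx1), axis=-1)
       + d.shape[1] * np.arange(d.shape[0])[:, None]).ravel()
# idx is [1, 5, 12, 17, 20, 24, 32, 34]: starts and stops interleaved.

sums = np.add.reduceat(d.ravel(), idx)[::2]  # keep only the [start, stop) sums
```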

Now you can normalize by the lengths of each segment:

means = sums / (idx1 - idx0)

TL;DR

idx = (np.stack((idx0, idx1), axis=-1) + d.shape[1] * np.arange(d.shape[0])[:, None]).ravel()
if idx[-1] == d.size:
    idx = idx[:-1]
means = np.add.reduceat(d.ravel(), idx)[::2] / (idx1 - idx0)
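
Running the TL;DR on the question's data (assuming the 4 x 9 d from the question) reproduces the desired result:

```python
import numpy as np

idx0 = np.array([1, 3, 2, 5])
idx1 = np.array([5, 8, 6, 7])
d = np.tile(np.arange(1, 10), (4, 1))

idx = (np.stack((idx0, idx1), axis=-1)
       + d.shape[1] * np.arange(d.shape[0])[:, None]).ravel()
if idx[-1] == d.size:
    idx = idx[:-1]  # reduceat rejects an index equal to the array length
means = np.add.reduceat(d.ravel(), idx)[::2] / (idx1 - idx0)
```

means comes out as array([3.5, 6. , 4.5, 6.5]).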
Mad Physicist

Another way you can do it is masking instead of indexing. First you create the mask:

mask = np.zeros((d.shape[0], d.shape[1] + 1), np.int8)
mask[np.arange(d.shape[0]), idx0] = 1
mask[np.arange(d.shape[0]), idx1] = -1
mask = mask.cumsum(axis=1)[:, :-1].astype(bool)
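
For the sample arrays, the cumulative sum turns the +1/-1 markers into contiguous runs of True between each row's idx0 and idx1 (a quick check):

```python
import numpy as np

idx0 = np.array([1, 3, 2, 5])
idx1 = np.array([5, 8, 6, 7])
n_rows, n_cols = 4, 9  # shape of d in the question

mask = np.zeros((n_rows, n_cols + 1), np.int8)
mask[np.arange(n_rows), idx0] = 1    # +1 marks where a run starts
mask[np.arange(n_rows), idx1] = -1   # -1 marks one past where it ends
mask = mask.cumsum(axis=1)[:, :-1].astype(bool)
# Row 0 is True at columns 1..4, row 1 at 3..7, and so on;
# each row has exactly idx1 - idx0 True entries.
```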

The initial array is created with one extra column to accommodate the case where idx1 contains elements equal to d.shape[1].

There are a number of approaches you can take going forward.

  1. The first is probably simplest: using a masked array:

    means = np.ma.array(d, mask=~mask).mean(axis=1)
    
  2. Another approach is to pre-multiply the data by the mask and compute your own sums:

    means = (d * mask).sum(axis=1) / (idx1 - idx0)
    
  3. A third approach is to apply the mask and use np.add.reduceat similarly to how the other answer suggests, but on the masked data, which avoids summing elements that are never needed:

    n = (idx1 - idx0)
    means = np.add.reduceat(d[mask], np.r_[0, n[:-1].cumsum()]) / n
    
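All three variants give the same result on the question's data; a quick side-by-side check:

```python
import numpy as np

idx0 = np.array([1, 3, 2, 5])
idx1 = np.array([5, 8, 6, 7])
d = np.tile(np.arange(1, 10), (4, 1))

mask = np.zeros((d.shape[0], d.shape[1] + 1), np.int8)
mask[np.arange(d.shape[0]), idx0] = 1
mask[np.arange(d.shape[0]), idx1] = -1
mask = mask.cumsum(axis=1)[:, :-1].astype(bool)

m1 = np.ma.array(d, mask=~mask).mean(axis=1)                  # 1. masked array
m2 = (d * mask).sum(axis=1) / (idx1 - idx0)                   # 2. pre-multiplied sums
n = idx1 - idx0
m3 = np.add.reduceat(d[mask], np.r_[0, n[:-1].cumsum()]) / n  # 3. reduceat on masked data
```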
Mad Physicist