I am writing a python package that performs various complex statistical analysis tasks along an arbitrary axis of an arbitrarily-shaped numpy array.

Currently, so that the array shape and axis can be arbitrary, I just permute the array so the axis of interest is placed on the far RHS, and squash the LHS axes into one. For example, if the array shape is (3,4,5) and we want to perform some operation along axis 1, it is transformed into the shape (15,4), the operation is performed along axis -1, then it is transformed back into the shape (3,4,5) and returned by the function.
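For concreteness, here is a minimal sketch of that transformation with `np.moveaxis` and `reshape` (variable names are illustrative, not from the package):

```python
import numpy as np

a = np.random.rand(3, 4, 5)
axis = 1

# Move the axis of interest to the end: (3, 4, 5) -> (3, 5, 4)
b = np.moveaxis(a, axis, -1)
# Squash the leading axes into one: (3, 5, 4) -> (15, 4)
flat = b.reshape(-1, b.shape[-1])

# ... perform the operation along axis -1 on `flat` ...

# Transform back: (15, 4) -> (3, 5, 4) -> (3, 4, 5)
out = np.moveaxis(flat.reshape(b.shape), -1, axis)
```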

I feel this approach may be unnecessarily slow because of all these array manipulations. Is there a way that I can cleanly iterate over all but one dimension of the array? That is, in the above example this would go [0,:,0], [0,:,1], ..., [2,:,3], [2,:,4], but again this should work for arbitrary array shape and axis position.
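For what it's worth, that iteration pattern can be written with `np.ndindex` over the reduced shape (`iter_along` is a hypothetical helper name, just a sketch of the idea):

```python
import numpy as np

def iter_along(a, axis):
    # Yield each 1-D slice of `a` along `axis`, in the order
    # [0,:,0], [0,:,1], ..., [2,:,3], [2,:,4] for a (3,4,5) array with axis=1
    axis = axis % a.ndim
    reduced = a.shape[:axis] + a.shape[axis + 1:]
    for idx in np.ndindex(*reduced):
        yield a[idx[:axis] + (slice(None),) + idx[axis:]]

a = np.arange(3 * 4 * 5).reshape(3, 4, 5)
lanes = list(iter_along(a, 1))
```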

Maybe np.ndenumerate, np.ndindex, and np.take can be used for this somehow?


Edit: Is there a way to do this with np.nditer? Perhaps this can match the speed of permuting/reshaping.
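One way to drive the same loop with `np.nditer` directly is the `multi_index` flag over a dummy array of the reduced shape, since `np.ndindex` is itself built on `nditer` (a sketch; I don't know whether it beats permuting/reshaping without timing it):

```python
import numpy as np

a = np.arange(3 * 4 * 5).reshape(3, 4, 5)
axis = 1

# nditer over a dummy array of the reduced shape just supplies the indices
reduced = a.shape[:axis] + a.shape[axis + 1:]
it = np.nditer(np.empty(reduced, dtype=np.int8), flags=['multi_index'])
lanes = []
for _ in it:
    i = it.multi_index
    lanes.append(a[i[:axis] + (slice(None),) + i[axis:]])
```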

Luke Davis
  • `np.apply_along_axis` might work for you. But look at its code: it transposes the desired axis to the end, uses `ndindex` to generate the indices, iterates over them, and then transposes the result back. Sounds a lot like your approach. – hpaulj Apr 29 '19 at 21:47
  • 1
    Your permute and reshape isn't that expensive. In some cases it's just a view, though in general it will produce a copy; compared to the iteration, that will probably be cheap. But do some timings; don't just go by feelings. The `apply...` docs claim it is faster than an equivalent using a split `ndindex`, but I don't know if the difference is significant. I've recommended against it in simple 2d cases where the extra baggage is unnecessary. – hpaulj Apr 29 '19 at 21:55
  • Can you please provide us with some sample arrays to work with? In particular, the function that you wish to apply over this axis. The axis permutation is not an expensive operation, either in terms of memory or time, so we'll have to time it and see, as @hpaulj mentioned. – kmario23 Apr 29 '19 at 22:06
  • 1
    @kmario23 Took a crack at it and you guys were right, permutation and reshaping were indeed much faster (see below answer). This really surprised me. Would have thought in some situations it would be slower, i.e. when RAM is a bottleneck... but this doesn't seem to be the case. – Luke Davis Apr 29 '19 at 22:33
  • 1
    @LukeDavis If you're really concerned about memory and want to avoid copying during a reshape, please take a look at the [Notes in Numpy reshape](https://docs.scipy.org/doc/numpy/reference/generated/numpy.reshape.html). It provides a mechanism to raise error when copying of data happens. – kmario23 Apr 29 '19 at 22:41
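A concrete illustration of the mechanism that comment refers to (not from the original thread): assigning to the `.shape` attribute, instead of calling `reshape`, raises an error whenever the reshape would require a copy.

```python
import numpy as np

a = np.random.rand(3, 4, 5)
b = np.moveaxis(a, 1, -1)  # strided view with shape (3, 5, 4)

# b cannot be flattened to (15, 4) as a view, so in-place shape
# assignment raises AttributeError instead of silently copying
try:
    b.shape = (15, 4)
except AttributeError:
    print("reshape of this view would require a copy")

# whereas np.reshape silently makes the copy and succeeds
c = np.reshape(b, (15, 4))
```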

1 Answer

Turns out just transposing and reshaping is indeed faster. So I guess the answer is: don't iterate; permuting and reshaping, as I was already doing, is preferable.

Here's the code from my project.

# Benchmark
import numpy as np

f = lambda x: x # can change this to any arbitrary function
def test1(data, axis=-1):
    # Test the lead flatten approach
    data, shape = lead_flatten(permute(data, axis))
    output = np.empty(data.shape)
    for i in range(data.shape[0]): # iterate over the flattened leading dimensions
        output[i,:] = f(data[i,:]) # apply the (arbitrarily complex) function
    return unpermute(lead_unflatten(output, shape), axis)
def test2(data, axis=-1):
    # Test the new approach
    output = np.empty(data.shape)
    for d,o in zip(iter_1d(data, axis), iter_1d(output, axis)):
        o[...] = f(d)
    return output

# Iterator class
class iter_1d(object):
    def __init__(self, data, axis=-1):
        axis = (axis % data.ndim) # e.g. for 3D array, -1 becomes 2
        self.data = data
        self.axis = axis
    def __iter__(self):
        shape = (s for i,s in enumerate(self.data.shape) if i!=self.axis)
        self.iter = np.ndindex(*shape)
        return self
    def __next__(self):
        idx = list(next(self.iter)) # Python 3: next(iterator), not iterator.next()
        idx.insert(self.axis, slice(None))
        return self.data[tuple(idx)] # index with a tuple; list indices are deprecated

# Permute and reshape functions
def lead_flatten(data, nflat=None):
    shape = list(data.shape)
    if nflat is None:
        nflat = data.ndim-1 # all but last dimension
    if nflat<=0: # just add a singleton dimension
        return data[None,...], shape
    return np.reshape(data, (np.prod(data.shape[:nflat]).astype(int), *data.shape[nflat:]), order='C'), shape # row-major (C order)

def lead_unflatten(data, shape, nflat=None):
    if nflat is None:
        nflat = len(shape) - 1 # all but last dimension
    if nflat<=0: # we artificially added a singleton dimension; remove it
        return data[0,...]
    if data.shape[0] != np.prod(shape[:nflat]):
        raise ValueError(f'Number of leading elements {data.shape[0]} does not match product of leading shape {shape[:nflat]}.')
    if not all(s1==s2 for s1,s2 in zip(data.shape[1:], shape[nflat:])):
        raise ValueError(f'Trailing dimensions on data, {data.shape[1:]}, do not match trailing dimensions on new shape, {shape[nflat:]}.')
    return np.reshape(data, shape, order='C')

def permute(data, source=-1, destination=-1):
    data = np.moveaxis(data, source, destination)
    return data

def unpermute(data, source=-1, destination=-1):
    data = np.moveaxis(data, destination, source)
    return data

And here are results from some %timeit operations.

import numpy as np
a = np.random.rand(10,20,30,40)
%timeit -r10 -n10 test1(a, axis=2) # around 12ms
%timeit -r10 -n10 test2(a, axis=2) # around 22ms
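For completeness (not part of the original benchmark), `np.apply_along_axis`, which hpaulj mentioned in the comments, is a third baseline worth timing; internally it does the same transpose-to-end dance:

```python
import numpy as np

a = np.random.rand(10, 20, 30, 40)
f = lambda x: x  # same placeholder function as above

# apply f to every 1-D slice along axis 2
out = np.apply_along_axis(f, 2, a)
```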
Luke Davis