
Gist

Basically, I want to increase the dimension of two axes of an n-dimensional tensor. For some reason this operation is very slow on bigger tensors. If someone can give me a reason or a better method, I'd be very happy.

Goal

Going from (4, 8, 8, 4, 4, 4, 4, 4, 16, 8, 4, 4, 1) to (4, 32, 8, 4, 4, 4, 4, 4, 4, 8, 4, 4, 1) takes roughly 170 seconds. I'd like to improve on that. Below is a small example; finding the correct indices is not necessary here.

Example Code

Increase dimension (0,2) of tensor

import numpy as np

tensor = np.arange(16).reshape(2, 2, 4, 1)
I = np.identity(4)

I tried 3 different methods:

np.kron

indices = [1, 3, 0, 2]
result = np.kron(I, tensor.transpose(indices)).transpose(np.argsort(indices))
print(result.shape)  # should be (8, 2, 16, 1)

manual stacking

col = []
for i in range(4):
    row = [np.zeros_like(tensor)] * 4
    row[i] = tensor
    col.append(row)
result = np.array(col).transpose(0, 2, 3, 1, 4, 5).reshape(8, 2, 16, 1)
print(result.shape)  # should be (8, 2, 16, 1)

np.einsum

result = np.einsum("ij, abcd -> iabjcd", I, tensor).reshape(8, 2, 16, 1)
print(result.shape)  # should be (8, 2, 16, 1)
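
As a sanity check, all three methods should produce the same block-diagonal embedding on the small example. A minimal self-contained sketch (the names via_kron, via_einsum, via_stack are just illustrative):

import numpy as np

tensor = np.arange(16).reshape(2, 2, 4, 1)
I = np.identity(4)
indices = [1, 3, 0, 2]

via_kron = np.kron(I, tensor.transpose(indices)).transpose(np.argsort(indices))
via_einsum = np.einsum("ij, abcd -> iabjcd", I, tensor).reshape(8, 2, 16, 1)

col = []
for i in range(4):
    row = [np.zeros_like(tensor)] * 4
    row[i] = tensor
    col.append(row)
via_stack = np.array(col).transpose(0, 2, 3, 1, 4, 5).reshape(8, 2, 16, 1)

# All three embed `tensor` as diagonal blocks along the enlarged axes.
assert np.array_equal(via_kron, via_einsum)
assert np.array_equal(via_stack, via_einsum)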

Results

On my machine, on the big example with complex entries, they performed as follows:

  1. np.einsum ~ 170s
  2. manual stacking ~ 185s
  3. np.kron ~ 580s
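
The timings above were measured on the full-size complex tensor; a timeit harness like the following (a sketch, not the exact setup used) gives a quick relative comparison on the small example:

import timeit
import numpy as np

tensor = np.arange(16).reshape(2, 2, 4, 1).astype(complex)
I = np.identity(4)

n = 10_000
t_einsum = timeit.timeit(
    lambda: np.einsum("ij, abcd -> iabjcd", I, tensor).reshape(8, 2, 16, 1),
    number=n,
)
print(f"einsum: {t_einsum / n * 1e6:.1f} µs per call")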
  • 1) NumPy is not optimized to deal with arrays in 13 dimensions (which is clearly not reasonable). 2) Your array appears to be huge, like 8 GiB, since you use complex numbers. 3) All your operations seem to involve a transposition, which is known to be very expensive on modern hardware. Additionally, do not expect anyone to optimize a 13D transposition (since >4D transpositions are already insane to optimize)... – Jérôme Richard Mar 24 '22 at 19:48
  • So basically it comes down to memory layout, correct? I have a similarly sized example which works just fine. – gistBatch Mar 24 '22 at 20:12
  • Have you tried allocating the result with `zeros` and assigning the original into it? – user7138814 Mar 25 '22 at 21:19
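
A sketch of user7138814's suggestion (illustrative code, not from the thread): allocate the zero-filled result directly in its final six-axis layout and assign the input into the diagonal blocks with advanced indexing, which avoids both the multiplication by the identity and any large transposition:

import numpy as np

tensor = np.arange(16).reshape(2, 2, 4, 1)

# Six-axis view (i, a, b, j, c, d) of the target, allocated in final order.
out = np.zeros((4, 2, 2, 4, 4, 1), dtype=tensor.dtype)
diag = np.arange(4)
out[diag, :, :, diag] = tensor  # write tensor into every i == j block
result = out.reshape(8, 2, 16, 1)
print(result.shape)  # (8, 2, 16, 1)

Since each element is written exactly once, in the output's own memory order, no transposition of the big array is needed.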

1 Answer


As Jérôme pointed out:

all your operations seem to involve a transposition, which is known to be very expensive on modern hardware.

I reworked my algorithm so that, after some preprocessing steps, it no longer relies on the dimensional increase. This indeed speeds up the overall process substantially.
