How can I express this large number of computations without for loops?

Question

I work primarily in MATLAB but I think the answer should not be too hard to carry over from one language to another.

I have a multi-dimensional array X with dimensions [n, p, 3]. I would like to calculate the following multi-dimensional array.

T = zeros(p, p, p)
for i = 1:p
for j = 1:p
for k = 1:p
T(i, j, k) = sum(X(:, i, 1) .* X(:, j, 2) .* X(:, k, 3));
end
end
end

The sum is of the elements of a length-n vector. Any help is appreciated!

For easy portability you can simply use loops. In Matlab the Jit compiler is getting better and better, Python has also such aproaches eg. https://stackoverflow.com/a/50890174/4045774 . In C or Fortran you would also use simple loops. But keep the array order in mind (Matlab, Fortran, Julia -> column major order, C, Python usually row major order). You could also improve the Matlab version, if you flip the array dims of T (You coded it as it would be a C or Python function) — max9111, Sep 18 '18 at 13:12
@max9111 I actually got a good speed increase with vectorizing the inner loop, as suggested by Cris. (I am working with `n = 6500` and `p = 300`.) I'll give switching the dimensions a try. — Open Season, Sep 18 '18 at 13:21

score 5 · Accepted Answer · answered Sep 17 '18 at 23:22

5

You only need some permuting of dimensions and multiplication with singleton expansion:

T = sum(bsxfun(@times, bsxfun(@times, permute(X(:,:,1), [2 4 5 3 1]), permute(X(:,:,2), [4 2 5 3 1])), permute(X(:,:,3), [4 5 2 3 1])), 5);

From R2016b onwards, this can be written more easily as

T = sum(permute(X(:,:,1), [2 4 5 3 1]) .* permute(X(:,:,2), [4 2 5 3 1]) .* permute(X(:,:,3), [4 5 2 3 1]), 5);

answered Sep 17 '18 at 23:22

Luis Mendo

110,752
13
76
147

I like the vectorization but for some reason the for loop is about 4 times faster. – Open Season Sep 18 '18 at 00:38
3

@OpenSeason: loops are no longer slow in MATLAB. Vectorization still saves some overhead, but the differences are not nearly as large as they used to be. This vectorization solution uses `permute`, which copies the data in a different order. The overhead of copying is more important than the overhead of a loop. I would suggest you keep using your loops, and be satisfied with the idea that they are no longer the speed penalty they once were. Maybe you can vercorize the inner loop only for a small advantage? – Cris Luengo Sep 18 '18 at 02:03
@OpenSeason: I was wrong, vectorization does seem to yield large speedups on my system. I see Luis' solution being 20x faster than the original code with `n=50, p=100`. If you see this being slower, it is likely that you have a much larger array. I have posted an alternative solution that vectorizes only the inner loop. – Cris Luengo Sep 18 '18 at 04:42
Your answer is perfectly completed by @CrisLuengo. My answer is completely out. Just for argument sake do you have a way to transform tensor notation to permutation formula like yours, or any reference, article ? – PilouPili Sep 18 '18 at 11:02
1

@NPE Not really. The trick is just to shift dimensions so that if you need all tuples involving some of the original dimensions, those dimensions are moved to different dimension indices (so the original 2nd dim here is shifted to 1st, 2nd and 3rd). Then you perform the desired computation, which involves singleton expansion to implicitly create the tuples and some operation (sum here) that usually reduces some dimensions (5th here) to singletons. Lastly, you may need to shift dimensions again to move those singletons to the right (not needed here, as it was done in the first step already) – Luis Mendo Sep 18 '18 at 11:29

Cris Luengo · Answer 2 · 2018-09-18T04:46:09.537

As I mentioned in a comment, vectorization is not always a huge advantage any more. Therefore there are vectorization methods that slow down the code rather than speed it up. You must always time your solutions. Vectorization often involves the creation of large temporary arrays, or the copy of large amounts of data, which are avoided in loop code. It depends on the architecture, the size of the input, and many other factors if such a solution is going to be faster.

Nonetheless, in this case it seems vectorization approaches can yield a large speedup.

The first thing to notice about the original code is that X(:, i, 1) .* X(:, j, 2) gets re-computed in the inner loop, though it is a constant value there. Rewriting the inner loop as this will save time:

Y = X(:, i, 1) .* X(:, j, 2);
for k = 1:p
   T(i, j, k) = sum(Y .* X(:, k, 3));
end

Now we notice that the inner loop is a dot product, and can be written as follows:

Y = X(:, i, 1) .* X(:, j, 2);
T(i, j, :) = Y.' * X(:, :, 3);

The .' transpose on Y does not copy the data, as Y is a vector. Next, we notice that X(:, :, 3) is indexed repeatedly. Let's move this out of the outer loop. Now I'm left with the following code:

T = zeros(p, p, p);
X1 = X(:, :, 1);
X2 = X(:, :, 2);
X3 = X(:, :, 3);
for i = 1:p
   for j = 1:p
      Y = X1(:, i) .* X2(:, j);
      T(i, j, :) = Y.' * X3;
   end
end

It is likely that removing the loop over j is equally easy, which would leave a single loop over i. But this is where I stop.

This is the timings I see (R2017a, 3-year old iMac with 4 cores). For n=10, p=20:

original:                     0.0206
moving Y out the inner loop:  0.0100
removing inner loop:          0.0016
moving indexing out of loops: 7.6294e-04
Luis' answer:                 1.9196e-04

For a larger array with n=50, p=100:

original:                     2.9107
moving Y out the inner loop:  1.3488
removing inner loop:          0.0910
moving indexing out of loops: 0.0361
Luis' answer:                 0.1417

"Luis' answer" is this one. It is by far fastest for small arrays, but for larger arrays it shows the cost of the permutation. Moving the computation of the first product out of the inner loop saves a bit over half the computation cost. But removing the inner loop reduces the cost quite dramatically (which I hadn't expected, I presume the single matrix product can use parallelism better than the many small element-wise products). We then get a further time reduction by reducing the amount of indexing operations within the loop.

This is the timing code:

function so()
n = 10; p = 20;
%n = 50; p = 100; 
X = randn(n,p,3);

T1 = method1(X);
T2 = method2(X);
T3 = method3(X);
T4 = method4(X);
T5 = method5(X);

assert(max(abs(T1(:)-T2(:)))<1e-13)
assert(max(abs(T1(:)-T3(:)))<1e-13)
assert(max(abs(T1(:)-T4(:)))<1e-13)
assert(max(abs(T1(:)-T5(:)))<1e-13)

timeit(@()method1(X))
timeit(@()method2(X))
timeit(@()method3(X))
timeit(@()method4(X))
timeit(@()method5(X))

function T = method1(X)
p = size(X,2);
T = zeros(p, p, p);
for i = 1:p
   for j = 1:p
      for k = 1:p
         T(i, j, k) = sum(X(:, i, 1) .* X(:, j, 2) .* X(:, k, 3));
      end
   end
end

function T = method2(X)
p = size(X,2);
T = zeros(p, p, p);
for i = 1:p
   for j = 1:p
      Y = X(:, i, 1) .* X(:, j, 2);
      for k = 1:p
         T(i, j, k) = sum(Y .* X(:, k, 3));
      end
   end
end

function T = method3(X)
p = size(X,2);
T = zeros(p, p, p);
for i = 1:p
   for j = 1:p
      Y = X(:, i, 1) .* X(:, j, 2);
      T(i, j, :) = Y.' * X(:, :, 3);
   end
end

function T = method4(X)
p = size(X,2);
T = zeros(p, p, p);
X1 = X(:, :, 1);
X2 = X(:, :, 2);
X3 = X(:, :, 3);
for i = 1:p
   for j = 1:p
      Y = X1(:, i) .* X2(:, j);
      T(i, j, :) = Y.' * X3;
   end
end

function T = method5(X)
T = sum(permute(X(:,:,1), [2 4 5 3 1]) .* permute(X(:,:,2), [4 2 5 3 1]) .* permute(X(:,:,3), [4 5 2 3 1]), 5);

Divakar · Answer 3 · 2018-09-18T06:07:21.987

You have mentioned you are open to other languages and NumPy by its syntax is very close to MATLAB, so we will try to have a NumPy based solution on this.

Now, these tensor related sum-reductions, specially matrix multiplications ones are easily expressed as einstein-notation and NumPy luckily has one function on the same as np.einsum. Under the hoods, it's implemented in C and is pretty efficient. Recently it's been optimized further to leverage BLAS based matrix-multiplication implementations.

So, a translation of the stated code onto NumPy territory keeping in mind that it follows 0-based indexing and the axes are visuallized differently than the dimensions with MATLAB, would be -

import numpy as np

# X is a NumPy array of shape : (n,p,3). So, a random one could be
# generated with : `X = np.random.rand(n,p,3)`.

T = np.zeros((p, p, p))
for i in range(p):
    for j in range(p):
        for k in range(p):
            T[i, j, k] = np.sum(X[:, i, 0] * X[:, j, 1] * X[:, k, 2])

The einsum way to solve it would be -

np.einsum('ia,ib,ic->abc',X[...,0],X[...,1],X[...,2])

To leverage matrix-multiplication, use optimize flag -

np.einsum('ia,ib,ic->abc',X[...,0],X[...,1],X[...,2],optimize=True)

Timings (with large sizes)

In [27]: n,p = 100,100
    ...: X = np.random.rand(n,p,3)

In [28]: %%timeit
    ...: T = np.zeros((p, p, p))
    ...: for i in range(p):
    ...:     for j in range(p):
    ...:         for k in range(p):
    ...:             T[i, j, k] = np.sum(X[:, i, 0] * X[:, j, 1] * X[:, k, 2])
1 loop, best of 3: 6.23 s per loop

In [29]: %timeit np.einsum('ia,ib,ic->abc',X[...,0],X[...,1],X[...,2])
1 loop, best of 3: 353 ms per loop

In [31]: %timeit np.einsum('ia,ib,ic->abc',X[...,0],X[...,1],X[...,2],optimize=True)
100 loops, best of 3: 10.5 ms per loop

In [32]: 6230.0/10.5
Out[32]: 593.3333333333334

Around 600x speedup there!

To be fair I believe you should do the same operations as @CrisLuengo did in his answer in the `for` solution to compare with `einsum`. — PilouPili, Sep 18 '18 at 11:05
@NPE I trusted that CrisLuengo did the same operations as OP did. I am doing the same operations as OP, with the only difference in the way NumPy treats axes against dimensions in MATLAB. — Divakar, Sep 18 '18 at 11:18
Just saying T = np.zeros((p, p, p)) X1 = X[:, : ,0] X2 = X[:, :, 1] X3 = X[:, :, 2] for i in range(p): for j in range(p): tmp = X1[:, i] * X2[:, j] T[i,j,:] = np.dot(np.transpose(tmp), X3[:], ) return T is actually more like 250 ms rather than 6s which compares to einsum without the BLAS. However the optimize=True to use the BLAS is magic ! Wonderful what the np.dot function can do for you. — PilouPili, Sep 18 '18 at 12:02

How can I express this large number of computations without for loops?

3 Answers3