As I mentioned in a comment, vectorization is not always a huge advantage any more. Therefore there are vectorization methods that slow down the code rather than speed it up. You must always time your solutions. Vectorization often involves the creation of large temporary arrays, or the copy of large amounts of data, which are avoided in loop code. It depends on the architecture, the size of the input, and many other factors if such a solution is going to be faster.
Nonetheless, in this case it seems vectorization approaches can yield a large speedup.
The first thing to notice about the original code is that X(:, i, 1) .* X(:, j, 2)
gets re-computed in the inner loop, though it is a constant value there. Rewriting the inner loop as this will save time:
Y = X(:, i, 1) .* X(:, j, 2);
for k = 1:p
T(i, j, k) = sum(Y .* X(:, k, 3));
end
Now we notice that the inner loop is a dot product, and can be written as follows:
Y = X(:, i, 1) .* X(:, j, 2);
T(i, j, :) = Y.' * X(:, :, 3);
The .'
transpose on Y
does not copy the data, as Y
is a vector. Next, we notice that X(:, :, 3)
is indexed repeatedly. Let's move this out of the outer loop. Now I'm left with the following code:
T = zeros(p, p, p);
X1 = X(:, :, 1);
X2 = X(:, :, 2);
X3 = X(:, :, 3);
for i = 1:p
for j = 1:p
Y = X1(:, i) .* X2(:, j);
T(i, j, :) = Y.' * X3;
end
end
It is likely that removing the loop over j
is equally easy, which would leave a single loop over i
. But this is where I stop.
This is the timings I see (R2017a, 3-year old iMac with 4 cores). For n=10, p=20
:
original: 0.0206
moving Y out the inner loop: 0.0100
removing inner loop: 0.0016
moving indexing out of loops: 7.6294e-04
Luis' answer: 1.9196e-04
For a larger array with n=50, p=100
:
original: 2.9107
moving Y out the inner loop: 1.3488
removing inner loop: 0.0910
moving indexing out of loops: 0.0361
Luis' answer: 0.1417
"Luis' answer" is this one. It is by far fastest for small arrays, but for larger arrays it shows the cost of the permutation. Moving the computation of the first product out of the inner loop saves a bit over half the computation cost. But removing the inner loop reduces the cost quite dramatically (which I hadn't expected, I presume the single matrix product can use parallelism better than the many small element-wise products). We then get a further time reduction by reducing the amount of indexing operations within the loop.
This is the timing code:
function so()
n = 10; p = 20;
%n = 50; p = 100;
X = randn(n,p,3);
T1 = method1(X);
T2 = method2(X);
T3 = method3(X);
T4 = method4(X);
T5 = method5(X);
assert(max(abs(T1(:)-T2(:)))<1e-13)
assert(max(abs(T1(:)-T3(:)))<1e-13)
assert(max(abs(T1(:)-T4(:)))<1e-13)
assert(max(abs(T1(:)-T5(:)))<1e-13)
timeit(@()method1(X))
timeit(@()method2(X))
timeit(@()method3(X))
timeit(@()method4(X))
timeit(@()method5(X))
function T = method1(X)
p = size(X,2);
T = zeros(p, p, p);
for i = 1:p
for j = 1:p
for k = 1:p
T(i, j, k) = sum(X(:, i, 1) .* X(:, j, 2) .* X(:, k, 3));
end
end
end
function T = method2(X)
p = size(X,2);
T = zeros(p, p, p);
for i = 1:p
for j = 1:p
Y = X(:, i, 1) .* X(:, j, 2);
for k = 1:p
T(i, j, k) = sum(Y .* X(:, k, 3));
end
end
end
function T = method3(X)
p = size(X,2);
T = zeros(p, p, p);
for i = 1:p
for j = 1:p
Y = X(:, i, 1) .* X(:, j, 2);
T(i, j, :) = Y.' * X(:, :, 3);
end
end
function T = method4(X)
p = size(X,2);
T = zeros(p, p, p);
X1 = X(:, :, 1);
X2 = X(:, :, 2);
X3 = X(:, :, 3);
for i = 1:p
for j = 1:p
Y = X1(:, i) .* X2(:, j);
T(i, j, :) = Y.' * X3;
end
end
function T = method5(X)
T = sum(permute(X(:,:,1), [2 4 5 3 1]) .* permute(X(:,:,2), [4 2 5 3 1]) .* permute(X(:,:,3), [4 5 2 3 1]), 5);