1

I have two arrays A and B of the same dimension 1000 x 3 x 20 x 20. I want to generate a third array C of dimension 3 x 3 x 20 x 20 that would be an outcome of matrix multiplication of corresponding slices of A and B, i.e. C(:,:,i,j) = A(:,:,i,j)'*B(:,:,i,j). Then I need to transform array C to the new array D by inverting the corresponding 3 x 3 matrices, i.e. D(:,:,i,j) = inv(C(:,:,i,j)). Again, it's clear how to do this with loops. is there a way to awoid looping over 400 items?

Edit: The benchmarking code to compare performance of different solutions would be -

%// Inputs
n1 = 50;
n2 = 200;
A = rand(n1,3,n2,n2);
B = rand(n1,3,n2,n2);

%// A. CPU loopy code
tic
C = zeros(3,3,n2,n2);
for ii = 1:n2
    for jj = 1:n2
        C(:,:,ii,jj) = A(:,:,ii,jj)'*B(:,:,ii,jj); %//'
    end
end
toc

%// B. Vectorized code (using squeeze)
tic
C1 = squeeze(sum(bsxfun(@times,permute(A,[2 1 5 3 4]),permute(B,[5 1 2 3 4])),2));
toc

%// C. Vectorized code (avoiding squeeze)
tic
C2 = sum(bsxfun(@times,permute(A,[2 5 3 4 1]),permute(B,[5 2 3 4 1])),5);
toc

%// D. GPU vectorized code
tic
A = gpuArray(A);
B = gpuArray(B);
C3 = sum(bsxfun(@times,permute(A,[2 5 3 4 1]),permute(B,[5 2 3 4 1])),5);
C3 = gather(C3);
toc

Runtime results -

Elapsed time is 0.287511 seconds.
Elapsed time is 0.250663 seconds.
Elapsed time is 0.337628 seconds.
Elapsed time is 1.259207 seconds.
Divakar
  • 218,885
  • 19
  • 262
  • 358
Laimond
  • 199
  • 1
  • 7
  • 1
    take a look here http://www.mathworks.com/matlabcentral/fileexchange/8773-multiple-matrix-multiplications--with-array-expansion-enabled – Nishant Oct 28 '14 at 12:23
  • Wow! That's some useful and interesting runtime results. Thanks! – Divakar Oct 28 '14 at 15:06
  • Also, one more thing I forgot to mention was that you need to "Warmup" GPU before benchmarking. So, the simplest way to achieve that would be to run the benchmarking code as it is and run it again and observe the runtimes in that second run. The trusted method to benchmark GPU codes is with `gputimeit`, but that would complicate the codes, so staying away from that for now. – Divakar Oct 28 '14 at 15:21

1 Answers1

1

Code

%// Part - 1
C = sum(bsxfun(@times,permute(A,[2 5 3 4 1]),permute(B,[5 2 3 4 1])),5);

%// Part - 2: Use MATLAB file-exchange tool multinv
D = multinv(C);

The function code for multinv is available here and it claims to be pretty efficient.

For the first part, you can also try out this -

C = squeeze(sum(bsxfun(@times,permute(A,[2 1 5 3 4]),permute(B,[5 1 2 3 4])),2));

This one seems to be re-arranging the elements not as "disruptively" as the one mentioned in the code above, but the downside is the need for squeeze that might slow it down a bit. I would leave it to you and also encourage you to benchmark and select the better one.


Why bsxfun + GPU?

I have increased the loop limits, as that could be a real test between a loopy code and a vectorized code. So, here is the modified code for part 1 -

%// Inputs
n1 = 50;
n2 = 200;
A = rand(n1,3,n2,n2);
B = rand(n1,3,n2,n2);

%// A. CPU loopy code
tic
C = zeros(3,3,n2,n2);
for ii = 1:n2
    for jj = 1:n2
        C(:,:,ii,jj) = A(:,:,ii,jj)'*B(:,:,ii,jj); %//'
    end
end
toc

%// B. GPU vectorized code
tic
A = gpuArray(A);
B = gpuArray(B);
C1 = sum(bsxfun(@times,permute(A,[2 5 3 4 1]),permute(B,[5 2 3 4 1])),5);
C1 = gather(C1);
toc

The runtime results at my system were -

Elapsed time is 0.310056 seconds.
Elapsed time is 0.172499 seconds.

So, you see!

Community
  • 1
  • 1
Divakar
  • 218,885
  • 19
  • 262
  • 358
  • I still lack the intuition for how the bsxfun works with multidimensional arrays. So thank you very much for answering a similar question one more time and thanks for referring to multinv. – Laimond Oct 28 '14 at 13:24
  • 1
    @Laimond The play about `bsxfun` works very closely with `permute` that is used most of times (when working with `bsxfun`) to create singleton dimensions, which are expanded by `bsxfun` internally before performing the operations mentioned by the handle `@`. It takes time, but play around with it really! – Divakar Oct 28 '14 at 13:26
  • 2
    @Laimond Also, if you have a decent GPU, you would see the crazy benefits of working with bsxfun. – Divakar Oct 28 '14 at 13:27
  • It seems that the second solution is slightly faster. My GPU is a very basic one, GeForce 410M. Do you mean some kind of parallel computing, or you mean that if my GPU was better, the code would run faster? – Laimond Oct 28 '14 at 14:35
  • @Laimond Edited the solution with the GPU codes. I would be dearly interested in seeing results at your end too if you don't mind being bothered on this. – Divakar Oct 28 '14 at 15:01
  • The solution with `squeeze` function is the fastest on my system. I need to update my workstation... – Laimond Oct 28 '14 at 15:20
  • @Laimond BTW mine was with GTX 750 Ti and that could have made the difference I believe. – Divakar Oct 28 '14 at 15:24