
Background

My question is motivated by simple observations which somewhat undermine the beliefs and assumptions commonly held by experienced MATLAB users:

  • MATLAB is very well optimized when it comes to the built-in functions and the fundamental language features, such as indexing vectors and matrices.
  • Loops in MATLAB are slow (despite the JIT) and should generally be avoided if the algorithm can be expressed in a native, 'vectorized' manner.

The bottom line: core MATLAB functionality is efficient and trying to outperform it using MATLAB code is hard, if not impossible.

Investigating performance of vector indexing

The example codes shown below are as fundamental as it gets: I assign a scalar value to all vector entries. First, I allocate an empty vector x:

tic; x = zeros(1e8,1); toc
Elapsed time is 0.260525 seconds.

Having x I would like to set all its entries to the same value. In practice you would do it differently, e.g., x = value*ones(1e8,1), but the point here is to investigate the performance of vector indexing. The simplest way is to write:

tic; x(:) = 1; toc
Elapsed time is 0.094316 seconds.

Let's call it method 1 (named after the value assigned to x). It seems to be very fast - at least faster than the memory allocation. Because the only thing I do here is operate on memory, I can estimate the efficiency of this code by calculating the obtained effective memory bandwidth and comparing it to the hardware memory bandwidth of my computer:

eff_bandwidth = numel(x) * 8 bytes per double * 2 / time

In the above, I multiply by 2 because, unless SSE streaming stores are used, setting values in memory requires that the vector is both read from and written to memory. In the above example:

eff_bandwidth(1) = 1e8*8*2/0.094316 = 17 GB/s

STREAM-benchmarked memory bandwidth of my computer is around 17.9 GB/s, so indeed - MATLAB delivers close to peak performance in this case! So far, so good.
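For reference, here is the arithmetic spelled out (a quick Python sketch, only because it is convenient for the arithmetic; the numbers are the measurements from above):

```python
# Effective bandwidth estimate for method 1, using the measured time above.
n = 10**8            # number of doubles in x
bytes_per_double = 8
t = 0.094316         # measured time of x(:) = 1, in seconds

# Factor 2: without streaming stores, each element is both read and written.
eff_bandwidth = n * bytes_per_double * 2 / t
print(eff_bandwidth / 1e9)  # ~16.96, close to the 17.9 GB/s STREAM result
```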

Method 1 is suitable if you want to set all vector elements to some value. But if you want to access only every step-th entry, you need to substitute the : with, e.g., 1:step:end. Below is a direct speed comparison with method 1:

tic; x(1:end) = 2; toc
Elapsed time is 0.496476 seconds.

While you would not expect it to perform any differently, method 2 is clearly big trouble: a factor-5 slowdown for no reason. My suspicion is that in this case MATLAB explicitly allocates the index vector (1:end). This is somewhat confirmed by using an explicit vector size instead of end:

tic; x(1:1e8) = 3; toc
Elapsed time is 0.482083 seconds.

Methods 2 and 3 perform equally badly.

Another possibility is to explicitly create an index vector id and use it to index x. This gives you the most flexible indexing capabilities. In our case:

tic;
id = 1:1e8; % colon(1,1e8);
x(id) = 4;
toc
Elapsed time is 1.208419 seconds.

Now that is really something - a 12-fold slowdown compared to method 1! I understand that it should perform worse than method 1 because of the additional memory used for id, but why is it so much worse than methods 2 and 3?

Let's give loops a try - as hopeless as it may sound.

tic;
for i=1:numel(x)
    x(i) = 5;
end
toc
Elapsed time is 0.788944 seconds.

A big surprise - a loop beats the vectorized method 4, but it is still slower than methods 1, 2 and 3. It turns out that in this particular case you can do even better:

tic;
for i=1:1e8
    x(i) = 6;
end
toc
Elapsed time is 0.321246 seconds.

And that is probably the most bizarre outcome of this study - a MATLAB-written loop significantly outperforms native vector indexing. That should certainly not be so. Note that the JIT'ed loop is still 3 times slower than the theoretical peak almost attained by method 1, so there is still plenty of room for improvement. It is just surprising (a stronger word would be more suitable) that the usual 'vectorized' indexing (1:end) is even slower.

Questions

  • is simple indexing in MATLAB very inefficient (methods 2, 3, and 4 are slower than method 1), or did I miss something?
  • why is method 4 (so much) slower than methods 2 and 3?
  • why does using 1e8 instead of numel(x) as the loop bound speed up the code by a factor of 2?

Edit: After reading Jonas's comment, here is another way to do this using logical indices:

tic;
id = logical(ones(1, 1e8));
x(id) = 7;
toc
Elapsed time is 0.613363 seconds.

Much better than method 4.

For convenience:

function test

tic; x = zeros(1,1e8); toc

tic; x(:) = 1; toc
tic; x(1:end) = 2; toc
tic; x(1:1e8) = 3; toc

tic;
id = 1:1e8; % colon(1,1e8);
x(id) = 4;
toc

tic;
for i=1:numel(x)
    x(i) = 5;
end
toc

tic;
for i=1:1e8
    x(i) = 6;
end
toc

end
angainor
    +1. Keep on researching the mysterious ways of Matlab :) – Andrey Rubshtein Nov 14 '12 at 15:50
  • 1
    Very interesting. I added two more to your comparison: using `length(x)` instead of `numel(x)` is much slower. Using `x = x*6;` is as fast as method 1. – R. Schifini Nov 14 '12 at 16:40
  • @Cyrgo: unfortunately, your comment seems to have been cut off. – Jonas Nov 14 '12 at 16:44
  • @Cyrgo It seems `length` and `numel` give the same times for me. Maybe a different matlab version? I run 2011b. For the second idea, it would work if `x` is ones, but it does not use vector indexing, which is what I was looking at. As such, it will indeed be faster, but somewhat less general :) – angainor Nov 14 '12 at 16:52
  • 1
    @angainor: `true(1e8,1)` may be faster than the logical conversion. – Jonas Nov 14 '12 at 16:59
  • @Jonas Heh, I was looking for that :) Thanks. Alas, it is not faster :) – angainor Nov 14 '12 at 17:03
  • It is a bit faster for me (on R2012b), though there seems to have been a relative performance regression from 2011b to 2012b re:logical indexing. – Jonas Nov 14 '12 at 17:05
  • @angainor: I'm running version R2009a. – R. Schifini Nov 14 '12 at 19:14
  • 2
    @angainor Great question +1. No answer from me alas, but I thought I'd point out something interesting about Matlab's loops. Try putting `for z = 1:1` around each of your solutions. On my machine (R2012b) there is no effect on solutions 1 to 4, but an order of magnitude slowdown on 5 and 6. In other words, the JIT accelerator is still terrible at spotting inefficiencies in double loops, even if the outer loop has one iteration and serves no purpose! Oh, one other thing, maybe change your `i` subscripts. Technically, `i` is an in-built function. Avoid `j` for the same reason. :-) – Colin T Bowers Nov 15 '12 at 00:51
  • This is a ridiculously interesting analysis... Just goes to show that I need to profile my code rather than make assumptions about the slow parts. Extremely useful, and I wish I could give more upvotes – im so confused Nov 15 '12 at 15:50
  • Fascinating. @ColinTBowers: I was able to screw up the JIT on the loops similarly by just changing one of the `toc;` statements before the loops to `te = toc; fprintf('%-18s %0.3f\n', 'x(id)', te);`. Both the loops slowed down about 10x without directly touching the loop code. It looks like passing a variable to a simple non-builtin function is enough to bork the optimization, and you'd need to keep the loops tightly contained in their own functions if you wanted to use them in real code. (R2014a on OS X.) – Andrew Janke May 16 '14 at 04:38
  • 3
    @AndrewJanke Yes it is fascinating. That's the problem with proprietary software I suppose - the only way to discover and diagnose these oddities is through indirect experimentation. Matlab may have lovely syntax and a great IDE, but the whole closed source thing can be a pain sometimes... – Colin T Bowers May 17 '14 at 02:05

3 Answers


I can, of course, only speculate. However when I run your test with the JIT compiler enabled vs disabled, I get the following results:

 % with JIT   no JIT
    0.1677    0.0011 %# init
    0.0974    0.0936 %# #1 I added an assignment before this line to avoid issues with deferring
    0.4005    0.4028 %# #2
    0.4047    0.4005 %# #3
    1.1160    1.1180 %# #4
    0.8221   48.3239 %# #5 This is where "don't use loops in Matlab" comes from 
    0.3232   48.2197 %# #6
    0.5464   %# logical indexing

Dividing the no-JIT times by the JIT times shows where there is any speed increase:

% withoutJit./withJit
    0.0067 %# w/o JIT, the memory allocation is deferred
    0.9614 %# no JIT
    1.0057 %# no JIT
    0.9897 %# no JIT
    1.0018 %# no JIT
   58.7792 %# numel
  149.2010 %# no numel

The apparent speed-up on initialization happens because, with the JIT turned off, MATLAB appears to delay the memory allocation until the array is used, so x=zeros(...) does not really do anything (thanks, @angainor).

Methods 1 through 4 don't seem to benefit from the JIT. I guess that #4 could be slow due to additional input testing in subsref to make sure that the input is of the proper form.

The numel result could have something to do with it being harder for the compiler to deal with an uncertain number of iterations, or with some overhead due to checking whether the bound of the loop is ok (though no-JIT tests suggest only ~0.1s for that).

Surprisingly, on R2012b on my machine, logical indexing seems to be slower than #4.

I think that this goes to show, once again, that MathWorks have done great work in speeding up code, and that "don't use loops" isn't always best if you're trying to get the fastest execution time (at least at the moment). Nevertheless, I find that vectorizing is in general a good approach, since (a) the JIT fails on more complex loops, and (b) learning to vectorize makes you understand Matlab a lot better.

Conclusion: If you want speed, use the profiler, and re-profile if you switch Matlab versions. As pointed out by @Adriaan in the comments, nowadays it may be better to use timeit() to measure execution speed.


For reference, I used the following slightly modified test function

function tt = speedTest

tt = zeros(8,1);

tic; x = zeros(1,1e8); tt(1)=toc;

x(:) = 2;

tic; x(:) = 1; tt(2)=toc;
tic; x(1:end) = 2; tt(3)=toc;
tic; x(1:1e8) = 3; tt(4)=toc;

tic;
id = 1:1e8; % colon(1,1e8);
x(id) = 4;
tt(5)=toc;

tic;
for i=1:numel(x)
    x(i) = 5;
end
tt(6)=toc;

tic;
for i=1:1e8
    x(i) = 6;
end
tt(7)=toc;

%# logical indexing
tic;
id = true(1e8,1);
x(id)=7;
tt(8)=toc;
Jonas
  • 2
    The initialization "problem" comes from the fact that memory allocation and zeroing of the vector both take time. With JIT turned off it appears that MATLAB delays the memory allocation until it is used, so `x=zeros(...)` does not do anything really. The first time result is for `:`, not `1:end`. So MATLAB just moved the memory allocation to `x(:)=..`. If you add a dummy warmup line for `x` after allocation (`x(:)=0`), the times are the same. Except, of course, for the loops.. – angainor Nov 14 '12 at 16:16
  • As for runtime resolution of `numel`, of course some work has to be done. However, compared to the entire work it should be negligible. The same results are obtained if I do `len=numel(x)` before the `tic`, and use it as the loop bound. It looks to me like a performance bug. Much like the rest of those observations. – angainor Nov 14 '12 at 16:23
  • As for methods 2 through 4 - I think you are exactly right. They do not benefit from JIT, and they should. As I noted, it seems that the index vectors are explicitly allocated, filled with values and used in indexing! That is a crime in the modern world of memory bandwidth bounded computers! – angainor Nov 14 '12 at 16:25
  • @angainor: fixed the answer, and added some more speculation on #4. As a random aside: if you pass a logical vector, it is internally translated to an index vector, which means that the logical vector can be longer than the array in which you index, as long as the "overhang" is all false. – Jonas Nov 14 '12 at 16:28
  • @angainor: I think that there is similar checking overhead with any kind of "indirect" indexing. Anyway, interesting question! – Jonas Nov 14 '12 at 16:40
  • I was wrong, logical indices are faster than method 4. Thanks, I have updated my question. – angainor Nov 14 '12 at 16:51
  • 1
    I'm not sure about using the `profiler` when you go for speed nowadays (2020), as [it apparently switches off the JIT](https://mathworks.com/matlabcentral/answers/18576-profiler-biases-execution-time-of-optimized-code). By now, [`timeit()`](https://mathworks.com/help/matlab/ref/timeit.html) has been built-in; I'd suggest to use that to compare runtime more accurately. – Adriaan Nov 25 '20 at 09:40

I do not have an answer to all the problems, but I do have some refined speculations on methods 2, 3 and 4.

Regarding methods 2 and 3: it does indeed seem that MATLAB allocates memory for the vector of indices and fills it with values from 1 to 1e8. To understand this, let's see what is going on. By default, MATLAB uses double as its data type. Allocating the index array takes the same time as allocating x:

tic; x = zeros(1e8,1); toc
Elapsed time is 0.260525 seconds.

For now, the index array contains only zeros. Assigning values to the x vector in an optimal way, as in method 1, takes 0.094316 seconds. In addition, the index vector has to be read from memory so that it can be used in indexing. That is an additional 0.094316/2 seconds. Recall that in x(:)=1 the vector x has to be both read from and written to memory, so reading it alone takes half that time. Assuming this is all that is done in x(1:end)=value, the total time of methods 2 and 3 should be

t = 0.260525+0.094316+0.094316/2 = 0.402

That is almost correct, but not quite. I can only speculate, but filling the index vector with values is probably done as an additional step, which takes another 0.094316 seconds. Hence t = 0.4963, which more or less fits the time of methods 2 and 3.
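The same arithmetic, spelled out (a quick Python sketch; the component times are the measurements quoted above, and the read/write split and the extra filling pass are my speculation):

```python
alloc   = 0.260525       # allocate a 1e8-element double vector (zeros timing)
write_x = 0.094316       # read+write all of x (method 1 timing)
read_id = write_x / 2    # only reading the index vector: half of read+write
fill_id = 0.094316       # speculated extra pass that fills the index vector

print(alloc + write_x + read_id)            # ~0.402, the first estimate
print(alloc + write_x + read_id + fill_id)  # ~0.496, close to methods 2 and 3
```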

These are only speculations, but they do seem to confirm that MATLAB explicitly creates index vectors when doing native vector indexing. Personally, I consider this to be a performance bug. MATLAB's JIT compiler should be smart enough to understand this trivial construct and convert it to a call to the correct internal function. As it is now, on today's memory-bandwidth-bound architectures indexing performs at around 20% of the theoretical peak.

So if you do care about performance, you will have to implement x(1:step:end) as a MEX function, something like

set_value(x, 1, step, 1e8, value);

Now this is clearly illegal in MATLAB, since you are NOT ALLOWED to modify arrays in MEX files in place.

Edit Regarding method 4, one can try to analyze the performance of the individual steps as follows:

tic;
id = 1:1e8; % colon(1,1e8);
toc
tic
x(id) = 4;
toc

Elapsed time is 0.475243 seconds.
Elapsed time is 0.763450 seconds.

The first step, allocating and filling the values of the index vector, takes the same time as methods 2 and 3 alone. That seems to be way too much - it should take at most the time needed to allocate the memory and set the values (0.260525s + 0.094316s = 0.3548s), so there is an additional overhead of 0.12 seconds somewhere, which I can not understand. The second part (x(id) = 4) also looks very inefficient: it should take the time needed to set the values of x plus the time to read the id vector (0.094316s + 0.094316/2s = 0.1415s), plus some error checks on the id values. Programmed in C, the two steps take:

create id                              0.214259
x(id) = 4                              0.219768

The code used checks that a double index in fact represents an integer, and that it fits the size of x:

/* x, id, n and i are assumed declared elsewhere; tic()/toc() are simple timing helpers */
tic();
id = malloc(sizeof(double)*n);    /* allocate the index vector */
for(i=0; i<n; i++) id[i] = i;     /* fill it with the indices */
toc("create id");

tic();
for(i=0; i<n; i++) {
  long iid = (long)id[i];
  /* verify the double really is an integer index that fits in x */
  if(iid>=0 && iid<n && (double)iid==id[i]){
    x[iid] = 4;
  } else break;
}
toc("x(id) = 4");

The second step takes longer than the expected 0.1415s - that is due to the necessity of error checks on the id values. The overhead seems too large to me - maybe it could be written better. Still, the time required is 0.4340s, not 1.208419s. What MATLAB does under the hood - I have no idea. Maybe it is necessary to do it, I just don't see it.

Of course, using doubles as indices introduces two additional levels of overhead:

  • the size of a double is twice that of a uint32. Recall that memory bandwidth is the limiting factor here.
  • doubles need to be cast to integers for indexing
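To put the first point in numbers (a back-of-the-envelope Python sketch, not a measurement - just the extra memory traffic of reading the index vector once):

```python
n = 10**8
# Extra traffic from a single read of the index vector:
print(n * 8 / 1e6)  # double indices: 800.0 MB
print(n * 4 / 1e6)  # uint32 indices: 400.0 MB
```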

Method 4 can be written in MATLAB using integer indices:

tic;
id = uint32(1):1e8;
toc
tic
x(id) = 8;
toc

Elapsed time is 0.327704 seconds.
Elapsed time is 0.561121 seconds.

That clearly improves the performance by around 30% and shows that one should use integers as vector indices. However, the overhead is still there.

As I see it now, we cannot do anything to improve the situation working within the MATLAB framework, and we have to wait until MathWorks fixes these issues.

angainor
  • @Jonas I did not yet do that. I wanted to ask on SO to see if I overlooked something. I also hoped that someone from Mathworks may answer - there were a number of people from MW around here. A performance bug is not a usual bug. The code works and gives correct results. I expect that the answer from MW will be that this is a design feature of Matlab, although I expect that at least some of the problems described can be eliminated by improving the JIT. The index vectors do not need to be explicitly created. That is a simple loop. Or maybe they are not, and the overhead comes from elsewhere? – angainor Nov 15 '12 at 16:54
  • TMW people rarely talk about implementation details, even less so if it's off the record. Unfortunately, they rarely seem to be on SO anymore since they started a SO clone with Matlab Answers. You may get some generic answer such as "we're working on it" when you report the performance issue, but as long as they forward the issue to the developers, somebody knows that we know that there's work to be done. – Jonas Nov 15 '12 at 17:06
  • @Jonas I have submitted this [question to Matlab Central](http://www.mathworks.com/matlabcentral/answers/54522-why-is-indexing-vectors-matrices-in-matlab-very-inefficient). Maybe some engineers get interested. – angainor Nov 22 '12 at 16:38

Just a quick note to show how, over 8 years of development, the performance characteristics of MATLAB have changed a lot.

This is on R2017a (5 years after OP's post):

Elapsed time is 0.000079 seconds.    % x = zeros(1,1e8);
Elapsed time is 0.101134 seconds.    % x(:) = 1;
Elapsed time is 0.578200 seconds.    % x(1:end) = 2;
Elapsed time is 0.569791 seconds.    % x(1:1e8) = 3;
Elapsed time is 1.602526 seconds.    % id = 1:1e8; x(id) = 4;
Elapsed time is 0.373966 seconds.    % for i=1:numel(x), x(i) = 5; end
Elapsed time is 0.374775 seconds.    % for i=1:1e8, x(i) = 6; end

Note how the loop over 1:numel(x) is faster than indexing x(1:end): it seems that the array 1:end is still being created, whereas for the loop it is not. It is now better in MATLAB not to vectorize!

(I did add an assignment x(:)=0 after allocating the matrix, outside of any timed regions, to actually have the memory allocated, since zeros only reserves the memory.)


On MATLAB R2020b (online) (3 years later) I see these times:

Elapsed time is 0.000073 seconds.    % x = zeros(1,1e8);
Elapsed time is 0.084847 seconds.    % x(:) = 1;
Elapsed time is 0.084643 seconds.    % x(1:end) = 2;
Elapsed time is 0.085319 seconds.    % x(1:1e8) = 3;
Elapsed time is 1.393964 seconds.    % id = 1:1e8; x(id) = 4;
Elapsed time is 0.168394 seconds.    % for i=1:numel(x), x(i) = 5; end
Elapsed time is 0.169830 seconds.    % for i=1:1e8, x(i) = 6; end

x(1:end) is now optimized in the same way as x(:); the vector 1:end is no longer being explicitly created.

Cris Luengo