Short of using clever mathematical tricks to reduce the number of operations, the best shot is to optimize the memory access. That is: avoid subsref
ing, increase the locality of your code, reduce the cache misses by manipulating short arrays instead of large ones.
n = 50;
T = 1000;
p = 10;
Y = zeros(n,n,T);
B = zeros(n,n,p);
A = rand(n);
for t = 1:p
Y(:,:,t) = rand(n);
B(:,:,t) = rand(n);
end
fprintf('Original attempt: '); tic;
for t=(p+1):T
Y(:,:,t) = A;
for k=1:p
Y(:,:,t) = Y(:,:,t) + B(:,:,k)*Y(:,:,t-k);
end;
end;
toc;
%'This solution was taken from Divakar'
fprintf('Using reshaping: '); tic;
Br = reshape(B(:,:,1:p),n,[]);
for t=(p+1):T
Yr = reshape(permute(Y(:,:,t-(1:p)),[1 3 2]),[],n);
Y(:,:,t) = A + Br*Yr;
end;
toc;
%'proposed solution'
Y = cell(1,T);
B = cell(1,p);
A = rand(n);
for t = 1:p
Y{t} = rand(n);
B{t} = rand(n);
end
fprintf('Using cells: '); tic;
for t=(p+1):T
U = A;
for k=1:p
U = U + B{k}*Y{t-k};
end;
Y{t} = U;
end;
toc;
For setups given in my example I get a two-fold speed increase for a decent machine (i5 + 4Gb, MATLAB R2012a). I am curious how well it does on your machine.