I have a matrix X with tens of rows and thousands of columns. All elements are categorical and have been converted to an index matrix. For example, the ith column X(:,i) = [-1,-1,0,2,1,2]' is converted to X2(:,i) = ic, where [x,ia,ic] = unique(X(:,i)), so that it can be used conveniently with the function accumarray. I randomly select a submatrix from the matrix and count how often each unique value occurs in each column of the submatrix. I perform this procedure 10,000 times. I know several methods for counting the unique values in a column; the fastest way I have found so far is shown below:
    mx = max(X);   % largest index in each column (X is the index matrix here)
    for iter = 1:numperm
        for j = 1:ny
            ky = yrand(:,iter)==uy(j);
            % select the rows of X whose permuted label equals uy(j)
            Xk = X(ky,:);
            % specify the rows of pxs where the counts for class j are stored
            mxj = mx*(j-1);
            mxi = mxj+1;
            mxk = max(Xk)+mxj;
            % count the occurrences of each value in every column of the submatrix
            for i = 1:c
                pxs(mxi(i):mxk(i),i) = accumarray(Xk(:,i),1);
            end
        end
    end
This is a way to perform a random permutation test for the information gain between a data matrix X of size n by c and a categorical variable y, in which y is randomly permuted. In the code above, all random permutations of y are stored in the matrix yrand, and the number of permutations is numperm. The unique values of y are stored in uy, and their number is ny. In each iteration of 1:numperm, the submatrix Xk is selected for each unique element of y, and the occurrences of each unique value in each column of this submatrix are counted and stored in the matrix pxs.
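For concreteness, the index conversion and the per-column count can be illustrated on the example column from the top of the question. This is only a toy sketch; the names xi, u and cnt are made up for the illustration and do not appear in the code above.

```matlab
xi = [-1,-1,0,2,1,2]';       % example column from the question
[u,~,ic] = unique(xi);       % u = [-1;0;1;2], ic = [1;1;2;4;3;4]
cnt = accumarray(ic,1);      % cnt = [2;1;1;2], occurrences of each value in u
```

Because ic contains only positive integers, it can be passed to accumarray as subscripts directly, which is the reason for the conversion.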
The most time-consuming part of the code above is the loop over i = 1:c when c is large.
Is it possible to apply the function accumarray in a matrix manner to avoid the for loop? How else can I improve the code above?
-------
As requested, a simplified test function containing the code above is provided below:
    %% test
    function test(x,y)
    [r,c] = size(x);
    x2 = x;
    numperm = 1000;
    % convert the original matrix to index matrix for suitable and fast use of accumarray function
    for i = 1:c
        [~,~,ic] = unique(x(:,i));
        x2(:,i) = ic;
    end
    % get 'numperm' rand permutations of y
    yrand(r, numperm) = 0;
    for i = 1:numperm
        yrand(:,i) = y(randperm(r));
    end
    % get statistic of y
    uy = unique(y);
    nuy = numel(uy);
    % main iterations
    mx = max(x2);
    pxs(max(mx),c) = 0;
    for iter = 1:numperm
        for j = 1:nuy
            ky = yrand(:,iter)==uy(j);
            xk = x2(ky,:);
            mxj = mx*(j-1);
            mxk = max(xk)+mxj;
            mxi = mxj+1;
            for i = 1:c
                pxs(mxi(i):mxk(i),i) = accumarray(xk(:,i),1);
            end
        end
    end
And some test data:
    x = round(randn(60,3000));
    y = [ones(30,1);ones(30,1)*-1];
Testing the function with
    tic; test(x,y); toc
returns "Elapsed time is 15.391628 seconds." on my computer. The test function uses 1000 permutations, so if I perform 10,000 permutations and do some additional computations (which are negligible compared to the code above), I expect it to take more than 150 s. I wonder whether the code can be improved. Intuitively, applying accumarray in a matrix manner could save a lot of time. Can I do that?
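One form that such a matrix-style call could take is sketched below. accumarray accepts a two-column subscript matrix, so the value index and the column index of every element can be passed together and all columns are counted in a single call. This is only a sketch under assumptions: it uses a smaller toy matrix, handles one permutation, and uses a common block height M for every class instead of the per-column offsets mx*(j-1) used above, so the layout of pxs differs. The names M, g, gp, rows and cols are introduced here for illustration, and whether this is actually faster would have to be timed.

```matlab
% Toy data in the same form as the test data (smaller size, assumed).
x = round(randn(60,300));
y = [ones(30,1); -ones(30,1)];
[r,c] = size(x);

% Index matrix, built as in the test function.
x2 = zeros(r,c);
for i = 1:c
    [~,~,x2(:,i)] = unique(x(:,i));
end
M = max(x2(:));              % common number of rows reserved per class (assumption)

[uy,~,g] = unique(y);        % g: class index (1..ny) of every row
ny = numel(uy);
cols = repmat(1:c, r, 1);    % column index of every element of x2

gp   = g(randperm(r));                     % one random permutation of the labels
rows = bsxfun(@plus, x2, (gp-1)*M);        % shift each row into the block of its class
pxs  = accumarray([rows(:), cols(:)], 1, [M*ny, c]);

% Sanity check: every column must account for all r rows.
assert(all(sum(pxs,1) == r));
```

With this layout, pxs((j-1)*M + v, i) holds the number of rows whose permuted label is uy(j) and whose index in column i is v. Passing the size [M*ny, c] explicitly also means every entry of pxs is rewritten on each call, so no counts from an earlier permutation can be left behind.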