Count the number of unique values for each column of a submatrix in a fast manner

Question

I have a matrix X with tens of rows and thousands of columns, all elements are categorical and re-organized to an index matrix. For example, ith column X(:,i) = [-1,-1,0,2,1,2]' is converted to X2(:,i) = ic of [x,ia,ic] = unique(X(:,i)), for convenient use of function accumarray. I randomly selected a submatrix from the matrix and counted the number of unique values of each column of the submatrix. I performed this procedure 10,000 times. I know several methods for counting number of unique values in a column, the fasted way I found so far is shown below:

mx = max(X);
for iter = 1:numperm
    for j = 1:ny
        ky = yrand(:,iter)==uy(j);
        % select submatrix from X where all rows correspond to rows in y that y equals to uy(j)
        Xk = X(ky,:);
        % specify the sites where to put the number of each unique value
        mxj = mx*(j-1);
        mxi = mxj+1;
        mxk = max(Xk)+mxj;
        % iteration to count number of unique values in each column of the submatrix
        for i = 1:c
            pxs(mxi(i):mxk(i),i) = accumarray(Xk(:,i),1);
        end
    end
end

This is a way to perform random permutation test to calculate information gain between a data matrix X of size n by c and categorical variable y, under which y is randomly permutated. In above codes, all randomly permutated y are stored in matrix yrand, and the number of permutations is numperm. The unique values of y are stored in uy and the unique number is ny. In each iteration of 1:numperm, submatrix Xk is selected according to the unique element of y and number of unique elements in each column of this submatrix is counted and stored in matrix pxs.

The most time costly section in the above code is the iterations of i = 1:c for large c.

Is it possible to perform the function accumarray in a matrix manner to avoid for loop? How else can I improve the above code?

-------

As requested, a simplified test function including above codes is provided as

%% test
function test(x,y)

[r,c] = size(x);
x2 = x;
numperm = 1000;

% convert the original matrix to index matrix for suitable and fast use of accumarray function
for i = 1:c
    [~,~,ic] = unique(x(:,i));
    x2(:,i) = ic;
end

% get 'numperm' rand permutations of y
yrand(r, numperm) = 0;
for i = 1:numperm
    yrand(:,i) = y(randperm(r));
end

% get statistic of y
uy = unique(y);
nuy = numel(uy);

% main iterations
mx = max(x2);
pxs(max(mx),c) = 0;
for iter = 1:numperm
    for j = 1:nuy
        ky = yrand(:,iter)==uy(j);
        xk = x2(ky,:);
        mxj = mx*(j-1);
        mxk = max(xk)+mxj;
        mxi = mxj+1;
        for i = 1:c
            pxs(mxi(i):mxk(i),i) = accumarray(xk(:,i),1);
        end
    end
end

And a test data

x = round(randn(60,3000));
y = [ones(30,1);ones(30,1)*-1];

Test the function

tic; test(x,y); toc

return Elapsed time is 15.391628 seconds. in my computer. In the test function, 1000 permutations is set. So if I perform 10,000 permutation and do some additional computations (are negligible comparing to the above code), time more than 150 s is expected. I think whether the code can be improved. Intuitively, perform accumarray in a matrix manner can save lots of time. Can I?

Your description is a bit hard to follow.. Could you please add some code to the question that generates an input similar to what you have (e.g. using `rng(42528955)` and `randi`), and **show** the expected output ([mcve])? Questions dealing with the improvement of existing code should have, in my opinion, some functioning baseline code. — Dev-iL, Mar 01 '17 at 10:57
@Dev-iL Thanks. A simple function with a testing data is added. — Elkan, Mar 01 '17 at 12:24
You can use `hist` or `histcount`. Assume I have a 5 * 5 array and I want to find histcount of each column. `a=randi(10,5,5);h=hist(a,1:10)` — rahnema1, Mar 01 '17 at 13:35
@rahnema1 This is excellent! It runs five times faster than mine if the loop `for i = 1:c` is replaced by the single line `pxs(mmx*(j-1)+1:mmx*j,:)=hist(x2(ky,:), xbin)`, where `xbin=(0.5:1:max(mx)-0.5)'`. Thanks. — Elkan, Mar 01 '17 at 18:11
But I still wonder whether there exists a built-in function to do same thing. This would be even better. — Elkan, Mar 01 '17 at 18:19
@rahnema1 would you please post your solution as an answer? Elkan - you can do it too - just answer your own question (you don't have to accept the answer). — Dev-iL, Mar 02 '17 at 14:45
@Dev-iL to answer the question I should read whole of the question. I answered title of the question in the comments but it seems that more extra things is considered in the question. — rahnema1, Mar 02 '17 at 15:20
@Dev-iL Only a few revision is made to my code which have been mentioned in my comment, i.e., just change the `for i=1:c;...; end` loop to a single line `pxs(mmx*(j-1)+1:mmx*j,:)=hist(x2(ky,:), xbin)`. New update is use function `histc`. I don't know whether this is enough to be shown as an answer. — Elkan, Mar 03 '17 at 08:03

score 0 · Answer 1 · answered Mar 05 '17 at 09:21

The way suggested by @rahnema1 has significantly improved the calculations, so I posted my answer here, as also requested by @Dev-iL.

%% test
function test(x,y)

[r,c] = size(x);
x2 = x;
numperm = 1000;

% convert the original matrix to index matrix for suitable and fast use of accumarray function
for i = 1:c
    [~,~,ic] = unique(x(:,i));
    x2(:,i) = ic;
end

% get 'numperm' rand permutations of y
yrand(r, numperm) = 0;
for i = 1:numperm
    yrand(:,i) = y(randperm(r));
end

% get statistic of y
uy = unique(y);
nuy = numel(uy);

% main iterations
mx = max(max(x2));
% preallocation
pxs(mx*nuy,c) = 0;
% set the edges of the bin for function histc
binrg = (1:mx)';
% preallocation of the range of matrix into which the results will be stored
mxr = mx*(0:nuy);
for iter = 1:numperm
    yt = yrand(:,iter);
    for j = 1:nuy
        pxs(mxr(j)+1:mxr(j),:) = histc(x2(yt==uy(j)),binrg);
    end
end

Test results:

>> x = round(randn(60,3000));
>> y = [ones(30,1);ones(30,1)*-1];
>> tic; test(x,y); toc
Elapsed time is 15.632962 seconds.
>> tic; test(x,y); toc % using the way suggested by rahnema1, i.e., revised function posted above
Elapsed time is 2.900463 seconds.

Count the number of unique values for each column of a submatrix in a fast manner

1 Answers1