Efficient binary sampling from vector of probability distribution vectors in MatLab

Question

I'm tidying up some digit classification code. I feed in an image of a digit, say "7", and get out 10 probabilities that sum to 1. If my algorithm is working well, the element corresponding to 7 should have the highest value.

An added complication is that I'm working with batches of 100 items, so I actually have a ROWxCOL = 100x10 matrix where each row sums to 1.

Now I wish to sample from each of these 100 distributions, i.e. I need to produce a vector like [0 0 0 1 0 0 0 0 0 0] (that would be a 3) for each batch item according to my probability distribution.

The existing implementation is:

samp = pd*0;
layers = cumsum( pd, 2 );
randoms = rand( batchSize, 1 );
for k = 1:batchSize
    index = find( randoms(k) <= layers(k,:),  1 );
    samp( k, index ) = 1;
end
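To make the loop body concrete, here is a single row worked by hand. The three-class distribution and the fixed value standing in for the rand draw are made up for illustration; they are not from my real data.

```matlab
% One row of the batch, written out by hand (values are illustrative only).
pdRow  = [0.1 0.2 0.7];        % a 3-class distribution that sums to 1
layers = cumsum(pdRow);        % roughly [0.1 0.3 1.0]: upper edge of each class's interval
r      = 0.25;                 % stands in for one draw from rand
index  = find(r <= layers, 1); % first interval whose upper edge is at least r

sampRow = zeros(size(pdRow));
sampRow(index) = 1;            % one-hot encoding of the sampled class
disp(sampRow)
```

Since 0.25 lies above the first edge and below the second, the second class is picked. A class with a wider interval is hit proportionally more often, which is why this samples according to pd.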

However, I would prefer to avoid explicit looping (as I have read that it often causes poor performance).

Efficiency is key, as this routine gets executed in the tightest loops.

How to accomplish this efficiently?

EDIT: I will attempt to answer my own question. I'm posting it in case someone can improve upon the answer (there is nearly always more than one way to skin a cat in MatLab), and because it may be a useful snippet for somebody.

EBH · Answer 1 · 2016-10-06T09:40:47.823

1

Here is a way to avoid the loop:

% preparing some data:
batchSize = 100;
probs = [ones(1,9)*0.01 0.9];
pd = zeros(batchSize,10);
for k = 1:batchSize
    pd(k,:) = probs(randperm(10));
end

% the actual answer:
layers = cumsum(pd,2);
randoms = rand(batchSize,1);
index = 11-cumsum((layers-repmat(randoms,1,10))>0,2);
samp = bsxfun(@eq,index(:,end),1:10);
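Why this works: each row of layers is non-decreasing, so the entries larger than the random draw form a block at the right-hand end of the row. Counting them and subtracting from 11 gives the position of the first one. Below is a minimal sketch that checks this against the loop from the question, using the same draws for both. The fixed seed, the random test data and the variable names sampLoop, sampVec and nAbove are my own additions, and I use a plain sum in place of taking the last column of the cumsum, which should give the same count.

```matlab
rng(0);                                  % fixed seed so the check is repeatable
batchSize = 100;
nOutputs  = 10;
pd = rand(batchSize, nOutputs);
pd = bsxfun(@rdivide, pd, sum(pd, 2));   % normalise each row to sum to 1

layers  = cumsum(pd, 2);
randoms = rand(batchSize, 1);

% Reference: the loop from the question.
sampLoop = zeros(size(pd));
for k = 1:batchSize
    idx = find(randoms(k) <= layers(k,:), 1);
    sampLoop(k, idx) = 1;
end

% Vectorised: count the columns whose cumulative sum exceeds the draw.
nAbove  = sum(bsxfun(@gt, layers, randoms), 2);
index   = nOutputs + 1 - nAbove;         % position of the first such column
sampVec = bsxfun(@eq, index, 1:nOutputs);

fprintf('results agree: %d\n', isequal(sampLoop, double(sampVec)));
```

Note that sampVec is a logical array, whereas the loop produces doubles; wrap it in double() if the rest of the code needs numeric type.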

edited Oct 06 '16 at 09:40

answered Oct 06 '16 at 08:57


score 0 · Answer 2 · answered Oct 03 '16 at 16:29

The following seems to work:

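function sample = sampleFromPDs( pd )
    % pd: batchSize-by-nOutputs matrix whose rows each sum to 1
    [batchSize_, nOutputs] = size( pd );

    % true wherever the cumulative probability exceeds that row's random draw
    bools = cumsum(pd,2) > repmat( rand(batchSize_,1), 1, nOutputs );

    % e.g. 001 111 gives (6+1) - 4 = 3
    indexOfFirstONE = (nOutputs+1) - sum(bools, 2);

    % set one element per row, addressed by linear index
    sample = zeros( size(pd) );
    sample( ...
        sub2ind( size(pd), (1:batchSize_)', indexOfFirstONE ) ...
        ) = 1;
end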

However, I'm a little concerned that MatLab might be reallocating memory each iteration (when in reality it is always called with the same argument dimensions).
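I have not measured whether those allocations matter, so the honest way to settle it is to time both versions on representative input. Here is a rough tic/toc sketch. The repetition count nReps and the random test data are my own choices, and the vectorised branch inlines the same counting idea but builds the output with bsxfun instead of sub2ind.

```matlab
% Rough timing sketch; nReps and the random test data are assumptions.
batchSize = 100;
nOutputs  = 10;
nReps     = 1000;
pd = rand(batchSize, nOutputs);
pd = bsxfun(@rdivide, pd, sum(pd, 2));   % each row now sums to 1

% Loop version from the question.
tic
for rep = 1:nReps
    sampLoop = zeros(size(pd));
    layers   = cumsum(pd, 2);
    randoms  = rand(batchSize, 1);
    for k = 1:batchSize
        idx = find(randoms(k) <= layers(k,:), 1);
        sampLoop(k, idx) = 1;
    end
end
tLoop = toc;

% Vectorised version (same counting idea, inlined here).
tic
for rep = 1:nReps
    bools   = bsxfun(@gt, cumsum(pd, 2), rand(batchSize, 1));
    index   = (nOutputs + 1) - sum(bools, 2);
    sampVec = bsxfun(@eq, index, 1:nOutputs);
end
tVec = toc;

fprintf('loop: %.4f s, vectorised: %.4f s over %d batches\n', tLoop, tVec, nReps);
```

The numbers will vary with MATLAB version and machine, so treat them as a guide for this batch size only.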
