
Assume I have an array of items [1, 2, ..., n] and an array of probabilities [p1, p2, ..., pn], where n is large and may reach the thousands. The probabilities sum to 1.

  • I need to select 3 unique items at random each time; an item with a higher probability should have a higher chance of being selected.

  • I need to repeat the selection more than 20k times.

I've implemented a working method by creating a new array that repeats each item in proportion to its probability. For example, if the probabilities for item1, item2, and item3 are [2/7, 4/7, 1/7] respectively, then the new array will contain [1,1,2,2,2,2,3].

It works, but it's not efficient. This method can also select the same item more than once, in which case I have to reselect, which costs more time.

Are there any efficient methods or built-in functions in MATLAB for this purpose?

Wolfie
userInThisWorld
  • I think this answers your question: https://stackoverflow.com/a/13914141/1011724 – Dan Feb 26 '18 at 11:54
  • Also, for the *pick without replacement* issue, if `n` is large and you're only picking 3 values, you probably just want to check for repetitions and repick if found. – Steve Feb 26 '18 at 11:58
  • Possible duplicate of [Generate random number with given probability matlab](https://stackoverflow.com/questions/13914066/generate-random-number-with-given-probability-matlab) – Steve Feb 26 '18 at 12:03
  • You _specify probabilities_ for the items, but you also want to sample _without replacement_. With those two requirements, it seems difficult to avoid re-picking. For example, [`randsample`](https://es.mathworks.com/help/stats/randsample.html) can handle either requirement, but not both at the same time. Note also that in your current method, if the sampled values are not unique you should start over and repick _all_ items, otherwise the probabilities are not guaranteed. – Luis Mendo Feb 26 '18 at 12:11
  • This is not a dupe of the linked Q&A, because of the no-replacement requirement here. – Luis Mendo Feb 26 '18 at 12:17
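
The check-and-repick idea from the comments above could look roughly like the following. This is only a sketch under stated assumptions: the values of `n`, `k` and `p` are made up for illustration, and the whole set of picks is redrawn whenever a repetition occurs, as suggested in the comments.

```matlab
% Illustrative inputs (assumptions, not taken from the question)
n = 1000;                 % number of items
k = 3;                    % items to pick per selection
p = rand(1, n);           % made-up positive weights
p = p / sum(p);           % normalise so the probabilities sum to 1

edges = [0, cumsum(p)];   % bin edges of the cumulative distribution
edges(end) = 1;           % guard against round-off in the last edge

done = false;
while ~done
    % Draw k item indices independently (i.e. with replacement)
    picks = discretize(rand(1, k), edges);
    % Accept only if all k picks are distinct, otherwise redraw all of them
    done = numel(unique(picks)) == k;
end
```

When `n` is in the thousands and no single probability dominates, repetitions should be rare, so the loop usually finishes on the first pass.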

2 Answers


Your initial arrays

 x = [1, 2, 3];   % 1:n, where n = 3
 p = [2, 4, 1]/7; % probabilities of choosing each element

You can choose an element with given probability using this:

 r = rand; % get random number in range (0,1)
 xi = x(find(cumsum(p) >= r, 1)); % Get x where cumulative probability >= random number

You want to choose without replacement, so let's remove the element each loop

k = 2;           % number of choices
r = rand(k,1);   % random numbers
xi = zeros(k,1); % output choices
for ii = 1:k     % choices loop
    % Choose x with probability of each element contained in p
    idx = find(cumsum(p) >= r(ii), 1);
    xi(ii) = x(idx);
    % Remove item from lists
    x(idx) = []; p(idx) = [];
    % Rescale probabilities
    p = p/sum(p);
end

With this method, duplicate entries in x are treated as independent items, each with its own specified p value.

Note: if you want to do this selection N times, use temporary copies of x and p, so that elements removed during one selection are still available for the next. Better still, vectorise the selection loop and make x and p into N×n arrays, removing one element from each row each time.
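
As a rough illustration of the temporary-copy approach (not the vectorised one), here is a minimal sketch. The example values of `x`, `p`, `k` and `N`, and the names `xt`, `pt` and `result`, are assumptions made for illustration. Instead of rescaling the probabilities after each removal, the random draw is scaled by the remaining total, which is equivalent and avoids an empty `find` result caused by round-off.

```matlab
x = 1:5;                          % illustrative items (assumption)
p = [0.1, 0.2, 0.3, 0.25, 0.15];  % illustrative probabilities (assumption)
k = 3;                            % picks per selection
N = 20000;                        % number of selections

result = zeros(N, k);             % one row per selection
for jj = 1:N
    xt = x;  pt = p;              % temporary copies, so x and p stay intact
    for ii = 1:k
        c = cumsum(pt);
        % Scale the draw by the remaining total instead of renormalising pt
        idx = find(c >= rand * c(end), 1);
        result(jj, ii) = xt(idx);
        % Remove the chosen item so it cannot be picked again in this row
        xt(idx) = [];  pt(idx) = [];
    end
end
```

Each row of `result` then holds `k` distinct items, and `x` and `p` are unchanged after the loop.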

Wolfie

The function datasample does what you ask for if you pass your probability array via the optional name-value argument 'Weights'. To get unique items, also set 'Replace' to false.
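
A minimal usage sketch, assuming the Statistics and Machine Learning Toolbox is available. The values of `x` and `p` are made up for illustration, and `'Replace', false` is added here to meet the question's uniqueness requirement.

```matlab
% Requires the Statistics and Machine Learning Toolbox
x = 1:5;                          % illustrative items (assumption)
p = [0.1, 0.2, 0.3, 0.25, 0.15];  % illustrative probabilities (assumption)

% Draw 3 distinct items, weighted by p
picks = datasample(x, 3, 'Replace', false, 'Weights', p);
```

For the 20k repetitions in the question, this call would simply be placed inside a loop.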

Note: datasample uses the old function histc to generate the sample, whereas the newer histcounts is recommended. This only matters if you need better efficiency. Quote from the documentation:

histc is not recommended. Use HISTCOUNTS instead.

Nicky Mattsson