Consider using the freeware gem aliastable. It’s a ruby implementation of the alias method published by A.J. Walker back in 1974. Note: I am the author.
Update: Based on feedback in the comments (thanks @engineersmnky!), the gem now gracefully handles discrete uniform distributions—where all probabilities are equal—based on case 1 described below. It has also been updated to generate values using lazy enumeration.
When the distribution contains unequal probabilities, the alias method requires O(n) time to create a table with n outcomes, then O(1) time per value to generate. By contrast, constructing a CDF from the probability mass function and then doing a binary search for values would take O(n) and O(log n) time respectively.
This little demo shows how to use the gem:
require 'aliastable'
values = %w(S A B C)
probs = [1r/100, 29r/100, 3r/10, 2r/5] # r => Rational
my_distribution = AliasTable.new(values, probs)
p my_distribution.take(1_000_000).tally
produces results such as
{"C"=>400361, "B"=>300121, "A"=>289636, "S"=>9882}
Description of Alias Method Logic
Building the table
When trying to generate values from a discrete set of n items, there are two cases which can do the job in constant time:
- There are n outcomes, all of which have equal probability 1/n. Choose
values[rand(n)]
.
- n = 2 and the two outcomes have different probabilities. The binary choice can be made using a simple if/else statement.
The alias method builds a table with three rows, primary
, alias
, and primary_probability
, and n columns. Each outcome is assigned to a column as that column’s primary
value, and its probability is entered as primary_probability
. The alias
for that column is initialized to nil
. The column can then be classified as belonging to one of three sets based on the probability associated with the primary relative to the average probability 1/n:
- Deficit set:
primary_probability < 1/n
- Surplus set:
primary_probability > 1/n
- Parity set:
primary_probability = 1/n
Note that the total amount of “missing” probability in the deficit set is exactly equal to the total amount of “excess” probability in the surplus set, so the table construction balances the books by stealing probability from one of the surplus columns to bring one of the deficit columns up to parity. The primary
value of the column that is stolen from is added as the alias
for the the current deficit column. Conditional probability is used to calculate what proportion of that column's probability is owned by the primary value and adjust primary_probability
accordingly. The column is then migrated to the parity set. Additionally, the amount of stolen probability has to be decremented from the selected surplus column, which might leave it still having a surplus, or might change its classification to deficit or parity. If the classification is changed, migrate it to the appropriate set.
Repeat this process until all columns are in the parity set. The table construction algorithm is guaranteed to terminate in fewer than n iterations since each iteration migrates one deficit column, and possibly one surplus column as well, to the parity set.
Using the table
Once the table construction is complete we have n columns, each of which has an equal total amount of probability. Each column has either a primary value, or a primary and an alias, and the conditional probability for picking the primary given we landed in that column. The generating algorithm is then:
- Pick any column with equal probability.
- Based on the conditional probability of picking the primary given that we landed in this column, return either the primary or the alias.
The coding logic is:
column = rand(n)
if rand < primary_probability[column]
return primary[column]
else
return alias[column]
end
Since both steps are constant time, the overall time to generate an outcome is also constant.