1

I have 3 classification rarity with probability something like this

class S has 10% probability

class A has 30% probability

class B has 60% probability

So i code like this

pool = ["S", "A", "A", "A", "B", "B", "B", "B", "B", "B"]
11.times do
 puts pool[rand(10) - 1]
end

and the result is quite correct in my guest (cmiiw)

B
A
B
B
S
B
A
S
B
A
B

but i become confuse when i should add more class and change S class probability to 1%

pool now become

class S has 1% probability

class A has 29% probability

class B has 30% probability

class C has 40% probability

i am not sure i should create pool like pool variable before because 1% is 1/10 is not integer.

Kindly need help, thanks!

pjs
  • 18,696
  • 4
  • 27
  • 56
WALVISK
  • 591
  • 4
  • 16
  • 4
    This might help: https://stackoverflow.com/questions/19261061/picking-a-random-option-where-each-option-has-a-different-probability-of-being (could be a duplicate) – Stefan Jan 19 '23 at 16:56

3 Answers3

6

Consider using the freeware gem aliastable. It’s a ruby implementation of the alias method published by A.J. Walker back in 1974. Note: I am the author.

Update: Based on feedback in the comments (thanks @engineersmnky!), the gem now gracefully handles discrete uniform distributions—where all probabilities are equal—based on case 1 described below. It has also been updated to generate values using lazy enumeration.

When the distribution contains unequal probabilities, the alias method requires O(n) time to create a table with n outcomes, then O(1) time per value to generate. By contrast, constructing a CDF from the probability mass function and then doing a binary search for values would take O(n) and O(log n) time respectively.

This little demo shows how to use the gem:

require 'aliastable'

values = %w(S A B C)
probs = [1r/100, 29r/100, 3r/10, 2r/5] # r => Rational
my_distribution = AliasTable.new(values, probs)

p my_distribution.take(1_000_000).tally

produces results such as

{"C"=>400361, "B"=>300121, "A"=>289636, "S"=>9882}


Description of Alias Method Logic

Building the table

When trying to generate values from a discrete set of n items, there are two cases which can do the job in constant time:

  1. There are n outcomes, all of which have equal probability 1/n. Choose values[rand(n)].
  2. n = 2 and the two outcomes have different probabilities. The binary choice can be made using a simple if/else statement.

The alias method builds a table with three rows, primary, alias, and primary_probability, and n columns. Each outcome is assigned to a column as that column’s primary value, and its probability is entered as primary_probability. The alias for that column is initialized to nil. The column can then be classified as belonging to one of three sets based on the probability associated with the primary relative to the average probability 1/n:

  • Deficit set: primary_probability < 1/n
  • Surplus set: primary_probability > 1/n
  • Parity set: primary_probability = 1/n

Note that the total amount of “missing” probability in the deficit set is exactly equal to the total amount of “excess” probability in the surplus set, so the table construction balances the books by stealing probability from one of the surplus columns to bring one of the deficit columns up to parity. The primary value of the column that is stolen from is added as the alias for the the current deficit column. Conditional probability is used to calculate what proportion of that column's probability is owned by the primary value and adjust primary_probability accordingly. The column is then migrated to the parity set. Additionally, the amount of stolen probability has to be decremented from the selected surplus column, which might leave it still having a surplus, or might change its classification to deficit or parity. If the classification is changed, migrate it to the appropriate set.

Repeat this process until all columns are in the parity set. The table construction algorithm is guaranteed to terminate in fewer than n iterations since each iteration migrates one deficit column, and possibly one surplus column as well, to the parity set.

Using the table

Once the table construction is complete we have n columns, each of which has an equal total amount of probability. Each column has either a primary value, or a primary and an alias, and the conditional probability for picking the primary given we landed in that column. The generating algorithm is then:

  • Pick any column with equal probability.
  • Based on the conditional probability of picking the primary given that we landed in this column, return either the primary or the alias.

The coding logic is:

column = rand(n)
if rand < primary_probability[column]
  return primary[column]
else
  return alias[column]
end

Since both steps are constant time, the overall time to generate an outcome is also constant.

pjs
  • 18,696
  • 4
  • 27
  • 56
  • Thank you for introducing me to this algorithm! You are retired? I thought you were just a kid. (I retired in the previous millennium.) – Cary Swoveland Jan 19 '23 at 19:39
  • 1
    You might want to disclose that this is your gem :-) – Stefan Jan 20 '23 at 08:14
  • i already read this solution, and try the gem. it works!. thank you kindly :). but i do not understand the algorithm. and i came back again the next day to read it again slowly, but still have difficulty to follow the algorithm. i want to make gem to calculate a probability. mind to share a book or any resource to learn about it? especially algorithm in your gem? that'll be amazing. thanks @pjs :)) – WALVISK Jan 20 '23 at 14:46
  • @pujidjayanto Several points: 1) Do you need to know the interior workings of a software library function to use it successfully? 2) Calculate a probability != generate values from a given distribution. If you want to do a different task than the one you asked about here, you should write a different question. 3) [Random variate generation in general is a huge and complex topic, well beyond the scope of a stackoverflow question/answer](http://luc.devroye.org/rnbookindex.html). – pjs Jan 20 '23 at 17:44
  • @pujidjayanto I've updated the answer to explain the logic of the algorithm. It still might not make sense to you depending on your comfort level with probability, and with conditional probabilities in particular. Remember, you can always look at the source code of the gem - it's not very long, and since programming languages are supposed to be unambiguous there are often times where the source is clearer than an English description of the logic. – pjs Jan 20 '23 at 21:59
  • I would report this as an issue but your gitlab seems private: `aliastable` 3.0.3 + cannot process even distributions e.g. `AliasTable.new((1..n),[1r/n].*(n))` – engineersmnky Jan 21 '23 at 02:53
  • 1
    @engineersmnky Version 4.0.0 of the gem has been uploaded to [RubyGems.org](https://rubygems.org/gems/aliastable). The docs are currently lagging, but are directly available from [rubydoc.info](https://rubydoc.info/gems/aliastable/AliasTable). I also migrated the repository to BitBucket, my ex-campus is a very quirky place and changed their policy on public access after I retired. – pjs Jan 25 '23 at 19:00
3
pool = 100.times.map do
  r = rand

  if r <= 0.01
    'S'
  elsif r <= 0.30
    'A'
  elsif r <= 0.6
    'B'
  else
    'D'
  end
end
p pool
p pool.tally

This would output something like

["D", "B", "D", ....]
{"D"=>39, "B"=>28, "A"=>31, "S"=>2}

You could also force rand to return an Integer:

r = rand(0..100)

and then check for integers

if r <= 1

or use a case statement and check for ranges like in the in the answer linked by Stefan.

Pascal
  • 8,464
  • 1
  • 20
  • 31
  • 1
    This of course only works for the specific probability distribution given in the question and does not generalize to deal with an arbitrary discrete probability distribution. – Cary Swoveland Jan 19 '23 at 18:53
  • This is a linear search through the outcomes, so it’s O(n) per value generated when there are n outcomes. You can improve the scaling constant by searching from highest probability to lowest (most likely to be found down to least likely), but it’s still O(n). – pjs Jan 20 '23 at 05:18
  • @engineersmnky For comparison purposes, the unix `time` command shows the enumerator version of the aliastable generating a million values in 0.49 sec, or 0.39 sec with `--yjit`. This is with Ruby 3.2.0 on an M1 MacBook Pro w/ Ventura. – pjs Jan 20 '23 at 18:00
  • 1
    @pj [Benchmarks](https://replit.com/@engineersmnky/ProbabilityComp#main.rb) using ruby libraries but this is under 2.7.4 because that is what the REPL currently has available. Your solution is still far superior. – engineersmnky Jan 20 '23 at 19:05
2

Just for fun I adpated @Pascal's answer to a class with an Enumerator

require 'bigdecimal'

class InfiniteProbabilityDrive 
  include Enumerable
  attr_reader :probabilities

  def initialize(probabilities)
    @probabilities = prioritize(probabilities)
    @enum = Enumerator.new do |y| 
       v = rand
       loop do 
         y << @probabilities.find {|element, stat|  v <= stat}&.first
         v = rand
       end 
    end 
    warn "#{(1 - @probabilities.values.last).*(100).abs.to_f}% chance of Improbability" unless @probabilities.values.last == 1
  end 

  def each(&block)
    @enum.each(&block)
  end 

  def next 
    @enum.next
  end

  private 
    def prioritize(probabilities)
      probabilities
        .select{|_k,v| v.positive?}
        .transform_values { |v| BigDecimal(v, Float::DIG + 1) }
        .sort_by { |_k, v| -v }
        .tap { |a| a.each_cons(2) { |b, c| c[1] = b[1] + c[1] } }
        .to_h
    end
end 

You can use as follows:

pb = InfiniteProbabilityDrive.new(s: 0.01, b: 0.6, a: 0.3, d: 0.09)
pb.take(10)
#=> [:a, :b, :d, :d, :b, :b, :b, :b, :d, :b]
# or just keep generating when needed 
pb.next 
#=> :a
pb.next 
#=> :b 

Running the same deterministics as @pjs answer

pb.take(1_000_000).tally
# => {:a=>300033, :b=>600362, :d=>89687, :s=>9918}
engineersmnky
  • 25,495
  • 2
  • 36
  • 52
  • 1
    pascal answer is more easy to understand for my level now, but your answer make me want to learn enumerator more. i think it is very useful for my case. i want to make gem for calculating probability. thanks @engineersmnky – WALVISK Jan 20 '23 at 14:40
  • 1
    ha, we have a different understanding of "fun" :-) just kidding. Using an Enumerator is nice. – Pascal Jan 23 '23 at 13:32