I have a hash that maps IDs to their weights.

y = { 1 => 0.7, 2 => 0.2, 3 => 0.1 }

I would like to shuffle this hash according to the weights.

I tried a number of different approaches, all of which give me similar, unexpected results. Here's the most succinct one I found.

y.sort_by {|v| -v[1]*rand()}

When I run this ten thousand times and pick out the first IDs, I get the following counts:

{1=>8444, 2=>1316, 3=>240}

I expected those counts to reflect the weights above (e.g., 1 => 7000). It's not clear to me why this shuffling does not match those weights. Can someone clear up my confusion and tell me how to fix it?

Here are a few of the helpful sources I found:

JHo
  • An example of why this _won't_ work. Assume we have the hash `{ 1 => 0.7, 2 => 0.3}`. When we're choosing the random weight for 1, it will be bigger than 0.3 exactly 4/7 of the time, and thus definitely larger than the number we pick for 2. The other 3/7 of the time, it will be randomly between 0.0 and 0.3 and have a 1/2 chance of being bigger than the number we pick for 2. So it's ordered first `4/7 + (3/7)*(1/2) == 78.6%` of the time, when it should be ordered first 70% of the time. – JKillian Mar 06 '15 at 04:04
  • What you need to do is construct a (cumulative) distribution function, then, letting `rn = rand` (a number between `0.0` and `1.0`), select `1` if `rn < 0.7`, `2` if `0.7 <= rn < 0.9`, and `3` if `rn >= 0.9`. – Cary Swoveland Mar 06 '15 at 05:57
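
The arithmetic in JKillian's comment can be checked with a short simulation. This is only a sketch: the two-element hash comes from the comment, while the names `y2`, `exact`, `observed`, the seed, and the trial count are arbitrary choices made for illustration.

```ruby
# Exact probability that ID 1 sorts first for { 1 => 0.7, 2 => 0.3 }
# when sorting by weight * rand, following the comment's reasoning.
exact = Rational(4, 7) + Rational(3, 7) * Rational(1, 2) # 11/14

# Empirical check with a seeded generator (the seed is an arbitrary assumption).
rng = Random.new(1)
y2 = { 1 => 0.7, 2 => 0.3 }
trials = 100_000

# Count how often ID 1 ends up first under the question's sort_by approach.
wins = trials.times.count { y2.sort_by { |_, w| -w * rng.rand }.first.first == 1 }
observed = wins.fdiv(trials)

puts "exact:    #{exact.to_f.round(3)}"
puts "observed: #{observed.round(3)}"
```

Both figures should land near 0.786 rather than 0.7, which is the same kind of skew the question reports for three IDs.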

4 Answers

6

Here's another way to perform weighted random sampling using Enumerable#max_by and this amazing result from Efraimidis and Spirakis:

Given a hash whose values represent probabilities that sum to 1, we can get a weighted random sampling like this:

# hash of ids with their respective weights that sum to 1
y = { 1 => 0.7, 2 => 0.2, 3 => 0.1 }

# lambda that randomly returns a key from y in proportion to its weight
wrs = -> { y.max_by { |_, weight| rand ** (1.0/weight) }.first }

# test run to see if it works
10_000.times.each_with_object(Hash.new(0)) { |_, freq| freq[wrs.call] += 1 }

# => {1=>6963, 3=>979, 2=>2058}
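
The lambda above returns a single key, whereas the question asks for a full shuffle. One possible extension, sketched here under the assumption that sorting all keys by the same Efraimidis and Spirakis key in descending order gives the weighted ordering, is shown below. The method name `weighted_shuffle` and the seeded `rng` parameter are hypothetical and not part of the original answer.

```ruby
# Hypothetical helper: order every key by the random key u ** (1 / weight),
# largest first, instead of taking only the maximum.
def weighted_shuffle(weights, rng = Random.new)
  weights.sort_by do |_, weight|
    u = rng.rand               # uniform in [0, 1)
    -(u ** (1.0 / weight))     # negate so the largest key sorts first
  end.map(&:first)
end

y = { 1 => 0.7, 2 => 0.2, 3 => 0.1 }
rng = Random.new(42) # arbitrary seed, only to make the run repeatable

# Tally which ID comes first over many shuffles.
firsts = Hash.new(0)
10_000.times { firsts[weighted_shuffle(y, rng).first] += 1 }
p firsts
```

The first-position counts should come out roughly in the ratio 7 : 2 : 1, and each call returns all three IDs in some order.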

On a side note, there has been talk of adding weighted random sampling to Array#sample, but the feature seems to have got lost in the shuffle.

Further reading:

  1. Ruby-Doc for Enumerable#max_by — specifically the wsample example
  2. Weighted Random Sampling by Efraimidis and Spirakis (2005) which introduces the algorithm
  3. New features for Array#sample, Array#choice which mentions the intention of adding weighted random sampling to Array#sample
O-I
  • Regarding 2. Weighted Random Sampling: Is there a formula to compute the probability of an item to be at a specific index or a range of indices? – Maxim Lopin Mar 06 '23 at 23:38
2

Here's a most likely inefficient but hopefully effective enough solution. (I make no promises about correctness, though! Plus, the code isn't going to make too many Rubyists happy...)

The essence of the algorithm is as simple as picking an element randomly based on its weight, removing it, and then repeating with the remaining elements.

def shuffle some_hash
   result = []

   numbers = some_hash.keys
   weights = some_hash.values

   # choose numbers one by one
   until numbers.empty?
      # pick a point in the total range of the remaining weights;
      # recomputing the sum each pass avoids floating-point drift
      selection = rand() * weights.reduce(:+)

      # find which element this corresponds with; default to the last
      # element in case rounding leaves selection just past the end
      i = weights.size - 1
      weights.each_with_index do |weight, index|
         selection -= weight
         if selection < 0
            i = index
            break
         end
      end

      # add number to result and remove corresponding weight
      result << numbers.delete_at(i)
      weights.delete_at(i)
   end

   result
end
JKillian
  • This works well and is easy to read. I ran it a bunch and it worked as expected. Thank you. – JHo Mar 06 '15 at 14:45
1

You gave the probability mass function (P for "probability"):

P(1) = 0.7
P(2) = 0.2
P(3) = 0.1

You need to construct the (cumulative) distribution function, which looks like this:

[Figure: plot of the (cumulative) distribution function]

We can now generate random numbers between zero and one, plot them on the Y axis, draw a line to the right to see where they intersect the distribution function, then read the associated X coordinate as the random variate. So if the random number is less than 0.7, the random variate is 1; if it is between 0.7 and 0.9, the random variate is 2; and if it exceeds 0.9, the random variate is 3. (Note that the probability that rand will equal 0.7 (say) exactly is virtually zero, so we don't have to worry about distinguishing between < 0.7 and <= 0.7.)

To implement that, first calculate the hash df:

y = { 1 => 0.7, 2 => 0.2, 3 => 0.1 }

last = 0.0
df = y.each_with_object({}) { |(v,p),h| last += p; h[last.round(10)] = v }
  #=> {0.7=>1, 0.9=>2, 1.0=>3}

And now we can create a random variate as follows:

def rv(df)
  rn = rand
  df.find { |p,_| rn < p }.last
end
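
The `find` call above scans the cumulative values in order, which is fine for three IDs. For a larger set of IDs, a binary search over the same cumulative values might be worth considering. The following is a rough sketch, not part of the original answer: `build_cdf` and `draw` are hypothetical names, and the cumulative values are kept as an array of `[upper_bound, id]` pairs rather than the hash `df`.

```ruby
# Hypothetical helper: cumulative upper bounds paired with their IDs,
# e.g. roughly [[0.7, 1], [0.9, 2], [1.0, 3]] up to floating-point rounding.
def build_cdf(weights)
  total = 0.0
  weights.map { |id, w| [total += w, id] }
end

# Hypothetical helper: binary-search for the first bound that exceeds rn.
# Falls back to the last ID if rounding leaves the final bound just below rn.
def draw(cdf, rn = rand)
  pair = cdf.bsearch { |upper, _| rn < upper } || cdf.last
  pair.last
end

cdf = build_cdf({ 1 => 0.7, 2 => 0.2, 3 => 0.1 })
puts draw(cdf, 0.35) # falls in the first interval
puts draw(cdf, 0.75) # falls in the second interval
puts draw(cdf, 0.95) # falls in the third interval
```

`Array#bsearch` in find-minimum mode needs the block to flip from false to true exactly once along the array, which holds here because the upper bounds are non-decreasing.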

Let's try it:

def count(df,n)
  n.times.each_with_object(Hash.new(0)) { |_,count|
    count[rv(df)] += 1 }
end

n = 10_000
count(df,n)
  #=> {1=>6993, 2=>1960, 3=>1047} 
count(df,n)
  #=> {1=>6986, 2=>2042, 3=>972} 
count(df,n)
  #=> {1=>6970, 2=>2039, 3=>991} 

Note that the order of the key-value pairs in count is determined by the outcomes of the first few random variates, so the keys will not necessarily be in the order shown here.

Cary Swoveland
  • Thank you for your thorough response. The chart certainly makes clear why a CDF is needed. In the end, I chose a different response that made it easy for me to set up a method to shuffle the hash, which is based on a CDF. Thanks again. – JHo Mar 06 '15 at 14:51
0

If you make your weights integer values, like this:

y = { 1 => 7, 2 => 2, 3 => 1 }

Then you could construct an array where the number of occurrences of each item in the array is based on the weights:

weighted_occurrences = y.flat_map { |id, weight| Array.new(weight, id) }
# => [1, 1, 1, 1, 1, 1, 1, 2, 2, 3]

Then doing a weighted shuffle is as simple as:

weighted_occurrences.shuffle.uniq

After 10,000 shuffles and picking out the first IDs, I get:

{
  1 => 6988,
  2 => 1934,
  3 => 1078
}
Matt Brictson
  • Thank you for the answer. I like the hunger games style lottery, but in the end I decided it would be easier to allow decimal weights than have to convert those to integers. – JHo Mar 06 '15 at 14:47
  • Fair enough. Thanks for the interesting question! I had fun with coming up with an answer. – Matt Brictson Mar 06 '15 at 19:18