3

I've been thinking about how to implement something that, frankly, is beyond my mathematical skills. So here goes, feel free to try and point me in the right direction rather than complete code solutions any help I'd be grateful for.

So, imagine I've done an analysis of text and generated a table of the frequencies of different two-character combinations. I've stored these in a 26x26 array. eg.

  A B C D E F G H I J K L M N O P Q R S T U V W X Y Z
A 1 15 (frequency of AA, then frequency of AB etc.)
B 12 0 (freq of BA, BB etc..)
... etc.

So I want to randomly choose these two-character combinations but I'd like to 'weight' my choice based on the frequency. ie. the AB from above should be 15 times 'more likely' than AA. And, obviously, the selection should never return something like BB (ie. a frequency of 0 - in this example, obviously BB does occur in words like Bubble!! :-) ). For the 0 case I realise I could loop until I get a non-0 frequency but that's just not elegant because I have a feeling/intuition that there is a way to skew my average.

I was thinking to chose the first char of my pair - ie. the row - (I'm generating a 4-pair-sequence ultimately) I could just use the system random function (Random class.Next) then use the 'weighted' random algorithm to pick the second char.

Any ideas?

svick
  • 236,525
  • 50
  • 385
  • 514
Th3Minstr3l
  • 322
  • 2
  • 7

6 Answers6

5

Given your example sample, I would first create a cumulative series of all of the numbers (1, 15, 12, 0 => 1, 16, 28, 28).

Then I would produce a random number between 0 and 27 (let's say 19).

Then I would calculate that 19 was >=16 but <28, giving me bucket 3 (BA).

paul
  • 21,653
  • 1
  • 53
  • 54
  • This works perfectly for what I need, thank you. So now I'm selecting my first letter (row) by using a CryptoRNG then turning the columns of that row into a cumulative series. Generating a second Random Number and picking my second character out of the bucket I end up in. Thank you! :-) – Th3Minstr3l Feb 11 '13 at 11:17
5

There are some good suggestions in the other answers for your specific problem. To solve the general problem of "I have a source of random numbers conforming to a uniform probability distribution, but I would like it to conform to a given nonuniform probability distribution", then you can work out the quantile function, which is the function that performs that transformation. I give a gentle introduction that explains why the quantile function is the function you want here:

Generating Random Non-Uniform Data In C#

Eric Lippert
  • 647,829
  • 179
  • 1,238
  • 2,067
  • @OP: Just to link this to paul's answer, you could normalize paul's answer (i.e., generate a random number between 0 and 1 instead of 0 and the [Total Weight], then divide the bucket ranges by [Total Weight]). Paul is describing how to generate a quantile function in your specific case, though he omits normalizing the ranges to `0..1`. – Brian Jan 30 '13 at 18:32
  • Thanks to Eric Lippert for that article, very useful in implementing my solution. See my reply to Paul's post. – Th3Minstr3l Feb 11 '13 at 11:20
1

How about summing all the frequencies and using that from AA to ZZ to generate your pair.

Lets say you have a total frequency of pairs if the rnd return 0 you get AA if it returns 1-14 then its AB etc

Rob Foran
  • 47
  • 4
1

Use your frequency matrix to generate a complete set of values. Order the set by Random.Next(). Store the randomized set in an array. Then you can just select an element out if that array based on Random.Next(randomarray.Length).

If there is a mathematical way to calculate the frequency you could do that as well. But creating a precompiled and cached set will reduce the calculation time if this is called repeatedly.

As a note, depending on the max frequency this could require a good amount of storage. You would also want to create the instance of random before you loop to build the set. This is so you don't reseed the random generator.

...

Another way (similar to what you suggested at the end of your question) would be to do this in two passes with the first selecting the row and the second used your weighted frequency to select the column. That would just be the sum of the row frequencies bounded over a ranges. The first suggestion should give a more even distribution based on weight.

Matthew Whited
  • 22,160
  • 4
  • 52
  • 69
0

Take the sum of the probabilities. Take a random number between zero and that sum. Add up the probabilities until you get it's greater than or equal to your random number. Then use the item your on.

Eg pseudocode:

b = getProbabilites()
s = sum(b)
r = randomInt() % s
i = 0
acc = 0
while (acc < r) {
    acc += b[i]
    i++
}

return i
Triss Healy
  • 478
  • 5
  • 8
0

If efficiency is not a problem, you could create a key->value hash instead of an array. An upside of this would be that (if you format it well in the text) it would be very easy to update the values should the need arise. Something like

{
    AA => 5, AB => 2, AC => 4,
    BA => 6, BB => 5, BC => 9,
    CA => 2, CB => 7, CC => 8
}

With this, you could easily retrieve the value for the sequence you want, and quickly find the entry to update. If the table is automatically generated and extremely large, it could help to get/be familiar with vim's use of regular expressions.

Andy
  • 3,132
  • 4
  • 36
  • 68