11

How can I randomly generate letters according to their frequency of use in common speech?

Any pseudo-code appreciated, but an implementation in Java would be fantastic. Otherwise just a poke in the right direction would be helpful.

Note: I don't need to generate the frequencies of usage - I'm sure I can look that up easily enough.

Tom R
  • 2
    dupe of http://stackoverflow.com/questions/2073235/random-weighted-choice and many others (search "weighted random generation") – Eli Bendersky Jan 27 '10 at 20:15
  • @Eli: sorry - didn't realise its name. – Tom R Jan 27 '10 at 20:38
  • 1
    `fEnglish = new[] {8.167f,1.492f,2.782f,4.253f,12.702f,2.228f,2.015f,6.094f, 6.966f,0.153f,0.772f,4.025f,2.406f,6.749f,7.507f,1.929f,0.095f,5.987f, 6.327f,9.056f,2.758f,0.978f,2.361f,0.150f,1.974f,0.074f};` and then... – Fattie Jul 11 '16 at 12:16
  • `public static int RandomFromFrequencyArray(this float[] f) { float sum = 0f; foreach (float ff in f) sum += ff; int kF = f.Length; int result = 0; float sumSoFar = f[0]; float percentageResult = Random.Range(0f, sum); while (sumSoFar < percentageResult) { ++result; if (result >= kF) { Debug.Log("woe..."); return (kF-1); } sumSoFar += f[result]; } return result; }` – Fattie Jul 11 '16 at 12:16
  • The frequency array does NOT HAVE TO ADD TO 100. So, it's totally fine to do this: `(new[] {15f,5f,5f,1f}).RandomFromFrequencyArray();` For example the vowels in English... just take the frequencies from the full alphabet frequencies (since it does not have to add to 100)... `int trueRandomVowel = (new[] {8.167f,12.702f,6.966f,7.507f,2.758f}).RandomFromFrequencyArray(); return ("aeiou".ToCharArray())[trueRandomVowel].ToString();` – Fattie Jul 11 '16 at 12:17

5 Answers

19

I am assuming that you store the frequencies as floating point numbers between 0 and 1 that sum to 1.

First you should prepare a table of cumulative frequencies, i.e. the sum of the frequency of that letter and all letters before it.

To simplify, if you start with this frequency distribution:

A  0.1
B  0.3
C  0.4
D  0.2

Your cumulative frequency table would be:

A  0.1
B  0.4 (= 0.1 + 0.3)
C  0.8 (= 0.1 + 0.3 + 0.4)
D  1.0 (= 0.1 + 0.3 + 0.4 + 0.2)

Now generate a random number between 0 and 1 and see where in this list that number lies. Choose the letter that has the smallest cumulative frequency larger than your random number. Some examples:

Say you randomly pick 0.612. This lies between 0.4 and 0.8, i.e. between B and C, so you'd choose C.

If your random number was 0.039, that comes before 0.1, i.e. before A, so choose A.
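The lookup described above can be sketched in Java roughly as follows. This is a minimal illustration, not a definitive implementation: the class name `CumulativePicker` and its methods are hypothetical, and it assumes the frequencies are non-negative and sum to (approximately) 1.

```java
import java.util.Random;

/** Hypothetical helper illustrating the cumulative-frequency lookup. */
final class CumulativePicker {
    private final char[] letters;
    private final double[] cumulative; // cumulative[i] = sum of frequencies[0..i]

    CumulativePicker(char[] letters, double[] frequencies) {
        this.letters = letters.clone();
        this.cumulative = new double[frequencies.length];
        double sum = 0.0;
        for (int i = 0; i < frequencies.length; i++) {
            sum += frequencies[i];
            cumulative[i] = sum;
        }
    }

    /** Returns the letter with the smallest cumulative frequency larger than x. */
    char pick(double x) {
        for (int i = 0; i < cumulative.length; i++) {
            if (x < cumulative[i]) {
                return letters[i];
            }
        }
        // Rounding can leave the final total slightly below 1.0,
        // so fall back to the last letter.
        return letters[letters.length - 1];
    }

    /** Draws one letter; nextDouble() is uniform in [0.0, 1.0). */
    char next(Random random) {
        return pick(random.nextDouble());
    }
}
```

With the four-letter table above, `pick(0.612)` gives C and `pick(0.039)` gives A, matching the two worked examples.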

I hope that makes sense, otherwise feel free to ask for clarifications!

Mark Byers
11

One quick way to do it would be to generate a list of letters, where each letter appeared in the list in accordance with its frequency. Say, if "e" was used 25.6% of the time, and your list had length 1000, it would have 256 "e"s.

Then you could just randomly pick spots from the list by using (int) (Math.random() * 1000) to generate random numbers between 0 and 999.
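A rough Java sketch of this list approach follows. The class name `LetterPool` and its methods are made up for illustration, and the per-letter counts are assumed to be precomputed by the caller (e.g. 256 for a letter used 25.6% of the time in a pool of 1000). It uses `Random.nextInt(bound)` rather than `(int) (Math.random() * 1000)`; both yield an index from 0 to size - 1.

```java
import java.util.Random;

/** Hypothetical helper: a pool in which each letter is repeated by its count. */
final class LetterPool {
    private final char[] pool;

    /** counts[i] copies of letters[i] are placed in the pool. */
    LetterPool(char[] letters, int[] counts) {
        int size = 0;
        for (int count : counts) {
            size += count;
        }
        pool = new char[size];
        int pos = 0;
        for (int i = 0; i < letters.length; i++) {
            for (int j = 0; j < counts[i]; j++) {
                pool[pos++] = letters[i];
            }
        }
    }

    int size() {
        return pool.length;
    }

    char at(int index) {
        return pool[index];
    }

    /** Picks a uniformly random slot; frequent letters occupy more slots. */
    char next(Random random) {
        return pool[random.nextInt(pool.length)];
    }
}
```

The resolution is limited by the pool size: with 1000 slots, any letter rarer than 0.1% either gets no slot or is over-represented.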

danben
  • The best way to match the letter frequency of a particular text :-) – phkahler Jan 27 '10 at 20:55
  • 2
    +1 It's a good suggestion, but not ideal if you have characters that occur with very small frequencies (e.g. 0.00001 or less). I guess it depends on what you need. – Mark Byers Jan 27 '10 at 21:00
  • This has obvious precision limits, but might be preferable because it is so simple to implement and understand. – Schamp Jan 27 '10 at 21:20
  • This is indeed the "video game approach" to randomness: you make a list that has the things you want in the rough frequency you want them, and choose one. – Fattie Nov 30 '15 at 03:01
5

What I would do is scale the relative frequencies as floating point numbers such that their sum is 1.0. Then I would create an array of the cumulative totals per letter, i.e. for each letter, the sum of the frequencies of all letters "below" it, which is the value a random number must reach to get that letter. Say the frequency of A is 10%, B is 2% and Z is 1%; then your table would look something like this:

0.000 A ; from 0% to 10% gets you an A
0.100 B ; above 10% is at least a B
0.120 C ; 12% for C...
...
0.990 Z ; if your number is >= 99% then you get a Z

Then you generate yourself a random number between 0.0 and 1.0 and do a binary search in the array for the largest threshold that is less than or equal to your random number. Then pick the letter at that position. Done.
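As a hedged sketch of this threshold table in Java: the class name `ThresholdTable` is hypothetical, and the code assumes every weight is strictly positive, since duplicate thresholds would make the result of `Arrays.binarySearch` ambiguous. When the key is not found, `Arrays.binarySearch` returns `-(insertion point) - 1`, so the largest threshold not exceeding the key sits one slot before the insertion point.

```java
import java.util.Arrays;
import java.util.Random;

/** Hypothetical helper: lower-bound thresholds searched with binary search. */
final class ThresholdTable {
    private final char[] letters;
    private final double[] lowerBounds; // lowerBounds[i] = share of all letters before i

    ThresholdTable(char[] letters, double[] weights) {
        this.letters = letters.clone();
        this.lowerBounds = new double[weights.length];
        double total = 0.0;
        for (double w : weights) {
            total += w;
        }
        double sum = 0.0;
        for (int i = 0; i < weights.length; i++) {
            lowerBounds[i] = sum / total; // scale so the thresholds span [0.0, 1.0)
            sum += weights[i];
        }
    }

    /** Finds the largest threshold <= x, for x in [0.0, 1.0). */
    char pick(double x) {
        int idx = Arrays.binarySearch(lowerBounds, x);
        if (idx < 0) {
            idx = -idx - 2; // insertion point minus one
        }
        return letters[idx];
    }

    char next(Random random) {
        return pick(random.nextDouble());
    }
}
```

For 26 letters the gain over a linear scan is small; binary search pays off mainly for much larger symbol sets.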

Carl Smotricz
4

Not even pseudo-code, but a possible approach is as follows:

Let p1, p2, ..., pk be the frequencies that you want to match.

  1. Calculate the cumulative frequencies: p1, p1+p2, p1+p2+p3, ... , 1
  2. Generate a random uniform (0,1) number x
  3. Check which interval of the cumulative frequencies x belongs to: if x is below p1, output the first letter; if it is between, say, p1+...+pi and p1+...+pi+p(i+1), then output the (i+1)st letter

Depending on how you implement the interval-finding, the procedure is usually more efficient if the p1,p2,... are sorted in decreasing order, because you will usually find the interval containing x sooner.
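To illustrate the sorting remark, here is a minimal Java sketch that orders the letters by decreasing frequency before building the cumulative table, so a linear scan usually stops after a few comparisons. The class name `SortedScanPicker` and its methods are hypothetical, and the frequencies are assumed to sum to about 1.

```java
import java.util.Arrays;
import java.util.Random;

/** Hypothetical helper: linear scan over letters sorted most-frequent first. */
final class SortedScanPicker {
    private final char[] letters;      // reordered, most frequent first
    private final double[] cumulative; // running totals in that same order

    SortedScanPicker(char[] symbols, double[] frequencies) {
        int n = symbols.length;
        Integer[] order = new Integer[n];
        for (int i = 0; i < n; i++) {
            order[i] = i;
        }
        // Sort the indices by decreasing frequency.
        Arrays.sort(order, (a, b) -> Double.compare(frequencies[b], frequencies[a]));

        letters = new char[n];
        cumulative = new double[n];
        double sum = 0.0;
        for (int i = 0; i < n; i++) {
            letters[i] = symbols[order[i]];
            sum += frequencies[order[i]];
            cumulative[i] = sum;
        }
    }

    /** The scan order, e.g. for checking which letters are tested first. */
    String order() {
        return new String(letters);
    }

    char pick(double x) {
        for (int i = 0; i < cumulative.length; i++) {
            if (x < cumulative[i]) {
                return letters[i];
            }
        }
        return letters[letters.length - 1]; // guard against rounding in the total
    }

    char next(Random random) {
        return pick(random.nextDouble());
    }
}
```

Reordering changes which sub-interval of (0,1) belongs to each letter, but not its width, so every letter keeps its probability.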

Aniko
2

Using a binary search tree gives you a nice, clean way to find the right entry. Here, you start with a frequency map, where the keys are the symbols (English letters) and the values are the frequencies of their occurrence. This gets inverted, and a NavigableMap (a TreeMap) is created where the keys are cumulative frequencies and the values are symbols. That makes the lookup easy: higherEntry returns the entry with the smallest key strictly greater than the random number.

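  import java.util.Map;
  import java.util.NavigableMap;
  import java.util.Random;
  import java.util.TreeMap;

  public class Frequency
  {
    private final Random generator = new Random();

    // Keys are cumulative frequency totals; values are symbols.
    private final NavigableMap<Float, Integer> table = 
      new TreeMap<Float, Integer>();

    private final float max;

    public Frequency(Map<Integer, Float> frequency)
    {
      float total = 0;
      for (Map.Entry<Integer, Float> e : frequency.entrySet()) {
        // Skip non-positive weights: an unchanged total would
        // overwrite the previous symbol's entry.
        if (e.getValue() <= 0) continue;
        total += e.getValue();
        table.put(total, e.getKey());
      }
      max = total;
    }

    /** 
     * Choose a random symbol. The choices are weighted by frequency.
     */ 
    public int roll()
    {
      float key = generator.nextFloat() * max;
      Map.Entry<Float, Integer> e = table.higherEntry(key);
      // If rounding makes key equal to max, no higher entry exists.
      return (e != null ? e : table.lastEntry()).getValue();
    }
  }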
erickson