You can run an n-gram analysis on some body of text and use the resulting statistics as the basis for the bias. You can do this either by letters or by syllables; doing it by syllables is probably more complicated.
Doing it by letters is easy: you iterate through each character of the source text, keeping track of the last n-1 characters you have seen. Then, for each new character, you add the last n-1 characters plus this new one (an n-gram) to your table of frequencies.
What does this table of frequencies look like? You could use a map from n-grams to their frequencies, but that is not very convenient for the algorithm I suggest below. For that it's better to map each (n-1)-gram to a map from the last letter of an n-gram to its frequency, something like std::map<std::string, std::map<char, int>>.
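A minimal sketch of that analysis step could look like the following; the function name buildTable and the parameter n are my own choices, only the map type comes from above:

```cpp
#include <cstddef>
#include <map>
#include <string>

// (n-1)-gram -> (next letter -> how often it follows that prefix)
using FreqTable = std::map<std::string, std::map<char, int>>;

// Count every n-gram in the source text.
FreqTable buildTable(const std::string& text, std::size_t n)
{
    FreqTable table;
    for (std::size_t i = 0; i + n <= text.size(); ++i) {
        std::string prefix = text.substr(i, n - 1); // the last n-1 characters
        char next = text[i + n - 1];                // the new character
        ++table[prefix][next];                      // record this n-gram
    }
    return table;
}
```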
Having made the analysis and collected the statistics, the algorithm goes like this (a sketch in code follows the list):
- pick a random starting n-gram; your analysis can also record weighted data about which letters usually start words;
- from all the n-grams that start with the previous n-1 letters, pick a random last letter (considering the weights from the analysis);
- repeat until you reach the end of a word (either by using a predefined length or based on data about word-ending frequencies).
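Putting those steps together, a generation loop might look roughly like this. generateWord, pickWeighted, the fixed target length, and taking the starting (n-1)-gram as a parameter are all simplifications of my own; pickWeighted is the weighted selection explained next:

```cpp
#include <cstddef>
#include <map>
#include <string>

using FreqTable = std::map<std::string, std::map<char, int>>;

// Weighted random choice of a letter; sketched after the explanation below.
char pickWeighted(const std::map<char, int>& weights);

// Grow a word letter by letter until it reaches roughly `length` characters
// or the table has no continuation for the current prefix.
std::string generateWord(const FreqTable& table, const std::string& start, std::size_t length)
{
    std::string word = start;                       // a random starting (n-1)-gram
    while (word.size() < length) {
        // the last n-1 letters of the word so far
        std::string prefix = word.substr(word.size() - start.size());
        auto it = table.find(prefix);
        if (it == table.end())
            break;                                  // no known n-gram continues this prefix
        word += pickWeighted(it->second);           // pick the next letter by weight
    }
    return word;
}
```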
To pick random values from a set of values with different weights, you can start by setting up a table of the cumulative frequencies. Then you pick a random number between 1 and the sum of the frequencies and see which interval it falls into.
For example:
- A happens 10 times;
- B happens 7 times;
- C happens 9 times;
You build the following table: { A: 10, B: 17, C: 26 }. You pick a number between 1 and 26. If it is at most 10, it's A; if it is between 11 and 17, it's B; if it is between 18 and 26, it's C.
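That interval test is what the pickWeighted helper assumed above would do. A sketch, where the random-number setup is my own choice:

```cpp
#include <map>
#include <random>

// Pick a letter with probability proportional to its frequency,
// using the cumulative-frequency table described above.
char pickWeighted(const std::map<char, int>& weights)
{
    static std::mt19937 rng{std::random_device{}()};

    int total = 0;                                   // sum of the frequencies (26 here)
    for (const auto& entry : weights)
        total += entry.second;

    std::uniform_int_distribution<int> dist(1, total);
    int roll = dist(rng);                            // a number between 1 and 26

    int cumulative = 0;                              // walks through 10, 17, 26
    for (const auto& entry : weights) {
        cumulative += entry.second;
        if (roll <= cumulative)
            return entry.first;                      // roll landed in this interval
    }
    return weights.rbegin()->first;                  // not reached when total > 0
}
```

For example, pickWeighted({{'A', 10}, {'B', 7}, {'C', 9}}) should return 'A' roughly 10 times out of 26.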