I'm working on exercise 13.7 from Think Python: How to Think Like a Computer Scientist. The goal of the exercise is to come up with a relatively efficient algorithm that returns a random word from a file of words (let's say a novel), where the probability of a word being returned is proportional to its frequency in the file.
The author suggests the following steps (there may be a better solution, but this is presumably the best solution for what we've covered so far in the book):

- Create a histogram showing `{word: frequency}`.
- Use the `keys` method to get a list of words in the book.
- Build a list that contains the cumulative sum of the word frequencies, so that the last item in this list is the total number of words in the book, `n`.
- Choose a random number from 1 to `n`.
- Use a bisection search to find the index where the random number would be inserted in the cumulative sum.
- Use the index to find the corresponding word in the word list.
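For reference, here is a minimal sketch of the steps above. It assumes the text is already in a string and splits on whitespace without stripping punctuation; the names `text` and `random_word` are mine, not the book's:

```python
import bisect
import random

def random_word(text):
    """Return a word with probability proportional to its frequency."""
    # Histogram: {word: frequency}
    hist = {}
    for word in text.split():
        hist[word] = hist.get(word, 0) + 1

    words = list(hist.keys())

    # Cumulative sum of frequencies; the last entry is n,
    # the total number of words in the text.
    cumulative = []
    total = 0
    for word in words:
        total += hist[word]
        cumulative.append(total)

    # Choose a random number from 1 to n, then bisect to find
    # which word's frequency interval it falls into.
    x = random.randint(1, cumulative[-1])
    index = bisect.bisect_left(cumulative, x)
    return words[index]
```

A word occupying `k` of the `n` cumulative slots is returned with probability `k/n`, which is exactly its relative frequency.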
My question is this: What's wrong with the following solution?
- Turn the novel into a list `t` of words, exactly as they appear in the novel, without eliminating repeat instances or shuffling.
- Generate a random integer from 0 to `n`, where `n = len(t) - 1`.
- Use that random integer as an index to retrieve a random word from `t`.
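In code, the alternative I'm describing would look something like this (again assuming the novel is already in a string; `random_word_simple` is my own illustrative name):

```python
import random

def random_word_simple(text):
    """Return a random word by indexing directly into the full word list."""
    # Keep every occurrence of every word, duplicates included.
    t = text.split()
    # A uniform index into t picks each *occurrence* equally likely,
    # so each distinct word is chosen in proportion to its frequency.
    i = random.randint(0, len(t) - 1)
    return t[i]
```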
Thanks.