Why is a list of cumulative frequency sums required for implementing a random word generator?

Question

I'm working on exercise 13.7 from Think Python: How to Think Like a Computer Scientist. The goal of the exercise is to come up with a relatively efficient algorithm that returns a random word from a file of words (let's say a novel), where the probability of a word being returned is proportional to its frequency in the file.

The author suggests the following steps (there may be a better solution, but this is presumably the best one given what we've covered so far in the book).

1. Create a histogram showing {word: frequency}.
2. Use the keys method to get a list of words in the book.
3. Build a list that contains the cumulative sum of the word frequencies, so that the last item in this list is the total number of words in the book, n.
4. Choose a random number from 1 to n.
5. Use a bisection search to find the index where the random number would be inserted in the cumulative sum.
6. Use the index to find the corresponding word in the word list.
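
The steps above can be sketched roughly as follows. This is only an illustration, not the book's own solution: the function names build_histogram, word_at, and random_word are made up for this example, and it assumes the text has already been split into a list of words.

```python
import bisect
import random
from itertools import accumulate


def build_histogram(words):
    """Step 1: map each distinct word to the number of times it occurs."""
    hist = {}
    for word in words:
        hist[word] = hist.get(word, 0) + 1
    return hist


def word_at(words, cumulative, r):
    """Steps 5-6: find the word whose cumulative range contains r (1 <= r <= n)."""
    # bisect_left returns the first index whose cumulative sum is >= r.
    index = bisect.bisect_left(cumulative, r)
    return words[index]


def random_word(hist):
    """Steps 2-6: pick a word with probability proportional to its frequency."""
    words = list(hist.keys())                               # step 2
    cumulative = list(accumulate(hist[w] for w in words))   # step 3
    n = cumulative[-1]                                      # total word count
    r = random.randint(1, n)                                # step 4 (inclusive)
    return word_at(words, cumulative, r)


hist = build_histogram("the cat saw the dog and the bird".split())
print(random_word(hist))
```

With this sample sentence, "the" occupies three of the eight positions in the cumulative range, so it should come back about three times as often as any other word.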

My question is this: What's wrong with the following solution?

1. Turn the novel into a list t of words, exactly as they appear in the novel, without eliminating repeat instances or shuffling.
2. Generate a random integer from 0 to n, where n = len(t) - 1.
3. Use that random integer as an index to retrieve a random word from t.
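
In code, the alternative described above might look something like this. It is a minimal sketch; the name random_word_from_list is made up for illustration, and t is assumed to be the full list of words in order.

```python
import random


def random_word_from_list(t):
    """Pick a word uniformly from the full word list, repeats included."""
    n = len(t) - 1
    # random.randint includes both endpoints, so every index 0..n is possible.
    return t[random.randint(0, n)]


t = "the cat saw the dog and the bird".split()
print(random_word_from_list(t))
```

Because repeated words occupy several positions in t, frequent words are chosen more often without any extra bookkeeping.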

Thanks.

score 1 · Answer 1 · answered Sep 14 '14 at 17:45

Your approach is (also) correct, but it uses space proportional to the input text size. The approach suggested by the book uses space proportional only to the number of distinct words in the input text, which is usually much smaller. (Think about how often words like "the" appear in English text.)

j_random_hacker

Thank you for your response. Follow-up question: Doesn't the conversion of the novel into a dictionary/histogram also use space proportional to the input text size? – doubledherin Sep 14 '14 at 19:03

@doubledherin: No, because you don't have to read the entire novel into memory in order to create the dictionary and histogram. You could read as little as a word at a time. The dictionary and histogram have size proportional to the number of distinct words, which is usually much smaller than the total number of words. – rici Sep 14 '14 at 20:13
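
To illustrate the point in this comment, here is one way the histogram could be built without holding the whole novel in memory. It is a sketch under some assumptions: the function name histogram_from_file is made up, the file is read a line at a time rather than a word at a time, and words are lowercased and stripped of surrounding punctuation, which the original exercise may or may not require.

```python
import string


def histogram_from_file(path):
    """Count word frequencies while reading the file one line at a time."""
    hist = {}
    with open(path, encoding="utf-8") as f:
        for line in f:  # only the current line is held in memory
            for word in line.split():
                # Assumption: normalize case and drop surrounding punctuation.
                word = word.strip(string.punctuation).lower()
                if word:
                    hist[word] = hist.get(word, 0) + 1
    return hist
```

At any moment only the current line and the dictionary are in memory, so space grows with the number of distinct words rather than with the length of the text.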
