
I'd like to pick an item from a hashtable according to a given probability. For example, I am storing the strings "apple", "banana" and "pineapple" in my hashtable. Now I'd like to draw an item from the hashtable according to its given probability: say the probability of getting "apple" is 30%, "banana" is 30% and "pineapple" is 40%. Could anyone help me with this?

The reason I need to use a Hashtable is that I am actually dealing with a large number of strings, which are the words in a certain book. The probability of a word depends on how often it occurs in the book. For example, if there are 100,000 words in the book and the word "dog" occurs 1,000 times, the probability of getting "dog" when I call my function should be 1,000/100,000.

J. Ye

2 Answers


This is your array of items:

[apple, banana, pineapple]

This is your array of probabilities:

[0.3, 0.3, 0.4]

This is your array of cumulative probabilities:

[0.3, 0.6, 1.0]

To pick a random item according to its probability, pick a random number R in the range [0, 1], then select the first item whose cumulative probability is greater than or equal to R.

For example, if you generate R = 0.52839, you choose banana, because banana is the first item whose cumulative probability (0.6) is greater than or equal to R.

You can binary search the cumulative array for the item specified by R, so each pick takes O(log n) time.
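The steps above can be sketched in Python as follows. This is only an illustration, not code from the answer: the function name `pick` and its optional parameter `r` are made up for the example, and the standard-library `bisect` module stands in for a hand-written binary search.

```python
import bisect
import itertools
import random

items = ["apple", "banana", "pineapple"]
probabilities = [0.3, 0.3, 0.4]

# Running sum of the probabilities: [0.3, 0.6, 1.0]
cumulative = list(itertools.accumulate(probabilities))
cumulative[-1] = 1.0  # guard against floating-point drift in the last entry


def pick(r=None):
    """Return the first item whose cumulative probability is >= r.

    `r` is a hypothetical parameter that lets you pass a fixed number
    instead of a random one, which makes the example easy to check.
    """
    if r is None:
        r = random.random()  # uniform in [0, 1)
    # bisect_left returns the first index i with cumulative[i] >= r
    index = bisect.bisect_left(cumulative, r)
    return items[index]


print(pick(0.52839))  # banana
print(pick())         # a random item, weighted 30/30/40
```

Building the cumulative array takes O(n) once; every later call to `pick` is a single binary search.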

I don't know of any way in which a hashtable is going to help you here. Simple arrays suffice.

Timothy Shields
  • Hi, thank you for the answer. The reason I am using a Hashtable is that I am actually dealing with a large number of strings, which are all the words in a book. The probability of each word is calculated from how often it occurs. That's why I thought of using a hashtable in this case. Could you help me with this? – J. Ye Nov 05 '13 at 18:24
  • @JiaxinAmelechYe Even if you have a very large number of items, you'll still want to do it this way. – Timothy Shields Nov 05 '13 at 18:38
  • @Timothy Shields I think my problem is that when I create the probabilities array from the book, I need to know how often each word occurs. I'd like to use a Hashtable to count the occurrences of the words, and I don't want to copy the elements into an array again when I want to pick words according to their probability. So it seems that creating the array is the only option? – J. Ye Nov 05 '13 at 18:48
  • @JiaxinAmelechYe Using a hashtable during the counting phase is very sensible. Once you have counted all of the words in your book, so that you have a hash table mapping words to counts, you can then convert this to two arrays: the array of items (ordered arbitrarily) and the array of counts (as doubles/reals). Then you convert those counts to cumulative counts. Then you normalize. – Timothy Shields Nov 05 '13 at 18:58
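A rough Python sketch of the pipeline described in the last comment, under the assumption that the book is already split into a list of words. The name `build_sampler` is hypothetical, and `collections.Counter` plays the role of the hashtable during the counting phase.

```python
import bisect
import itertools
import random
from collections import Counter


def build_sampler(words):
    """Count words, then build the arrays used for weighted picking."""
    if not words:
        raise ValueError("need at least one word")

    counts = Counter(words)   # hash table: word -> occurrence count
    items = list(counts)      # array of items, order is arbitrary but fixed

    # Cumulative counts, then normalize by the total so the last entry is 1.0
    cumulative = list(itertools.accumulate(counts[w] for w in items))
    total = cumulative[-1]
    normalized = [c / total for c in cumulative]
    normalized[-1] = 1.0      # guard against rounding in the last entry

    def pick(r=None):
        # `r` is a hypothetical hook for passing a fixed number when testing
        if r is None:
            r = random.random()
        return items[bisect.bisect_left(normalized, r)]

    return items, normalized, pick


words = ["dog", "cat", "cat", "cat"]
items, normalized, pick = build_sampler(words)
print(items, normalized)  # ['dog', 'cat'] [0.25, 1.0]
print(pick())             # 'dog' about 25% of the time, 'cat' about 75%
```

The hashtable is only needed while counting; after that, the two arrays are all that each pick touches.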

You should consider using an Alias Table. It's a very efficient method for dealing with large numbers of unequal probabilities.
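For illustration, here is a minimal Python sketch of one common way to build and use an alias table (the variant usually attributed to Vose). The answer does not say which construction it has in mind, so treat this as an assumption; the names `build_alias_table` and `sample` are made up for the example.

```python
import random


def build_alias_table(probabilities):
    """Build the two arrays of an alias table in O(n) time.

    Assumes `probabilities` is non-empty and sums to 1.
    """
    n = len(probabilities)
    scaled = [p * n for p in probabilities]  # average slot height becomes 1
    prob = [0.0] * n   # chance of keeping slot i's own item
    alias = [0] * n    # item used otherwise

    small = [i for i, p in enumerate(scaled) if p < 1.0]
    large = [i for i, p in enumerate(scaled) if p >= 1.0]

    while small and large:
        sm = small.pop()
        lg = large.pop()
        prob[sm] = scaled[sm]
        alias[sm] = lg  # the large item fills the rest of slot sm
        scaled[lg] = (scaled[lg] + scaled[sm]) - 1.0
        (small if scaled[lg] < 1.0 else large).append(lg)

    # Leftovers are exactly full, up to floating-point error
    for i in large + small:
        prob[i] = 1.0
        alias[i] = i

    return prob, alias


def sample(prob, alias):
    """Draw one index in constant time: one slot choice, one coin flip."""
    i = random.randrange(len(prob))
    return i if random.random() < prob[i] else alias[i]


items = ["apple", "banana", "pineapple"]
prob, alias = build_alias_table([0.3, 0.3, 0.4])
print(items[sample(prob, alias)])
```

Construction is linear in the number of items, and each draw afterwards does a fixed amount of work regardless of how many items there are.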

pjs
  • Thanks! I think your answer is basically the same as @Timothy Shields's. – J. Ye Nov 05 '13 at 22:51
  • @JiaxinAmelechYe No, it's not. Timothy Shields's approach still searches through the list of alternatives every time, even if it's a binary search. The alias table technique is constant time per operation after the table is constructed. – pjs Nov 05 '13 at 23:12
  • Oh, I see! Let me check it carefully. – J. Ye Nov 14 '13 at 06:31