I have a list of elements with weights:
{ id1, weight1 },
{ id2, weight2 },
...
{ idN, weightN }
Weights are small integers (say, less than 1000, often less than 50). The total number of ids in the list is less than 1000 as well. (Each id is listed only once.)
For each query I have to return a "random enough" element from the list. If I do E queries, where E is proportional to the sum of all weights, the number of times each element is returned must be exactly proportional to its weight. Note that this should also hold for smaller values of E (say, up to 50 * the sum of weights). See also the notes at the end of the question.
So far so good; I'd solve this task by putting the element ids into a circular list, duplicating each id weight times, then shuffling the list. Each query returns the head of the list and then increments the head position.
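A minimal sketch of this simple-case solution in Python (class and method names are mine, for illustration only):

```python
import random

class CircularPicker:
    """Circular list of ids, each repeated 'weight' times, shuffled once."""

    def __init__(self, weighted_ids):
        # weighted_ids: iterable of (id, weight) pairs
        self.pool = [i for i, w in weighted_ids for _ in range(w)]
        random.shuffle(self.pool)
        self.head = 0

    def query(self):
        item = self.pool[self.head]
        self.head = (self.head + 1) % len(self.pool)
        return item
```

Over any whole number of cycles through the pool, the counts are exactly proportional to the weights, regardless of the quality of the shuffle.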
But in this case I have one additional condition:
I have an additional parameter to the query: a filter. A filter is a map of id => is_enabled. If is_enabled is false for a given id, that id should be excluded from the results. The E value in the above restriction is calculated only over the enabled elements; that is, disabled element weights are excluded from the query.
Filters are "unique" for each query and include entries for each id
in the list. (Note that this implies 2^1000 potential filter values.)
Is there a way to solve this efficiently? I need the algorithm to be efficient on a multi-server cluster.
Note 1: I want to stress that, as I believe, selecting elements totally at random (as suggested in one of the answers), without storing any state, will not work. It will give an exactly proportional number of elements only over an infinite number of queries. A random number generator has every right to return unfair values for a long period of time.
Note 2: This task imposes no restrictions on the quality of the randomness. Come to think of it, it is not even necessary to shuffle the list in the simple-case solution above. Good randomness is better, but not necessary at all.
Note 3: Please note that 2^1000 potential filter values means that I cannot store anything associated with the filter value -- it would require too much memory. I can store something for the most recent (or most often used) filters, but I can't store things like the item list offset, because I can't afford to lose that data.
Note 4: We can't return metainformation with the query and let clients store the state for us (a good idea anyway -- thanks, Diacleticus). We can't because two clients may accidentally use the same filter (some filters are more popular than others), in which case we must use the same state for both queries. In fact, a client doing more than one query is a relatively rare event.