I have a list of elements with weights:
{ id1, weight1 },
{ id2, weight2 },
...
{ idN, weightN }
Weights are small integers (say, less than 1000, often less than 50). The total number of ids in the list is less than 1000 as well. (Each id is listed only once.)
For each query I have to return a "random enough" element from the list. If I do E queries, where E is proportional to the sum of all weights, the number of times each element is returned must be exactly proportional to its weight. Note that this should also hold for smaller values of E (say, up to 50 * the sum of weights). See also the notes at the end of the question.
So far so good; I'd solve this task by putting the element ids into a circular list, duplicating each id weight times, then shuffling the list. Each query returns the head of the list and then increments the head position.
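A minimal sketch of this simple-case solution in Python (class and method names are mine, for illustration only):

```python
import random

class CircularPicker:
    """Circular list of ids, each repeated 'weight' times, shuffled once."""

    def __init__(self, weighted_ids):
        # weighted_ids: iterable of (id, weight) pairs
        self.pool = [i for i, w in weighted_ids for _ in range(w)]
        random.shuffle(self.pool)
        self.head = 0

    def query(self):
        item = self.pool[self.head]
        self.head = (self.head + 1) % len(self.pool)
        return item
```

Over any whole number of cycles through the pool, the counts are exactly proportional to the weights, regardless of the quality of the shuffle.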
But in this case I have one additional condition:
I have an additional parameter to the query: a filter. A filter is a map of id => is_enabled. If is_enabled is false for a given id, that id should be excluded from the results. The E value in the above restriction is calculated only over the enabled elements; that is, disabled element weights are excluded from the query.
Filters are "unique" for each query and include entries for each id
in the list. (Note that this implies 2^1000 potential filter values.)
Is there a way to solve this efficiently? I need the algorithm to be efficient on a multi-server cluster.
Note 1: I want to stress that, as I believe, selecting elements totally at random (as suggested in one of the answers), without storing any state, will not work. It will give an exactly proportional number of elements only over an infinite number of queries. A random number generator has every right to return unfair values for a long period of time.
Note 2: This task imposes no restrictions on the quality of the randomness. Come to think of it, it is not even necessary to shuffle the list in the simple-case solution above. Good randomness is better, but not necessary at all.
Note 3: Please note that 2^1000 potential filter values means that I cannot store anything associated with the filter value -- it would require too much memory. I can store something for the most recent (or most often used) filters, but I can't store things like the item list offset, because I can't afford to lose that data.
Note 4: We can't return metainformation with the query and let clients store the state for us (a good idea anyway -- thanks, Diacleticus). We can't because two clients may accidentally use the same filter (some filters are more popular than others), in which case we must use the same state for both queries. In fact, a client doing more than one query is a relatively rare event.