2

I would like to design a data structure and algorithm such that, given an array of elements, where each element has a weight in the interval [a, b], I can achieve constant time insertion and deletion. Deletion is performed randomly, where the probability of an element being deleted is proportional to its weight.

I do not believe there is a deterministic algorithm that can achieve both operations in constant time, but I think there are randomized algorithms that can accomplish this?

user5965026
  • 465
  • 5
  • 16
  • In what way is this significantly different from your earlier question https://stackoverflow.com/questions/66534443/what-data-structure-can-achieve-random-pop-and-push-in-better-than-on-time? – pjs Mar 08 '21 at 19:39
  • The earlier question mentioned those weights as an example. So you're saying that asking about a specific case constitutes a different question? – pjs Mar 08 '21 at 20:06
  • @pjs Uh no? I just made up an example in the earlier question and didn't even realize I made it so it's similar as to what I'm asking. This is an entirely different question to see if there's anything significant about the weight constraint. If it makes you happier, I'll change the earlier question's example to 2 other numbers. – user5965026 Mar 08 '21 at 21:00
  • The algorithm's performance depends on there being bounds; it is only efficient because the weights cannot be arbitrarily small or arbitrarily large. This would not be a suitable solution in the general case, and it's not trivial to implement it in the general case because the rejection sampling requires generating a random number in the interval [0,b) which means knowing what b is (i.e. knowing the largest possible weight). A data structure which supports efficient max queries and efficient deletion by index would be significantly more complicated. – kaya3 Mar 09 '21 at 05:06
  • @pjs It is not asking about a specific case; in this problem there are more guarantees on the input and those guarantees should be exploited to achieve a more efficient algorithm. It is no more the same problem than searching in an unsorted list (linear search) and searching in a sorted list (binary search) are the same problem; adding a constraint on the input changes it substantially. – kaya3 Mar 09 '21 at 05:11
  • More simply, my answer to this question would be a wrong answer to the other question, and vice versa, my answer to the other question would be a bad answer to this one. So they are not the same question because they do not have the same answer. – kaya3 Mar 09 '21 at 05:15
  • There is nothing special about 1 and 2, but the lower bound *a* must be greater than zero, not equal to zero. As mentioned in my answer, the performance scales according to the ratio *b/a*, so the smaller *a* is (i.e. the smaller the weights are allowed to be), the longer the running time. – kaya3 Mar 09 '21 at 05:24
  • You cannot have negative weights in a probability distribution, and "shifting" by some amount would change the distribution. You seem to be confused about something basic here but it is hard to tell what. Yes, the performance depends on the distribution of the weights in the interval [a,b]; as previously stated, the expected *b/a* iterations is in the case where all weights are *a*, which is the worst case. – kaya3 Mar 09 '21 at 05:35
  • @kaya3 Yeah I think I've been thinking so long about this problem that my mind is going dumb. I don't know why I said negative weights are possible and a simple shift would correct that. That doesn't make sense – user5965026 Mar 09 '21 at 05:37
  • No worries, hopefully I have clarified what you were unsure of, anyway. – kaya3 Mar 09 '21 at 05:38
  • @kaya3 Yeah I think so. How did you come up with the intuition for your randomized algorithm? It wasn't obvious to me initially why it works. – user5965026 Mar 09 '21 at 05:46
  • The intuition is that there is an efficient way to generate a random sample without the weights, and then you can get the correct distribution by discarding the sample with the right probability; it's somewhat analogous to sampling from a uniform disc by sampling from a square and then rejecting the sample if it's outside of the inscribed circle. This wikipedia article might help: https://en.wikipedia.org/wiki/Rejection_sampling – kaya3 Mar 09 '21 at 05:55
  • @kaya3 Yeah, I've definitely used rejection sampling before like the one you mentioned with sampling points from a joint uniform distribution and discarding points out of a square. It never occurred to me that I could approach this problem in the same way. – user5965026 Mar 09 '21 at 05:58

3 Answers

2

I don't know if O(1) worst-case time is impossible; I don't see any particular reason it should be. But it's definitely possible to have a simple data structure which achieves O(1) expected time.

The idea is to store a dynamic array of pairs (or two parallel arrays), where each item is paired with its weight; insertion is done by appending in O(1) amortised time, and an element can be removed by index by swapping it with the last element, so that it can be removed from the end of the array in O(1) time. To sample a random element from the weighted distribution, choose a random index and generate a random number in the half-open interval [0, 2), where 2 is the upper bound on the weights; if it is less than that element's weight, select the element at that index, otherwise repeat this process until an element is selected. This works because each index is equally likely to be chosen, and the probability that it is kept rather than rejected is proportional to its weight.

This is a Las Vegas algorithm, meaning it is expected to complete in a finite time, but with very low probability it can take arbitrarily long to complete. The number of iterations required to sample an element will be highest when every weight is exactly 1, in which case it follows a geometric distribution with parameter p = 1/2, so its expected value is 2, a constant which is independent of the number of elements in the data structure.

In general, if all weights are in an interval [a, b] for real numbers 0 < a <= b, then the expected number of iterations is at most b/a. This is always a constant, but it is potentially a large constant (i.e. it takes many iterations to select a single sample) if the lower bound a is small relative to b.
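For concreteness, here is a minimal Python sketch of the structure described above (the class name `WeightedBag` is made up for illustration, and it assumes the weights lie in [1, 2]; for a general interval [a, b] the acceptance test would draw from [0, b) instead):

```python
import random

class WeightedBag:
    """Minimal sketch: weights are assumed to lie in [1, 2]."""

    def __init__(self):
        self.items = []  # dynamic array of (element, weight) pairs

    def insert(self, element, weight):
        # Append at the end of the array: O(1) amortised.
        self.items.append((element, weight))

    def pop_random(self):
        # Rejection sampling: pick a uniform index, accept with probability weight / 2.
        # With all weights in [1, 2], this takes at most 2 iterations in expectation.
        while True:
            i = random.randrange(len(self.items))
            element, weight = self.items[i]
            if 2.0 * random.random() < weight:  # uniform draw from [0, 2)
                break
        # Remove by index in O(1): overwrite with the last pair, then pop from the end.
        self.items[i] = self.items[-1]
        self.items.pop()
        return element
```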

kaya3
  • 47,440
  • 4
  • 68
  • 97
  • I'm confused why we generate a random number in the half open interval of $[0, 2)$. Where did these bounds come from? – user5965026 Mar 08 '21 at 19:55
  • @user5965026 see the example below, and redo it step by step if not convinced – aka.nice Mar 08 '21 at 21:46
  • The worst it can be is if all the weights are equal to *a* - if any weights are larger then it is just more likely to terminate sooner. If each iteration has an *a/b* chance of terminating then it takes an expected *b/a* iterations. – kaya3 Mar 09 '21 at 03:31
  • The Monte Carlo simulations also appear to be independent of `a` and `b`. I'm essentially always getting about 2 for the number of iterations for any choice of a, b, n. – user5965026 Mar 09 '21 at 03:33
  • If you are choosing the weights uniformly in the interval [a,b] then that sounds about right; if most of your weights are at the lower bound then you'll see a higher expected number of iterations. – kaya3 Mar 09 '21 at 05:27
1

This is not an answer per se, but just a tiny example to illustrate the algorithm devised by @kaya3.

| value | weight |
| ----- | ------ |
| v1    | 1.0    |
| v2    | 1.5    |
| v3    | 1.5    |
| v4    | 2.0    |
| v5    | 1.0    |
| total | 7.0    |

The total weight is 7.0. It's easy to maintain in O(1) by storing it separately and increasing/decreasing it at each insertion/removal.

The probability of selecting each element is simply its weight divided by the total weight.

| value | proba | decimal   |
| ----- | ----- | --------- |
| v1    | 1.0/7 | 0.1428... |
| v2    | 1.5/7 | 0.2142... |
| v3    | 1.5/7 | 0.2142... |
| v4    | 2.0/7 | 0.2857... |
| v5    | 1.0/7 | 0.1428... |

Using the algorithm of @kaya3, if we draw a random index, then the probability of drawing each value is 1/size (1/5 here).

The chance of being rejected is 50% for v1, 25% for v2, and 0% for v4. So in the first round, the probabilities of being selected are:

| value | proba  | decimal |
| ----- | ------ | ------- |
| v1    |  2/20  | 0.10    |
| v2    |  3/20  | 0.15    |
| v3    |  3/20  | 0.15    |
| v4    |  4/20  | 0.20    |
| v5    |  2/20  | 0.10    |
| total | 14/20  | (70%)   |

Then the probability of needing a 2nd round is 30%, and the probability of drawing each index in that round is 6/20/5 = 3/50.

| value | proba 2 rounds | decimal |
| ----- | -------------- | ------- |
| v1    |  2/20 +  6/200 | 0.130   |
| v2    |  3/20 +  9/200 | 0.195   |
| v3    |  3/20 +  9/200 | 0.195   |
| v4    |  4/20 + 12/200 | 0.260   |
| v5    |  2/20 +  6/200 | 0.130   |
| total | 14/20 + 42/200 | (91%)   |

The probability of needing a 3rd round is 9%, that is 9/500 for each index.

| value | proba 3 rounds            | decimal |
| ----- | ------------------------- | ------- |
| v1    |  2/20 +  6/200 +  18/2000 | 0.1390  |
| v2    |  3/20 +  9/200 +  27/2000 | 0.2085  |
| v3    |  3/20 +  9/200 +  27/2000 | 0.2085  |
| v4    |  4/20 + 12/200 +  36/2000 | 0.2780  |
| v5    |  2/20 +  6/200 +  18/2000 | 0.1390  |
| total | 14/20 + 42/200 + 126/2000 | (97.3%) |

So we see that the series converges to the correct probabilities. The numerators are multiples of the weights, so it's clear that the relative weight of each element is respected.
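If you want to double-check this numerically, here is a small simulation sketch (not part of the original example; the names are arbitrary) that reproduces the limiting probabilities 1/7, 1.5/7, 1.5/7, 2/7 and 1/7:

```python
import random

values = ["v1", "v2", "v3", "v4", "v5"]
weights = [1.0, 1.5, 1.5, 2.0, 1.0]
total = sum(weights)  # 7.0

counts = dict.fromkeys(values, 0)
trials = 200_000
for _ in range(trials):
    # Repeat the draw-and-maybe-reject rounds until some value is accepted.
    while True:
        i = random.randrange(len(values))
        if 2.0 * random.random() < weights[i]:  # accept with probability weight / 2
            counts[values[i]] += 1
            break

for v, w in zip(values, weights):
    print(v, "observed:", round(counts[v] / trials, 4), "expected:", round(w / total, 4))
```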

aka.nice
  • 9,100
  • 1
  • 28
  • 40
  • to be convinced about the bounds, you must redo an example with symbolic values in [a,b] instead of numerical values. The chance of being selected for a weight w in [a,b] is w/b. At each round, you get `p=proba_for_this_round * 1/(size) * w/b` to select some element of weight w... – aka.nice Mar 08 '21 at 21:54
  • kaya3's answer explained **THAT** it works, and the above formulation should answer **HOW** it works, so I think you have all the elements. You can take [2,4]; it will be the same as [1,2]. What counts is the ratio between the lowest and highest weight. So if you have weights [0.5,2], you will get more rounds on average. – aka.nice Mar 08 '21 at 22:00
  • As an exercise, you should take 99 values with 0.01% proba, and one with 99.01% proba, and check the probability of getting a 2nd round... – aka.nice Mar 08 '21 at 22:17
  • I'm looking at this in more detail now. Is there a name for this algorithm? I want to read up on some background on how someone came up with this. – user5965026 Mar 08 '21 at 22:18
  • Yes, Las Vegas, click the link in the answer of kaya3, it's all there. – aka.nice Mar 08 '21 at 22:20
  • Ah I saw that earlier, but when I went on wikipedia, I misread it as some family of randomized algorithms rather than this specific algorithm. – user5965026 Mar 08 '21 at 22:21
  • There is a very simple reason it cannot depend on n: the only part where n occurs in the algorithm is when generating a random index, which takes O(1) time, and nothing else in the algorithm depends on n. Conceptually, you are just flipping coins until you get heads, and it doesn't affect the process whether you can choose from 100 coins or 1,000,000. – kaya3 Mar 09 '21 at 03:37
  • As for the name of the algorithm, I don't know if it a standard one, but it is a kind of rejection sampling. – kaya3 Mar 09 '21 at 03:39
1

This is a sketch of an answer.

With weights only 1, we can maintain a random permutation of the inputs. Each time an element is inserted, put it at the end of the array, then pick a random position i in the array, and swap the last element with the element at position i. (It may well be a no-op if the random position turns out to be the last one.) When deleting, just delete the last element.

Assuming we can use a dynamic array with O(1) (worst case or amortized) insertion and deletion, this does both insertion and deletion in O(1).


With weights 1 and 2, a similar structure may be used. Each element of weight 2 would be put in twice instead of once, and when an element of weight 2 is deleted, its other copy should also be deleted. So we should in fact store indices instead of the elements, along with another array, locations, which stores and tracks the index or indices of each element. The swaps should keep this locations array up-to-date.

Deleting an arbitrary element can be done in O(1) similarly to inserting: swap with the last one, delete the last one.
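As an illustration only (hypothetical names, weights restricted to the integers 1 and 2; a sketch of the idea above, not a definitive implementation), this might look roughly like the following in Python:

```python
import random

class IntWeightBag:
    """Sketch: an element of weight w (1 or 2) occupies w slots, so a uniform
    slot pick selects elements with probability proportional to their weight."""

    def __init__(self):
        self.slots = []      # weight-2 elements appear twice in this array
        self.locations = {}  # element -> list of its slot indices

    def insert(self, element, weight):
        # Append one slot per unit of weight: O(1) per slot.
        self.locations[element] = []
        for _ in range(weight):
            self.locations[element].append(len(self.slots))
            self.slots.append(element)

    def _remove_slot(self, i):
        # Swap slot i with the last slot, fix the moved element's recorded index, pop.
        last = self.slots[-1]
        self.slots[i] = last
        locs = self.locations[last]
        locs[locs.index(len(self.slots) - 1)] = i
        self.slots.pop()

    def pop_random(self):
        # A uniform slot pick is a weighted pick over elements.
        element = self.slots[random.randrange(len(self.slots))]
        # Remove every copy of the chosen element, highest index first so that
        # earlier removals do not invalidate the remaining indices.
        for j in sorted(self.locations[element], reverse=True):
            self._remove_slot(j)
        del self.locations[element]
        return element
```

Deleting a specific (non-random) element works the same way: look up its slot or slots in `locations` and remove them with the same swap-with-last trick.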

Gassa
  • 8,546
  • 3
  • 29
  • 49
  • The weights can be floating point values. I think this algorithm only works if the weights are constrained to 1 and 2? Though on second thought, I am interested in this interpretation. – user5965026 Mar 09 '21 at 05:49
  • @user5965026 This answer addresses your following specific question: _Is there something special about the weight being constrained to [1,2]?_ This is what's special with small integer weights: we can use them as the number of occurrences in what would otherwise be a permutation. – Gassa Mar 09 '21 at 12:11
  • Ah yes. I upvoted your answer, as it was not originally clear to me that the weights could be floating point. Could you perhaps outline your algorithm in a bit more detail? Specifically, it appears that you're suggesting we should have 2 arrays, but are you suggesting both arrays should store indices instead of elements? I was envisioning maybe an array and a hash table. The key to the hash table is the element and the values are the indices. The array contains all the elements (two entries for each element with a weight of 2 and 1 entry for each element with a weight of 1)....cont.. – user5965026 Mar 09 '21 at 13:42
  • Then you randomly sample from the array of elements. Say you sampled `arr[i]`; you would then take this element and remove it from the hash table, but prior to removing it, you find the second index for this element, if any. Then you need to remove this element and possibly its duplicate counterpart. You can swap them with the last 2 positions in the array. – user5965026 Mar 09 '21 at 13:44