I'm trying to design a data structure that supports random pop and insert operations. An element is popped at random with probability proportional to its weight. For example, if the data structure has elements "a" and "b" with weights 10 and 20, then "b" is twice as likely to be popped as "a". Here n is the number of elements. The weights can be floating point or integers and are >= 0.

I am thinking that a segment tree or binary indexed tree may be able to achieve both operations in O(log n) time, but I'm not certain. Anyone have any better ideas?

user5965026
  • Can you make this question more precise? That is, **1.** specify what n is (I assume the number of elements in the data structure), **2.** specify the restrictions on the weights, e.g. *"only natural numbers < m"* or *"only weights from the predefined set {1, 2, 5}"*. The more restrictions, the more efficient a solution might be. **3.** Just to make sure: This is not something like a priority queue, but more something like a set? E.g. when you have (a,1), (b,2), (c,2) then the first pop operation may return c with a probability of 2/(1+2+2). Is this correct? – Socowi Mar 08 '21 at 17:39
  • @Socowi I just added more details. For your 3rd question, yes, that's correct. The elements should be popped with a probability of their weight divided by the total weight of the elements. – user5965026 Mar 08 '21 at 17:42
  • I'd suggest avoiding the terms "push" and "pop" in your question, they're strongly affiliated with the concept of a stack. Nomenclature is important for effective communication. – pjs Mar 08 '21 at 18:22

1 Answer

A variant of an order statistic tree should be able to do this: use a self-balancing binary search tree where each node also stores the total weight of its subtree (where the standard version would store the subtree's cardinality).

Insertion and removal already take O(log n) time, and a weighted random sample can be taken in O(log n) time too: generate a random number t uniformly in the half-open range [0, W), where W is the total weight of the whole tree, and start at the root node. At each node, let l be the total weight of the left subtree (or zero if there is none) and c be the weight of the current node:

  • If t < l then recurse on the left subtree.
  • Otherwise if t - l < c then return the item in the current node.
  • Otherwise subtract l + c from t and recurse on the right subtree.
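
The three steps above can be sketched roughly as follows. This is only an illustration of the descent, not a full self-balancing tree: the names `Node`, `subtree_total`, `find`, and `sample` are made up for this sketch, the tree is built by hand, and insertion, removal, and rebalancing (which must keep each `total` up to date) are left out.

```python
import random
from dataclasses import dataclass
from typing import Optional


@dataclass
class Node:
    """Hypothetical tree node; `total` is the weight of this node plus both subtrees."""
    item: str
    weight: float
    left: Optional["Node"] = None
    right: Optional["Node"] = None
    total: float = 0.0

    def __post_init__(self) -> None:
        self.total = self.weight + subtree_total(self.left) + subtree_total(self.right)


def subtree_total(node: Optional[Node]) -> float:
    """Total weight of a subtree, or zero if there is none."""
    return node.total if node is not None else 0.0


def find(root: Node, t: float) -> Node:
    """Return the node whose weight interval contains t, for 0 <= t < root.total."""
    node = root
    while node is not None:
        l = subtree_total(node.left)
        if t < l:                      # t falls inside the left subtree
            node = node.left
        elif t - l < node.weight:      # t falls inside the current node
            return node
        else:                          # skip left subtree and current node
            t -= l + node.weight
            node = node.right
    raise ValueError("t must lie in [0, total weight)")


def sample(root: Node) -> str:
    """Weighted random sample. Assumes root.total > 0; float rounding could in
    rare cases push t up to the total, which a production version should clamp."""
    return find(root, random.random() * root.total).item


# Example tree: "a" (10) on the left, "b" (20) at the root, "c" (5) on the right.
root = Node("b", 20, Node("a", 10), Node("c", 5))
```

With this example tree, `find(root, 12).item` is `"b"`, because the in-order intervals are a: [0, 10), b: [10, 30), c: [30, 35). A node with weight zero gets an empty interval and is never returned.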
kaya3
  • I think this is what I'm envisioning as well. Just to clarify, when you say "each node also stores the total weight of its subtree" the total weight of the subtree also includes the contribution from the current node right? – user5965026 Mar 08 '21 at 17:47
  • Yup, this is definitely O(log n) one for both operations on average, though we need to use a self-balancing tree if we're seeking O(log n) worst case. If all the weights are bounded within [1,2], would there be a different approach that could get us constant time operations? – user5965026 Mar 08 '21 at 17:58
  • Yes, it needs to be a self-balancing BST, I've edited to add that. In the case where the weights are in [1,2], then it should be possible to get O(1) expected time by taking a random sample and then flipping a biased coin to decide whether to keep it or take another one; the probability of keeping a sample should be equal to the item's weight divided by 2. Since the weights are bounded below by a positive number (1 in this case), the expected number of rejections is a constant. I think this is a sufficiently different problem and solution that you should post it as a separate question. – kaya3 Mar 08 '21 at 18:08
  • Hmm yes I understand that approach but that seems to assume the weights can only be `1` or `2`. When it says "bounded in [1,2]" doesn't it mean it could be any real value in that interval, e.g., 1.01, 1.4, 1.555555, etc...? – user5965026 Mar 08 '21 at 18:16
  • It should work (and give the correctly-weighted distribution) for arbitrary real-numbered weights in an interval [a,b], where 0 < a < b; the expected number of rejections would be at most b/a. – kaya3 Mar 08 '21 at 18:19
  • I'll post this as a new question and link you in a bit. It does seem too different from the current question now. – user5965026 Mar 08 '21 at 18:20
  • https://stackoverflow.com/questions/66535739/data-structure-to-achieve-random-delete-and-insert-where-elements-are-weighted-i – user5965026 Mar 08 '21 at 19:10
  • Coming back to this. It doesn't necessarily need to be a self-balancing binary "search" tree, right? Because a BST has a specific definition. It just needs to be a self-balancing binary tree? The reason why I'm nitpicking is, I think this actually makes the self-balancing procedure a bit simpler than for a traditional self-balancing BST. – user5965026 Mar 12 '21 at 17:48
  • Yes, in principle if there are no other operations you want to use on the tree, then there's no need to maintain the BST property. – kaya3 Mar 12 '21 at 17:52
  • Yeah, but I think the balancing procedure is still not trivial even if it's not specifically a BST – user5965026 Mar 12 '21 at 18:20
  • I don't think there should be any significant runtime cost for maintaining the BST property. As for code complexity, I don't think that will be significant either. But if you think it will be simpler then the solution will work without it being a BST, that's all. – kaya3 Mar 12 '21 at 18:22
  • @kaya3 wouldn't it be simpler to just use a complete tree (the kind you would use for binary heaps)? It wouldn't require the overhead of tree-balancing (whichever method is chosen) and you can store the nodes, in this case pairs of value and sum of subtree weights, in an array as well. Does that sound correct to you? – aripy887 Apr 13 '22 at 20:26
  • @aripy887 It sounds plausible; you avoid the cost of balancing, there is still an O(log n) cost of updating subtree weights on insertion and removal, but in practice it should be more efficient because it uses contiguous memory. Perhaps you should write it as an answer. I think I was going for "is there a solution?" rather than "what's the best solution?" when I wrote my answer. – kaya3 Apr 13 '22 at 22:05
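
The complete-tree idea from the last two comments could look something like the sketch below. It is an assumption-laden illustration rather than anyone's posted answer: the class name `WeightedBag` and its methods are invented here, nodes live in parallel Python lists laid out like a binary heap (children of index i at 2i+1 and 2i+2), and removal fills the hole with the last node so the tree stays complete without any rebalancing.

```python
import random


class WeightedBag:
    """Hypothetical array-backed complete binary tree with subtree weight totals."""

    def __init__(self):
        self.items = []
        self.weights = []
        self.totals = []  # totals[i] = weight of node i plus all of its descendants

    def _add_up(self, i, delta):
        """Add delta to totals[i] and to every ancestor of i: O(log n)."""
        while True:
            self.totals[i] += delta
            if i == 0:
                return
            i = (i - 1) // 2

    def insert(self, item, weight):
        """Append a new last leaf and update the totals on its path to the root."""
        self.items.append(item)
        self.weights.append(weight)
        self.totals.append(0)
        self._add_up(len(self.items) - 1, weight)

    def _find(self, t):
        """Index of the node whose interval contains t, for 0 <= t < totals[0]."""
        i, n = 0, len(self.items)
        fallback = None  # last positive-weight node seen, used if rounding overshoots
        while i < n:
            left = 2 * i + 1
            l = self.totals[left] if left < n else 0
            if t < l:
                i = left
            elif t - l < self.weights[i]:
                return i
            else:
                if self.weights[i] > 0:
                    fallback = i
                t -= l + self.weights[i]
                i = left + 1
        if fallback is None:
            raise ValueError("t is outside the total weight")
        return fallback

    def _remove_at(self, i):
        """Remove node i by moving the last node into its slot: O(log n)."""
        item = self.items[i]
        last = len(self.items) - 1
        last_item, last_weight = self.items[last], self.weights[last]
        self._add_up(last, -last_weight)  # detach the last leaf from the totals
        self.items.pop()
        self.weights.pop()
        self.totals.pop()
        if i != last:
            delta = last_weight - self.weights[i]
            self.items[i], self.weights[i] = last_item, last_weight
            self._add_up(i, delta)
        return item

    def pop_random(self):
        """Remove and return an item with probability weight / total weight."""
        if not self.items or self.totals[0] <= 0:
            raise IndexError("no item with positive weight")
        return self._remove_at(self._find(random.random() * self.totals[0]))
```

Both operations touch one root-to-leaf path, so they stay O(log n) as kaya3's reply suggests. With floating-point weights the stored totals can drift slightly after many updates; the `fallback` index is one simple way to cope with that, and periodically recomputing the totals would be another.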