There is an arbitrary number of distinct unsigned integer values within a known range.

The number of stored values is much smaller than (<<) the number of integers within the range.

I want to build a data structure which allows the following runtime complexities:

  1. Insertion in O(1)
  2. After insertion is done:
    • Deletion in O(1)
    • Get all values within a query range in O(k) with k being the number of result values (returned values do not have to be sorted)

Memory complexity is not restricted. However, an astronomically large amount of memory is not available ;-)

Here is an example:

  • range = [0, 1023]
  • insert 42
  • insert 350
  • insert 729
  • insert 64
  • insert 1
  • insert 680
  • insert 258
  • find values in [300;800] ; returns {350, 729, 680}
  • delete 350
  • delete 680
  • find values in [35;1000] ; returns {42, 64, 729, 258}
  • delete 42
  • delete 258
  • find values in [0; 5] ; returns {1}
  • delete 1

Is such a data structure even possible (with the aid of look-up tables etc.)?

An approximation I thought about would be:

  • Bin the inserted values into buckets. 0..31 => bucket 0, 32..63 => bucket 1, 64..95 => bucket 2, 96..127 => bucket 3, ...

  • Insertion: find bucket id using simple shifting arithmetic, then insert it into an array per bucket

  • Find: find bucket id of start and endpoint using shifting arithmetic. Look through all values in the first and last bucket and check if they are within the range or outside the range. Add all values in all intermediate buckets to the search result

  • Delete: find bucket id using shifting. Swap value to delete with last value in bucket, then decrement count for this bucket.

Downside: if many queries span fewer than 32 values, the whole bucket is still searched every time.

Downside 2: if there are empty buckets within the range, they will also be visited during the search phase.

Etan
  • theoretically, if your range [let it be `r`] is bounded, and your number of elements [let it be `n`] is smaller than r [`n << r`], then for each monotonically increasing function `f`: `O(f(n)) <= O(f(r)) = O(1)` – amit Dec 08 '11 at 15:50
  • Your bucket solution's range query is not O(k) but O(number of buckets between query (p,q)). If the range is 0-1million and only two values (0,1000000) are inserted, a range query of (0,100000) will not be O(k) – parapura rajkumar Dec 08 '11 at 15:57
  • amit: yep, but O(1) does not express that it is fast in your case, since the constant is too high ;-) but yes, your comment is correct. – Etan Dec 08 '11 at 15:59
  • parapura rajkumar: that's why I described my solution as "an approximation". *editing and bolding it*. I'm looking for a better solution through this website – Etan Dec 08 '11 at 15:59

2 Answers

Theoretically speaking, a van Emde Boas tree is your best bet, with O(log log M)-time operations where M is the size of the range. The space usage is quite large, though there are more efficient variants.

Actually the theoretical state of the art is described in the paper On Range Reporting in One Dimension, by Mortensen, Pagh, and Patrascu.

I'm not sure if the existing lower bounds rule out O(1), but M won't be large enough to make the distinction matter. Instead of the vEB structure, I would just use a k-ary trie with k a power of two like 32 or 64.

EDIT: here's one way to do range search with a trie.

Let's assume each datum is a bit pattern (easy enough; that's how the CPU thinks of it). Each subtree consists of all of the nodes with a certain prefix. For example, {0000, 0011, 0101, 1001} is represented by the following 4-ary trie, where X denotes a null pointer.

+---+---+---+---+
|00\|01\|10\|11X|
+--|+--|+--|+---+
   |   |   |
   |   |   +----------------------------+
+--+   |                                |
|      +------------+                   |
|                   |                   |
v                   v                   v
+---+---+---+---+   +---+---+---+---+   +---+---+---+---+
|00\|01X|10X|11\|   |00X|01\|10X|11X|   |00X|01\|10X|11X|
+--|+---+---+--|+   +---+--|+---+---+   +---+--|+---+---+
   |           |           |                   |
   v           v           v                   v
  0000        0011        0101                1001

A couple of optimizations quickly become apparent. First, if all of the bit patterns are the same length, then we don't need to store them at the leaves; they can be reconstructed from the descent path. All we need is the bitmap, which, if k is the number of bits in a machine word, fits nicely where the pointer from the previous level used to be.

+--------+--------+--------+--------+
|00(1001)|01(0100)|10(0100)|11(0000)|
+--------+--------+--------+--------+

In order to search the trie for a range like [0001, 1000], we start at the root, determine which subtrees might intersect the range and recurse on them. In this example, the relevant children of the root are 00, 01, and 10. The relevant children of 00 are the subtrees representing the prefixes 0001, 0010, and 0011.

For k fixed, reporting from a k-ary trie is O(log M + s), where M is the number of possible bit patterns and s is the number of hits. Don't be fooled by the log M term, though: when k is medium, each node occupies a couple of cache lines but the trie isn't very high, so the number of cache misses is pretty small.

Per
  • Memory requirement of the vEB tree should be ok. What's the advantage of the "van Emde Boas tree" over a "y-trie"? Could you explain the idea of the 32-ary trie a bit more please? – Etan Dec 08 '11 at 18:26

You could achieve your target (O(1),O(1) and O(k)) if the query operation required that it be told the value of at least one existing member that is already in the relevant range (the lower bound perhaps). Can you provide a guarantee that you will already know at least one member of the range? I guess not. I will expand if you can.

I'll now focus on the problem as specified. Each number in the data structure should form part of a linked list, such that each number knows the next highest number that is in the data structure. In C++:

struct Number {
    struct Number *next_highest;
    int value;
};

Obviously, the largest value in the set will have next_highest==NULL, but otherwise this->value < this->next_highest->value.

To add or remove or query, we need to be able to find the existing Numbers which are close to a particular lookup value.

std::set<Number *, specialized_comparator_to_compare_on_value_t>

Insertion and deletion would be O(log(N)), and query would be O(log(N)+k). N is the number of values currently in the set, which as you say will be much less than M (the number of possible values of the relevant datatype). Therefore log(N) < log(M). But in practice, other methods should also be considered, such as tries and similar data structures.

Aaron McDaid
  • a member inside the query range is not known, only that the query will be over the same overall range of integers as the stored integers. What you describe is essentially a binary search tree. – Etan Dec 09 '11 at 00:29