Storing a bucket of numbers in an efficient data structure

Question

I have a buckets of numbers e.g. - 1 to 4, 5 to 15, 16 to 21, 22 to 34,.... I have roughly 600,000 such buckets. The range of numbers that fall in each of the bucket varies. I need to store these buckets in a suitable data structure so that the lookups for a number is as fast as possible.

So my question is what is the suitable data structure and a sorting mechanism for this type of problem.

Thanks in advance

What is in each bucket? The numbers themselves? Counts (like in a histogram?) or do you only care about the ranges? — Jared Updike, Apr 09 '10 at 22:06
Each bucket is associated with a certain string which I need fetch after figuring out which bucket a given number false into. — BlitzKrieg, Apr 09 '10 at 22:13
are the buckets contiguous and disjoint, as in your example? — Federico A. Ramponi, Apr 09 '10 at 22:18

Federico A. Ramponi · Accepted Answer · 2010-04-10T21:27:39.690

If the buckets are contiguous and disjoint, as in your example, you need to store in a vector just the left bound of each bucket (i.e. 1, 5, 16, 22) plus, as the last element, the first number that doesn't fall in any bucket (35). (I assume, of course, that you are talking about integer numbers.)

Keep the vector sorted. You can search the bucket in O(log n), with kind-of-binary search. To search which bucket does a number x belong to, just go for the only index i such that vector[i] <= x < vector[i+1]. If x is strictly less than vector[0], or if it is greater than or equal to the last element of vector, then no bucket contains it.

EDIT. Here is what I mean:

#include <stdio.h>

// ~ Binary search. Should be O(log n)
int findBucket(int aNumber, int *leftBounds, int left, int right)
{
    int middle;

    if(aNumber < leftBounds[left] || leftBounds[right] <= aNumber) // cannot find
        return -1;
    if(left + 1 == right) // found
        return left;

    middle = left + (right - left)/2;

    if( leftBounds[left] <= aNumber && aNumber < leftBounds[middle] )
        return findBucket(aNumber, leftBounds, left, middle);
    else
        return findBucket(aNumber, leftBounds, middle, right);
}


#define NBUCKETS 12
int main(void)
{
    int leftBounds[NBUCKETS+1] = {1, 4, 7, 15, 32, 36, 44, 55, 67, 68, 79, 99, 101};
    // The buckets are 1-3, 4-6, 7-14, 15-31, ...

    int aNumber;
    for(aNumber = -3; aNumber < 103; aNumber++)
    {
        int index = findBucket(aNumber, leftBounds, 0, NBUCKETS);
        if(index < 0)
            printf("%d: Bucket not found\n", aNumber);
        else
            printf("%d belongs to the bucket %d-%d\n", aNumber, leftBounds[index], leftBounds[index+1]-1);
    }   
    return 0;
}

I think the time to search a key will be much more on a list-based solution than a tree-based based solution — BlitzKrieg, Apr 09 '10 at 22:44
I didn't say 'list'; I said 'vector' (or, an array). If you mean linked list, I agree :) Put the left bounds in any data structure that keeps them ordered and lets you search in O(log n)... — Federico A. Ramponi, Apr 09 '10 at 23:16
@BlitzKrieg The average height of a balanced binary search tree is O(log n). Lookups, therefore, are O(log n). The same O(log n) that lookups in a sorted array of buckets would be. The difference in speed between the two will have to do with memory usage and memory access patterns. On both those counts, the sorted array wins: it has no memory usage overhead (a balanced binary tree has at least two pointers of overhead, usually a little more, e.g., for a red/black tag) and its memory locality, at least toward the end, will be better. — jemfinch, Apr 10 '10 at 06:06
@Federico: I think you mean "Should be O(log n)" in your code. Also, you really shouldn't have to write your own binary search code; it's unfortunate that C's `bsearch` function returns NULL on lookup failure, rather than a pointer to the greatest element less than the key (which C++'s `std::binary_search` returns). — jemfinch, Apr 10 '10 at 06:13

score 2 · Answer 2 · answered Apr 09 '10 at 22:08

2

You will probably want some kind of sorted tree, like a B-Tree, B+ Tree, or Binary Search tree.

answered Apr 09 '10 at 22:08

Jared Updike

7,165
8
46
72

score 2 · Answer 3 · answered Apr 09 '10 at 22:12

If I understand you correctly, you have a list of buckets and you want, given an arbitrary integer, to find out which bucket it goes in.

Assuming that none of the bucket ranges overlap, I think you could implement this in a binary search tree. That would make the lookup possible in O(logn) (whenere n=number of buckets).

It would be simple to do this, just define the left branch to be less than the low end of the bucket, the right branch to be greater than the right end. So in your example we'd end up with a tree something like:

    16-21
    /    \
  5-15  22-34
  /
1-4

To search for, say, 7, you just check the root. Less than 16? Yes, go left. Less than 5? No. Greater than 15? No, you're done.

You just have to be careful to balance your tree (or use a self balancing tree) in order to keep your worst-case performance down. this is really important if your input (the bucket list) is already sorted.

score 1 · Answer 4 · answered Apr 09 '10 at 23:42

+1 to the kind-of binary search idea. It's simple and gives good performance for 600000 buckets. That being said, if it's not good enough, you could create an array with MAX BUCKET VALUE - MIN BUCKET VALUE = RANGE elements, and have each element in this array reference the appropriate bucket. Then, you get a lookup in guaranteed constant [O(1)] time, at the cost of using a huge amount of memory.

If A) the probability of accessing buckets is not uniform and B) you knew / could figure out how likely a given set of buckets were to be accessed, you could probably combine these two approaches to create a kind of cache. For example, say bucket {0, 3} were accessed all the time, as was {7, 13}, then you can create an array CACHE. . .

int cache_low_value = 0;

int cache_hi_value = 13;

CACHE[0] = BUCKET_1

CACHE[1] = BUCKET_1

...

CACHE[6] = BUCKET_2

CACHE[7] = BUCKET_3

CACHE[8] = BUCKET_3

...

CACHE[13] = BUCKET_3

. . . which will allow you to find a bucket in O(1) time assuming the value you're trying to associate a value with a bucket is between cache_low_value and cache_hi_value (if Y <= cache_hi_value && Y >= cache_low_value; then BUCKET = CACHE[Y]). On the up side, this approach wouldn't use all the memory on your machine; on the downside, it'd add the equivalent of an additional operation or two to your bsearch in the case you can't find your number / bucket pair in the cache (since you had to check the cache in the first place).

score 0 · Answer 5 · 2013-01-17T23:52:55.867

Let me see if I can restate your requirement. It's analogous to having, say, the day of the year, and wanting to know which month a given day falls in? So, given a year with 600,000 days(an interesting planet), you want to return a string that is either "Jan","Feb","Mar"... "Dec"?

Let me focus on the retrieval end first, and I think you can figure out how to arrange the data when initializing the data structures, given what has already been posted above.

Create a data structure...

typedef struct {
  int DayOfYear    :20; // an bit-int donating some bits for other uses
  int MonthSS      :4;  // subscript to select months
  int Unused       :8;  // can be used to make MonthSS 12 bits
} BUCKET_LIST;

  char MonthStr[12] = "Jan","Feb","Mar"... "Dec";
.

To initialize, use a for{} loop to set BUCKET_LIST.MonthSS to one of the 12 months in MonthStr.

On retrieval, do a binary search on a vector of BUCKET_LIST.DayOfYear (you'll need to write a trivial compare function for BUCKET_LIST.DayOfYear). Your result can be obtained by using the return from bsearch() as the subscript into MonthStr...

pBucket = (BUCKET_LIST *)bsearch( v_bucket_list);
MonthString = MonthStr[pBucket->MonthSS];

The general approach here is to have collections of "pointers" to the strings attached to the 600,000 entries. All of the pointers in a bucket point to the same string. I used a bit int as a subscript here, instead of 600k 4 byte pointers, because it takes less memory (4 bits vs 4 bytes), and BUCKET_LIST sorts and searches as a species of int.

Using this scheme you'll use no more memory or storage than storing a simple int key, get the same performance as a simple int key, and do away with all the range checking on retrieval. IE: no if{ } testing. Save those if{ }s for initializing the BUCKET_LIST data structure, and then forget about them on retrieval.

I refer to this technique as subscript aliasing, as it resolves a many-to-one relationship by converting the subscript of the many to the subscript of the one - very efficiently I might add.

My application was to use an array of many UCHARs to index a much smaller array of double floats. The size reduction was enough to keep all of the hot-spot's data in L1 cache on the processor. 3X performance gain just from this one little change.

score 0 · Answer 6 · answered Apr 09 '10 at 22:26

A simple way to store and sort these in C++ is to use a pair of sorted arrays that represent the lower and upper bounds on each bucket. Then, you can use int bucket_index= std::distance(lower_bounds.begin(), std::lower_bound(lower_bounds, value)) to find the bucket that the value will match with, and if (upper_bounds[bucket_index]>=value), bucket_index is the bucket you want.

You can replace that with a single struct holding the bucket, but the principle will be the same.

Storing a bucket of numbers in an efficient data structure

6 Answers6