
I have two buckets (unordered, one-dimensional data structures) of numbers and I want to calculate the minimal distance between any elements of the two buckets. Is there a way to find the shortest distance between any number from different buckets in O(1)? What is my best bet?

Input
[B1] 1, 5, 2, 347, 50
[B2] 21, 17, 345

Output
2 // abs(347 - 345)

Edits

  • I expect to have more lookups than inserts
  • Distance between smallest and largest elements in any bucket is less than 10^5
  • Number of elements in any bucket is less than 10^5
  • Numbers in buckets are "nearly" sorted - these are timestamps of events. There's probably less than 1% of elements in the buckets that are out of order
  • The number of elements in the buckets is small, but I need to look up at an average rate of 2k/sec, and periodically drop stale buckets and replace them with new buckets, hence I want my lookups to be O(1)

See why I need this and what I have thought of in the previous revision of this question.

oleksii
  • O(1)? Of course not! – YSC Nov 29 '16 at 16:29
  • Those elements have to arrive in the buckets somehow, and you could get O(log n) query time if you add all elements in each bucket to a single balanced tree at the same time as you add them to the buckets (though this will increase bucket insertion cost to O(log n) if it isn't already): Just look up the next-smaller and next-larger elements from the other bucket as you insert them, and maintain the smallest-so-far. – j_random_hacker Nov 29 '16 at 16:34
  • O(1) sure - but it'll punish your insertion performance – UKMonkey Nov 29 '16 at 16:35
  • sort each bucket, then kindof mergesort them keeping track of the minimal distance along the way: `O(n+n/2.ln(n/2)) = O(n.ln(n))`. – YSC Nov 29 '16 at 16:36
  • How big are your values? Specifically, how big is (largest_value - smallest_value) / window_size? If that's less than, say, 10000000, just create an array that size, with 2 bits per window_size-block of time. Then on each insert of x into a bucket, let y = (x - smallest_value) / window_size, and update array[y] and array[y+1] to reflect that an element from this bucket was added. Any pair of different-bucket items that overlap by window_size or less *must* hit the same element of array[]. – j_random_hacker Nov 29 '16 at 16:39
  • @j_random_hacker: What if the other bucket isn't known at the time elements are inserted? What if the operation is done pairwise among a large number of buckets? What you are describing is an operation on one set with an internal partition, not an operation between two sets. – Ben Voigt Nov 29 '16 at 16:46
  • Keep them in one container but tag each element with a bucket identifier. Then sort and do a running difference/minimum but only when the identifier changes. The sorting brings you up to O(n logn). The tagging can be done different ways: create pairs, polymorphism. – dpmcmlxxvi Nov 29 '16 at 16:54
  • @BenVoigt: From the OP's description, there are 2 buckets, so I don't understand how one of them could be "unknown". If this is done pairwise between a large number of buckets, my second comment's suggestion doesn't worsen the time complexity *per bucket pair*. – j_random_hacker Nov 29 '16 at 17:05
  • Thank you for your additional information. Would it be feasible to make sure the buckets actually are sorted in advance? – cdonat Nov 30 '16 at 04:51
  • @cdonat I cannot control the order of the events, as those are based on interrupts and are coming from two different sources. I can use a data structure that maintains an order for my buckets, though. – oleksii Nov 30 '16 at 09:46
  • @oleksii yes, judging from your requirements I think, that data structure can simply be a vector. Just make sure, you insert sorted. Most of the time that will simply be appending, sometimes you'll have to move one or two elements backwards. – cdonat Nov 30 '16 at 10:24

7 Answers


Here is my attempt: sort each bucket, then merge them as in mergesort, keeping track of the minimal cross-bucket distance along the way: O(n + 2·(n/2)·log(n/2)) = O(n log n):

sort buk1 ascending
sort buk2 ascending
min = INT_MAX
// top() = smallest remaining element; the two tops always come from
// different buckets, so their difference is always a valid candidate
while !empty(buk1) and !empty(buk2)
    min = min(min, abs(top(buk1) - top(buk2)))
    if top(buk1) < top(buk2)
        pop(buk1)   // advance the smaller front
    else
        pop(buk2)
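
For concreteness, a C++ sketch of this merge, assuming the buckets are std::vector<int> (the function name is illustrative):

#include <algorithm>
#include <climits>
#include <cstdlib>
#include <vector>

// Minimal distance between any element of b1 and any element of b2.
// Sort both buckets, then walk them in merge order; the current fronts
// always come from different buckets, so each step yields a candidate.
int minDistance(std::vector<int> b1, std::vector<int> b2) {
    std::sort(b1.begin(), b1.end());
    std::sort(b2.begin(), b2.end());
    int best = INT_MAX;
    std::size_t i = 0, j = 0;
    while (i < b1.size() && j < b2.size()) {
        best = std::min(best, std::abs(b1[i] - b2[j]));
        if (b1[i] < b2[j]) ++i; else ++j;  // advance the smaller front
    }
    return best;
}

With the example input, minDistance({1, 5, 2, 347, 50}, {21, 17, 345}) returns 2.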
YSC

Let there be n numbers in total.
1. Write all numbers in binary. ==> O(n)
2. Append a tag bit to each number: 0 if from B1, 1 if from B2. ==> O(n)
3. Quicksort them, ignoring the tag bit. ==> O(n log n) on average
4. Iterate through the sorted order. For every two adjacent numbers u and v, if both came from B1 or both from B2, ignore the pair.
Otherwise, set tmp <-- abs(u-v) whenever tmp > abs(u-v), so tmp is the minimal cross-bucket distance among adjacent pairs so far.
The final tmp is the answer. ==> O(n)

In total: ==> O(n log n) on average
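
For concreteness, a C++ sketch of this tag-and-sort scan, assuming int values (the function name is illustrative):

#include <algorithm>
#include <climits>
#include <utility>
#include <vector>

// Tag each value with its bucket of origin, sort by value, then scan
// adjacent pairs and only consider those from different buckets.
int minDistanceTagged(const std::vector<int>& b1, const std::vector<int>& b2) {
    std::vector<std::pair<int, int>> tagged;  // (value, bucket tag)
    for (int x : b1) tagged.emplace_back(x, 0);
    for (int x : b2) tagged.emplace_back(x, 1);
    std::sort(tagged.begin(), tagged.end());  // orders by value first
    int best = INT_MAX;
    for (std::size_t i = 1; i < tagged.size(); ++i)
        if (tagged[i].second != tagged[i - 1].second)
            best = std::min(best, tagged[i].first - tagged[i - 1].first);
    return best;
}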

Violapterin

O(1) is of course not possible.

Some pseudo code that I'd use as a starting point:

sort(B1)
sort(B2)

i1 = 0
i2 = 0

mindist = MAX_INT

// when one of the buckets is empty, we'll simply return MAX_INT.
while(i1 < B1.size() && i2 < B2.size())
    t = B1[i1] - B2[i2]
    mindist = min(mindist, abs(t))
    if t > 0 
        i2 ++
    else
        i1 ++

return mindist

That is O(n log n), because it is dominated by the sorting at the beginning. If your buckets are already sorted, you get O(n).

Edit:

After the new information that the elements are almost sorted, I'd propose to actually sort them on insert. Insertion sort with its binary search is not the best fit for this situation. Just append the new element and swap it toward the front until it fits. Usually there will be no swaps, and for the ~1% of elements that need swaps, 99% of the time a single swap will do. The worst-case complexity is O(n), but the average is almost O(1).
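
A minimal sketch of that insert, assuming the bucket is a std::vector<int>:

#include <utility>
#include <vector>

// Append-and-swap insert that keeps the bucket sorted. For nearly
// sorted input (timestamps), the new element almost always belongs at
// the end, so the loop body rarely executes at all.
void insertSorted(std::vector<int>& bucket, int value) {
    bucket.push_back(value);
    for (std::size_t i = bucket.size() - 1;
         i > 0 && bucket[i - 1] > bucket[i]; --i)
        std::swap(bucket[i - 1], bucket[i]);
}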

If you consider precalculating mindist for all pairs of buckets, you'd have to store i1, i2, and mindist for each pair. Let's say B1 is the bucket where you append a new element. You sort it in and reduce i2 until it is either 0 or B2[i2] < B1[i1]. Since the elements are timestamps, that will be at most one step most of the time. Then you run the while loop again, which will usually take only a single step as well. So the computational complexity is O(k) for k buckets, and the memory complexity is O(k^2).

cdonat

Create a bitvector of 10^5 elements for each bucket. Keep track of the min distance (initially 10^5 until both buckets are nonempty).

Now, say you're adding an element x to one of the buckets. Do the following:

1. Set bit x in that bucket's bitvector.
2. Check whether the other bucket's bitvector has any set bits within min_distance - 1 of x.
3. Update min_distance as appropriate.

Running time: On inserting it's O(min_distance), which is technically O(1) since min_distance is capped. On polling it's O(1) since you're just returning min_distance.

Edit: If the elements aren't capped at 10^5 but just the distance between the min and the max is, this will need to be modified, but it will still work. I can detail the necessary changes if this matters.
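
A sketch of the insert path in C++, under the assumption that values are already offset into [0, 10^5):

#include <algorithm>
#include <bitset>
#include <cstdlib>

constexpr int RANGE = 100000;      // assumed value range [0, 10^5)

std::bitset<RANGE> bits[2];        // one bitvector per bucket
int min_distance = RANGE;          // best cross-bucket distance so far

// Insert value x into bucket b (0 or 1), then scan the other bucket's
// bits within the current best radius to tighten min_distance.
void insert(int b, int x) {
    bits[b].set(x);
    const std::bitset<RANGE>& other = bits[1 - b];
    const int lo = std::max(0, x - (min_distance - 1));
    const int hi = std::min(RANGE - 1, x + (min_distance - 1));
    for (int i = lo; i <= hi; ++i)
        if (other.test(i))
            min_distance = std::min(min_distance, std::abs(i - x));
}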

Dave
  • The number of elements is capped at ~ 10^5. Can you please explain how you would do number [2]? Is it `abs (bv1 - bv2)` of some sort? – oleksii Nov 29 '16 at 17:45
  • @oleksii Say you put element x in bucket_A, then: for(i = x - minsize + 1, i <= x + minsize -1, i++), check if bitvector_B[i] is set. If it is, you've found an element in B closer to x than your previous minimum size. There may be multiple in the range, so check all candidates and keep the smallest. – Dave Nov 29 '16 at 18:47
  • One efficiency you can use in the beginning in the above example: If |B| < 2*minsize, then check the elements of B instead of the bitvector. This saves you from checking large swathes of the bitvector early on while minsize is big and B is small. – Dave Nov 29 '16 at 22:33
  • You'd have to store that bitvector for every pair of buckets and keep it up to date. That increases the insert complexity to O(k), with k being the number of buckets. The memory complexity is of course O(l * k^2), with l being the size of the bitvectors. The alternative is to build the bitvectors on demand. Then you'll have to insert all elements of both buckets before you know min_distance. That is O(n) of course. – cdonat Nov 30 '16 at 05:33
  • @cdonat he has two buckets, so one pair. – Dave Nov 30 '16 at 13:04
  • @cdonat also, you store one bitvector per bucket, only the min value is per pair. – Dave Nov 30 '16 at 13:06
  • @DaveGalvin He never said that there are only two buckets in the whole program, just that this function should handle only two of them. So then you'll have to add the insert effort to the complexity, and that is O(n). – cdonat Nov 30 '16 at 13:07
  • @DaveGalvin OK, yes, you only need one bitvector per bucket. Then the memory complexity is O(l*k). – cdonat Nov 30 '16 at 13:09
  • @cdonat the size of the bitvectors is fixed because the size of the values is fixed at 10^5, so memory only grows with the number of buckets at O(k). – Dave Nov 30 '16 at 17:03
  • @DaveGalvin the values are not limited to 10^5, just the deltas are. The values are timestamps. In case of 32-bit timestamps, the bitmap is 512MB for every bucket. BTW, you still have memory requirements for every pair of buckets. Those grow at O(k^2). With only a few buckets, that is shadowed by the bitvectors, but it will eventually outgrow them. – cdonat Dec 01 '16 at 06:07

Insert your buckets into two Y-fast tries (https://en.wikipedia.org/wiki/Y-fast_trie). Searching for the nearest successor or predecessor is O(log log M), where M is the range (actually the max element, but we can offset), which in your case caps at around four operations.

Since you'll store the nearest difference, lookup would be O(1) (unless you get the full buckets each time rather than continually updating), while insertion, deletion and update per element would be O(log log M).
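
The standard library has no Y-fast trie, so here is the same bookkeeping sketched with std::set as a balanced-tree stand-in (O(log n) predecessor/successor queries instead of O(log log M)):

#include <algorithm>
#include <iterator>
#include <limits>
#include <set>

std::set<int> trees[2];                         // stand-ins for the two tries
int nearest = std::numeric_limits<int>::max();  // cached minimal difference

// Insert x into structure b (0 or 1) and update the cached difference
// against x's nearest successor and predecessor in the other structure.
void insert(int b, int x) {
    trees[b].insert(x);
    const std::set<int>& other = trees[1 - b];
    auto it = other.lower_bound(x);             // nearest successor of x
    if (it != other.end())
        nearest = std::min(nearest, *it - x);
    if (it != other.begin())                    // nearest predecessor of x
        nearest = std::min(nearest, x - *std::prev(it));
}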

גלעד ברקן

I like Dave Galvin's idea, slightly modified:

Let maxV be the maximum number of elements: maxV = max(bucket1.size, bucket2.size).

1. Build two arrays, each of size maxV. Fill them:

for (j = 0 to bucket1.size)
    array1(bucket1(j)) = bucket1(j)
for (j = 0 to bucket2.size)
    array2(bucket2(j)) = bucket2(j)

The arrays are now sorted. The remaining elements of the arrays are 0.

2. Now use two iterators, one for each array:

it1 = array1.begin
it2 = array2.begin
while (*it1 == 0)
    ++it1
while (*it2 == 0)
    ++it2
minDist = abs(*it1 - *it2)
while (it1 != array1.end && it2 != array2.end)
{   //advance until overpassing the other
    while (*it1 <= *it2 && it1 != array1.end)
        ++it1
        if (*it1 > 0)
            check minDist between *it1, *it2
    while (*it2 <= *it1 && it2 != array2.end)
        ++it2
        if (*it2 > 0)
            check minDist between *it1, *it2
    if (*it1 == *it2)
        //well, minDist = 0
        return now
}

Step 1 is O(n). Step 2 is also O(n). I don't know whether this is more efficient than sorting the buckets, for large or small buckets.

Ripi2
  • 1. `maxV` has to be the maximum of the values, not the sizes. Otherwise `array1(bucket1(j))` will fail. You need another O(n) loop to determine that. 2. The buckets might contain 0s. How do you distinguish them from empty elements of your arrays? 3. The last step is O(maxV), which does not depend on n (the number of elements), but on the max value. – cdonat Nov 30 '16 at 05:05

Consider precomputing the answer for each number in both lists and storing the results in an array. Use the subscript of each number in its list to index the position in the array that contains the difference.

That gives O(1) lookup.
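
One way to realize this, sketched in C++ (the helper name is illustrative): for each element, precompute its distance to the nearest element of the other bucket via binary search, so a later lookup by subscript is O(1).

#include <algorithm>
#include <climits>
#include <iterator>
#include <vector>

// For each element of `from`, precompute the distance to the nearest
// element of `other`. Looking up an answer by subscript is then O(1).
std::vector<int> precompute(const std::vector<int>& from,
                            std::vector<int> other) {
    std::sort(other.begin(), other.end());
    std::vector<int> nearest(from.size());
    for (std::size_t i = 0; i < from.size(); ++i) {
        auto it = std::lower_bound(other.begin(), other.end(), from[i]);
        int best = INT_MAX;
        if (it != other.end())
            best = std::min(best, *it - from[i]);
        if (it != other.begin())
            best = std::min(best, from[i] - *std::prev(it));
        nearest[i] = best;
    }
    return nearest;
}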

EvilTeach