
This is a common C/C++ implementation of the interpolation search algorithm found around the Internet. However, when it is used with a sorted array of some 100000 integers, the `mid` variable starts generating negative array indexes, causing a segmentation fault. What could the problem be?

#include <stdlib.h>
#include <stdio.h>
#include <time.h>
int interpolationSearch(int sortedArray[], int toFind, int len) {
    // Returns index of toFind in sortedArray, or -1 if not found
    int low = 0;
    int high = len - 1;
    int mid;

    while (sortedArray[low] <= toFind && sortedArray[high] >= toFind) {
        mid = low + ((toFind - sortedArray[low]) * (high - low)) /
              (sortedArray[high] - sortedArray[low]);

        if (sortedArray[mid] < toFind) {
            low = mid + 1;
        } else if (sortedArray[mid] > toFind) {
            high = mid - 1;
        } else {
            return mid;
        }
    }

    if (sortedArray[low] == toFind)
        return low;
    else
        return -1; // Not found
}

// Ascending-order comparator for the qsort call below
int order(const void *a, const void *b) {
    int x = *(const int *)a;
    int y = *(const int *)b;
    return (x > y) - (x < y);
}

int main(void) {
    srand(time(0));
    int arr[100000];
    for (int i=0; i<100000; i++) {
        arr[i] = rand()%100000;
    }

    int length = sizeof(arr)/sizeof(int);
    qsort(arr,length,sizeof(int),order);

    for (int j=0; j<10000; j++) {
        interpolationSearch(arr,rand()%100000,length);
    }
}
Gorkamorka

3 Answers


The problem is in the expression that computes `mid`. The product can easily overflow, even with 32-bit integers, and the result then becomes negative. (Performing the division before the product looks tempting, but see the comments below: the integer division would then truncate to zero.)

Changing the computation of `mid` to use 64-bit integers (at least for the intermediate values) fixes the problem.

Below is my modified version (`int64_t` is defined in `<stdint.h>`):

int interpolationSearch(int sortedArray[], int toFind, int len) {
    // Returns index of toFind in sortedArray, or -1 if not found
    int low = 0;
    int high = len - 1;
    int mid;

    int l = sortedArray[low];
    int h = sortedArray[high];

    while (l <= toFind && h >= toFind) {
        if (h == l)  /* all remaining values are equal: avoid dividing by zero */
            break;
        int64_t high_low = high - low;
        int64_t toFind_l = toFind - l;
        int64_t product = high_low * toFind_l;
        int64_t h_l = h - l;
        int64_t step = product / h_l;
        mid = low + (int)step;

/*      mid = (low + high) / 2;   (dichotomic alternative, see below) */
        int m = sortedArray[mid];

        if (m < toFind) {
            l = sortedArray[low = mid + 1];
        } else if (m > toFind) {
            h = sortedArray[high = mid - 1];
        } else {
            return mid;
        }
    }

    if (sortedArray[low] == toFind)
        return low;
    else
        return -1; // Not found
}

An even simpler fix would be to make it a dichotomic (binary) search instead of an interpolation search, by just using: mid = (low + high) / 2. Even if it converges slightly more slowly than interpolation, it avoids several operations, including a product and a division, making the inner loop faster. I am not sure the potentially faster convergence of interpolation compensates for that loss of simplicity.
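For reference, here is a minimal sketch of that dichotomic variant. Note the use of low + (high - low) / 2 rather than (low + high) / 2: the latter can itself overflow when low and high are both huge, which is the same class of bug we are fixing.

int binarySearch(int sortedArray[], int toFind, int len) {
    // Classic dichotomic search: same contract as interpolationSearch
    int low = 0;
    int high = len - 1;
    while (low <= high) {
        int mid = low + (high - low) / 2;  // overflow-safe midpoint
        if (sortedArray[mid] < toFind)
            low = mid + 1;
        else if (sortedArray[mid] > toFind)
            high = mid - 1;
        else
            return mid;
    }
    return -1; // Not found
}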

I did some performance tests. The source of my test program is included in this question.

Surprisingly (for me), using floats gives a more efficient program than using large integers. On my system, binary search is the faster of the two up to about 1000 items in the array. For arrays of size 100000, interpolation search is nearly two times faster than simple binary search.
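For readers who want to reproduce the measurements, a harness along these lines can be used (the helper time_searches and the use of clock() are illustrative sketches, not my actual test program; it reuses the corrected interpolationSearch above):

#include <stdlib.h>
#include <time.h>

/* Illustrative harness: time `searches` random lookups in the sorted
 * array `arr` of length `n`, using the interpolationSearch above. */
double time_searches(int arr[], int n, int searches) {
    clock_t start = clock();
    for (int i = 0; i < searches; i++)
        interpolationSearch(arr, rand() % n, n);
    return (double)(clock() - start) / CLOCKS_PER_SEC;
}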

kriss
  • Doesn't even have to be a 16-bit problem. The product in the expression that calculates `mid` can easily overflow with normal 32- and even 64-bit integers, if the values stored in the array are large enough. – Björn Pollex Jan 20 '11 at 20:18
  • @Space_C0wb0y: Yes, you're right. I'm not yet sure whether the interpolation expression can overflow, but there is definitely something wrong there. – kriss Jan 20 '11 at 20:37
  • I'm running gcc on WinXP, using an Intel Pentium M processor. – Gorkamorka Jan 20 '11 at 20:40
  • 1
    @kriss: Suppose the last element in the array has the value `100000`, and first one `1`. Suppose we are looking for the value `99999`. Then in the first run this product will yield 9999800001, which way to big for a 32 bit integer. – Björn Pollex Jan 20 '11 at 20:40
  • @Space_C0wb0y: Yes, totally right. But there is also another problem. I tested it on a 64-bit system where it does not overflow, and in some cases the loop runs forever. Still trying to understand this one. – kriss Jan 20 '11 at 20:43
  • A printf of sizeof(int) yielded the answer 4. The int-overflow theory might be onto something: I tried substituting int low, int high and int mid with long long ints and had a successful run. However, should mid really be getting values that big? – Gorkamorka Jan 20 '11 at 20:46
  • The problem is with the intermediate product `(high - low) * (toFind - sortedArray[low])`, which can become very large. Each factor can be 16 bits, so the product can be 32 bits and overflow. – kriss Jan 20 '11 at 20:57
  • @Space_C0wb0y: OK, the other problem I saw was also related to the same overflow. It just occurred to me that my ints are also 32 bits, even on my 64-bit system. – kriss Jan 20 '11 at 21:11
  • 1
    The problem with your revised expression is that integer division will normally cause it to underflow to zero. – Gareth Rees Jan 20 '11 at 21:18
  • @Gareth Rees: Yes, you are right. Then we should definitely go with either 64-bit integers or dichotomy. – kriss Jan 20 '11 at 21:26
  • @Gareth Rees: as I'm curious about performance, I will add some timing measurements to my answer. – kriss Jan 20 '11 at 21:51
  • 1
    @kriss: Normal use case for interpolation search is for very large tables, or when random lookups are expensive, such as disk searches. For well-behaved data, interpolation search is O(log log n). – Gareth Rees Jan 20 '11 at 21:54
  • @kriss: Knuth recommends that you switch from interpolation to binary search when the remaining portion to be searched is small enough for binary search to be faster. See *The Art of Computer Programming* §6.2.1. – Gareth Rees Jan 20 '11 at 21:56
  • @Gareth Rees: Actually I tried something like fixed point as you were typing your answer (as the product result was likely converging toward `(high-low) / 2`), but small power values for F were not enough with my random input. – kriss Jan 20 '11 at 21:57
  • @Gareth Rees: that's exactly why I want some actual measurements. I wonder where the point is at which binary search becomes faster for this implementation, with current processors, compilers and memory managers. My rough guess would be that it is well above 1000000 items, but I may be wrong. – kriss Jan 20 '11 at 22:03
  • @Gareth Rees: OK, let's do some back-of-the-envelope thinking: log2(100000) ≈ 17, log2(log2(100000)) ≈ 5. I would say that the simpler expression is at least 3 times faster, but that memory accesses are even more costly, say ten times with cache misses. If those evaluations are right, even performing one fewer loop iteration would already be a winning move, so binary search should become faster at quite a small size, maybe less than 100 items... I'm surprised by this estimate. A check with real values is definitely necessary. – kriss Jan 20 '11 at 22:15

The sub-expression: ((toFind - sortedArray[low]) * (high - low))

... can easily evaluate to something like: ((99999-0) * (99999-0)) == 99999^2

... which is much larger than 2^31 - 1 (the largest value a 32-bit signed integer can hold).

Once it exceeds 2^31 - 1, the integer overflows into negative numbers, hence your negative indices. If it exceeds 2^32 (which it also could), then you'll lose the high-order bits (strictly speaking the behaviour is undefined) and end up with effectively random offsets, both positive and negative.
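A small demonstration of the effect (a sketch: formally, signed-integer overflow is undefined behaviour in C, and the truncated value shown is simply what typical two's-complement machines produce):

#include <stdio.h>
#include <stdint.h>

int main(void) {
    /* The true product needs 34 bits, so it cannot fit in 32. */
    int64_t exact = (int64_t)99999 * 99999;   /* 9999800001 */
    int32_t wrapped = (int32_t)exact;         /* keeps only the low 32 bits */
    printf("exact   = %lld\n", (long long)exact);
    printf("wrapped = %d\n", (int)wrapped);
    return 0;
}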

To avoid all of this, you need to do your math carefully to make sure none of your sub-expressions yield an integer overflow. Usually the easiest way to do this is to convert to floating-point whose range is many orders of magnitude larger than 32-bit integers.
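For example, only the mid computation needs to change. A sketch (assuming sortedArray[high] != sortedArray[low], which the loop invariant guarantees whenever low != high and the values differ):

int interpolate_mid(const int sortedArray[], int low, int high, int toFind) {
    /* Do the arithmetic in double: for values and index spans below
     * about 2^26, the product even fits exactly in the 53-bit mantissa. */
    double num = (double)(toFind - sortedArray[low]) * (high - low);
    double den = (double)sortedArray[high] - (double)sortedArray[low];
    return low + (int)(num / den);  /* truncates toward zero, like int division */
}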

In the final analysis, interpolation such as this over plain binary search is usually not worth it: the expense of computing the interpolant is typically greater than that of the few extra loop iterations it "saves".

mcmcc
  • and floating-point computation would even make things worse, as it costs much more than computing with integers. – kriss Jan 20 '11 at 21:07
  • @kriss: Notice I said "easiest way", not "most efficient way". – mcmcc Jan 20 '11 at 21:16
  • One use case where interpolation is a big win is searching for a target position in unindexed, compressed video (particularly large video files). This is because you can't directly read the timestamp at a given byte position - you have to scan forwards or backwards, reading the data, to find it. – caf Jan 21 '11 at 00:39

As the other answers have explained, you're trying to compute an expression of the form

A * B / C

but this goes wrong because A * B overflows. The suggestion to revise the expression to

A * (B / C)

won't work, because typically B is less than C and so the integer division will truncate to zero. (For example, with A = 99999, B = 50 and C = 99999, A * (B / C) evaluates to 0, so mid stays at low and the search degrades into a linear scan.)

The suggestion to switch to floating-point would work, but would be costly. But you could use fixed point by transforming the expression to:

A * ((B * F) / C) / F

(where F is a carefully chosen power of two).
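A sketch with F = 1 << 10 (the shift amount is my illustrative choice: F must be large enough that (B * F) / C retains some precision, yet small enough that every sub-expression still fits in 32 bits):

/* Fixed-point interpolation step. Assumes 0 <= B <= C, C > 0, and
 * values/index spans below about 2^20, so with F = 2^10:
 * B*F <= 2^30, (B*F)/C <= F, and A*((B*F)/C) <= 2^30 -- no overflow. */
int interpolation_step(int A, int B, int C) {
    return A * ((B << 10) / C) / (1 << 10);  /* A * ((B*F)/C) / F */
}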

Gareth Rees
  • Generally, on modern processors, a floating point divide will be cheaper/faster than an integer or fixed point divide. However, the conversions may make the convert+float divide slower overall. – Chris Dodd Jan 21 '11 at 00:40