I'm making a program that tests and compares the stats of Multi-Key Sequential search and Interpolation binary search. I'd like some advice:

What is the best way to sort a randomly generated array of integers, or even to generate it already sorted (if that makes any sense), in the given context?

I was looking into some sorting techniques, but, keeping in mind that the emphasis is on searching (not sorting) performance, all of the advanced sorts seem rather complicated to be used in just one utility method. Considering that the array has to be larger than 10^6 elements (for testing purposes), Modified/Bubble, Selection or Insertion sorts are not an option.

An additional constraint is that all of the array members must be unique.

Now, my initial idea was to split the interval [INT_MIN,INT_MAX] into n intervals (n being the array length) and then add a random integer from 0 to 2^32/n (rounded down) to the beginning of every interval.
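The interval idea above can be sketched roughly as follows. This is only an illustration, not code from the question: the name stratified_sorted is made up, a 32-bit int is assumed, and the offset is drawn from 0 to width - 1 (rather than 0 to width) so that neighbouring intervals can never produce the same value.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <limits>
#include <random>
#include <vector>

// Hypothetical helper (not from the question): one random value per interval.
// Assumes a 32-bit int, so the full range holds 2^32 distinct values.
std::vector<int> stratified_sorted(std::size_t n, std::mt19937_64& eng)
{
    static_assert(sizeof(int) == 4, "sketch assumes a 32-bit int");
    const std::uint64_t range = std::uint64_t{1} << 32;
    assert(n > 0 && n <= range);
    const std::uint64_t width = range / n;   // interval width, rounded down
    std::uniform_int_distribution<std::uint64_t> offset(0, width - 1);
    const std::int64_t lo = std::numeric_limits<int>::min();

    std::vector<int> out;
    out.reserve(n);
    for (std::uint64_t i = 0; i < n; ++i) {
        // Interval i covers [lo + i*width, lo + (i+1)*width - 1]. Intervals never
        // overlap, so the output is strictly increasing with no sorting step.
        const std::uint64_t pos = i * width + offset(eng);
        out.push_back(static_cast<int>(lo + static_cast<std::int64_t>(pos)));
    }
    return out;
}
```

It could be called as stratified_sorted(1000000, eng) with a seeded std::mt19937_64 eng. Note that element i always lands within one interval width of INT_MIN + i*width, so the values sit very close to a straight line, which is about as favourable to interpolation as a layout can be.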

The problem is this:

I presume that, as n rises closer to 2^32, like mine does, Interpolation search begins to give better and better results, as its interpolation gets more accurate.

However:

If I rely solely on pseudo-random number generators (like rand();), their dispersion characteristics dictate the same tendency for a generated-then-sorted array, that is - Interpolation gets better at pinpointing the most likely location as the size gets closer to int limit. Uniformity/dispersion characteristics get lost as n rises to INT_MAX, so, due to stated limitations, Interpolation seems to always win.

Feel free to discuss, criticize and clarify this question as you see fit, but I'm rather desperate for an answer, because the test seems to be rigged in Interpolation's favor either way and I want to analyze them fairly. In short: I want to be convinced that my initial idea doesn't tilt the scales in Interpolation's favor even further, and I want to use it because it's O(n).

Stefan Stanković
  • I'm not sure I get this right, do you want a method to generate a random array of sorted integers, with the criteria being that the array is large (10^6)? any definition for what items should be there (spread, uniformity...)? – Amit Oct 12 '15 at 13:09
  • @Amit Just that they are random, unique and sorted. – Stefan Stanković Oct 12 '15 at 13:18
  • so [1,2,3,6,7,8] is valid? – Amit Oct 12 '15 at 13:24
  • So just walk it.. for every element, set the value to *a[i-1]+random(x)* where *random(x)* is a positive integer bound such that it leaves enough room for n-i elements (should be simple to calculate). – Amit Oct 12 '15 at 13:28
  • I'm voting to close it as primarily opinion-based (and too broad also), and such questions are not for this site. However, if "Modified/Bubble, Selection or Insertion" are the only sorting methods you know, then you definitely need more reading on this. – Petr Oct 12 '15 at 13:29
  • @Petr Thanks for the opinion, then. :D I said that advanced sorts are not an option because the friggin program has to be crammed in a single .cpp file. – Stefan Stanković Oct 12 '15 at 13:37
  • @Amit Yes, that was my initial intention. – Stefan Stanković Oct 12 '15 at 13:38
  • @StefanStanković, many (and all most common) `O(n log n)` sorts not only fit into one cpp file, they usually fit into one 20-30 lines long function... – Petr Oct 12 '15 at 13:39
  • Soo... is there still an open question here? did my comment help you? – Amit Oct 12 '15 at 13:43
  • @Amit I still didn't get an answer to: How to do it without skewing the test results by decreasing the "randomness" of PRNG with interval slicing? – Stefan Stanković Oct 12 '15 at 13:47
  • @Petr Didn't want to heapSort() it, because I'm pretty certain that the result would be the same: for the very long array, elements would form near-perfect linear function that's easily interpolated and that's just the result of `INT_MAX` and `RAND_MAX` values. – Stefan Stanković Oct 12 '15 at 13:58
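
For reference, the "just walk it" suggestion from the comments could look something like the sketch below. The name walk_sorted and the choice to cap each step at twice the average remaining gap are assumptions made here for illustration; the comment only says the step must leave room for the remaining elements. The result is sorted and unique, but it is not a uniformly random subset of the range.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <random>
#include <vector>

// Hypothetical helper: a[i] = a[i-1] + random step, with the step bounded so
// that the remaining elements still fit below hi.
std::vector<int> walk_sorted(std::size_t n, int lo, int hi, std::mt19937_64& eng)
{
    const std::int64_t span = std::int64_t{hi} - lo + 1;
    assert(n > 0 && static_cast<std::int64_t>(n) <= span);

    std::vector<int> out;
    out.reserve(n);
    std::int64_t prev = std::int64_t{lo} - 1;   // value just below the range
    for (std::size_t i = 0; i < n; ++i) {
        const std::int64_t left = static_cast<std::int64_t>(n - i);  // still to place
        const std::int64_t room = hi - prev - (left - 1);            // largest safe step
        // Assumption: cap the step at about twice the average remaining gap,
        // so the first few elements do not swallow most of the range.
        const std::int64_t cap =
            std::min(room, std::max<std::int64_t>(1, 2 * (hi - prev) / left));
        std::uniform_int_distribution<std::int64_t> step(1, cap);
        prev += step(eng);
        out.push_back(static_cast<int>(prev));
    }
    return out;
}
```

Every step is at least 1, so the values are strictly increasing, and the room bound guarantees the last element never exceeds hi.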

4 Answers

Here is a method to generate an ordered random sequence. It uses Knuth's algorithm S and is taken from the book Programming Pearls.

This requires a function that returns a random double in the range [0,1). I included my_rand() as an example. I've also modified it to take an output iterator for the destination.

#include <iterator>
#include <random>
#include <vector>

namespace
{
    std::random_device rd;
    std::mt19937 eng{ rd() };
    std::uniform_real_distribution<> dist; // [0,1)
    double my_rand() { return dist(eng); }
}

// Programming Pearls column 11.2
// Knuth's algorithm S (3.4.2)
// output M integers (in order) in range 1..N
template <typename OutIt>
void knuth_s(int M, int N, OutIt dest)
{
    // select = values still to pick, remaining = candidates still to visit
    double select = M, remaining = N;
    for (int i = 1; i <= N; ++i) {
        if (my_rand() < select / remaining) {
            *dest++ = i;
            --select;
        }
        --remaining;
    }
}

int main()
{
    std::vector<int> data;

    knuth_s(20, 200, std::back_inserter(data)); // 20 values in [1,200]
}

Demo in ideone.com

Blastfurnace
  • This technique works reasonably when m is roughly similar in size to n. If m << n though, it will be slow since it takes O(n) time, whereas simply generating m integers and sorting it is O(m*log(m)). – BeeOnRope May 02 '19 at 20:03
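
The "generate m integers and sort" alternative mentioned in the comment above might be sketched like this. The function name sorted_unique_random is hypothetical, and the top-up loop for replacing duplicates is an assumption added here to satisfy the uniqueness requirement from the question. It assumes m is far smaller than the number of distinct int values.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <limits>
#include <random>
#include <vector>

// Hypothetical helper: draw m values, sort, drop duplicates, and top up
// until m unique values remain. Assumes m is far below 2^32.
std::vector<int> sorted_unique_random(std::size_t m, std::mt19937& eng)
{
    std::uniform_int_distribution<int> dist(std::numeric_limits<int>::min(),
                                            std::numeric_limits<int>::max());
    std::vector<int> v;
    v.reserve(m);
    while (v.size() < m) {
        while (v.size() < m) v.push_back(dist(eng));        // top up to m values
        std::sort(v.begin(), v.end());
        v.erase(std::unique(v.begin(), v.end()), v.end());  // may shrink v again
    }
    assert(std::is_sorted(v.begin(), v.end()));
    return v;
}
```

With m around 10^6 and a 32-bit range, duplicates are rare, so the outer loop should normally run only once or twice and std::sort dominates the cost.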

So you want to generate an "array" that has N unique random numbers and they must be in a sorted order? This sounds like a perfect use for a std::set. When inserting elements into a set they are sorted for us automatically and a set can only contain unique elements so it takes care of checking if the random number has already been generated.

std::set<std::mt19937::result_type> random_numbers;
std::random_device rd;
std::mt19937 mt(rd());
while (random_numbers.size() < number_of_random_numbers_needed)
{
    random_numbers.insert(mt());
}

Then you can convert the set to something else like a std::vector or std::array if you don't want to keep it as a set.
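
A minimal sketch of that conversion, wrapped in a function so it stands on its own. The name unique_sorted_via_set and the value_t alias are illustrative, not part of the answer above. Iterating a std::set visits its elements in ascending order, so the range constructor of std::vector yields a sorted vector directly.

```cpp
#include <cassert>
#include <cstddef>
#include <random>
#include <set>
#include <vector>

using value_t = std::mt19937::result_type;   // unsigned type produced by mt()

// Hypothetical helper: collect `count` unique values in a set, then copy
// them out in ascending order.
std::vector<value_t> unique_sorted_via_set(std::size_t count, std::mt19937& mt)
{
    std::set<value_t> seen;
    while (seen.size() < count) {
        seen.insert(mt());   // duplicates are silently ignored by the set
    }
    return std::vector<value_t>(seen.begin(), seen.end());
}
```

Each insertion costs O(log n), so the whole thing is O(n log n), and a node-based set uses noticeably more memory than a plain array of the same size.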

NathanOliver

What about generating a sorted array from statistical properties?

This probably needs some digging, but you should be able to generate the integers in order by adding a random difference whose mean is the standard deviation of your overall sample.

That raises some problems at the range boundaries, but given the size of your sample you can probably ignore them.
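
One possible reading of this idea, as a rough sketch only: draw each gap from an exponential distribution whose mean is roughly the range divided by the number of elements. Both the choice of distribution and that mean are assumptions made here, not something the answer specifies, and gap_sorted is a made-up name. The boundary issue is handled crudely by clamping, so the last few values may end up as consecutive integers.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <limits>
#include <random>
#include <vector>

// Hypothetical helper: sorted unique ints built from random gaps.
// Assumption: gaps are 1 + floor(exponential), mean about (range / n).
std::vector<int> gap_sorted(std::size_t n, std::mt19937_64& eng)
{
    const std::int64_t lo = std::numeric_limits<int>::min();
    const std::int64_t hi = std::numeric_limits<int>::max();
    assert(n > 0 && static_cast<std::int64_t>(n) <= hi - lo + 1);
    const double mean_gap = static_cast<double>(hi - lo) / static_cast<double>(n);
    std::exponential_distribution<double> gap(1.0 / mean_gap);

    std::vector<int> out;
    out.reserve(n);
    std::int64_t cur = lo - 1;
    for (std::size_t i = 0; i < n; ++i) {
        // The +1 keeps every gap positive, so values stay unique.
        const std::int64_t step = 1 + static_cast<std::int64_t>(gap(eng));
        // Clamp so the remaining n-1-i elements still fit below hi.
        const std::int64_t limit = hi - static_cast<std::int64_t>(n - 1 - i);
        cur = std::min(cur + step, limit);
        out.push_back(static_cast<int>(cur));
    }
    return out;
}
```

This runs in O(n) with no sorting step, at the cost of the clamped tail described above.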

kriss

OK, I've decided to hand the responsibility over to the built-in PRNG and do the following:

Add n rand() results to a binary search tree and fill the array by traversing the tree in order (starting from the leftmost leaf).

Stefan Stanković
  • What will happen if your rand() gives you the same number more than once? You will end up with less than n numbers, right? –  Oct 12 '15 at 15:00
  • @Boris Nope, number of elements is incremented only on successful addition, end-condition is that *n* elements are added, not generated. – Stefan Stanković Oct 12 '15 at 15:16