
Possible Duplicate:
Algorithm to find k smallest numbers in array of n items

How do you find the first 20 smallest elements in a very big array?

Sumeet
  • Do you want the first elements or the 20 smallest elements? You can't have both – PiTheNumber Oct 12 '11 at 13:09
  • Possibly more suitable for [programmers.stackexchange.com](http://programmers.stackexchange.com), as this appears to be platform agnostic? – Kasaku Oct 12 '11 at 13:10
  • @PirateKitten Platform agnostic doesn't absolve itself from being algorithms. Programmers is about processes, not algorithms. – corsiKa Oct 12 '11 at 13:12
  • Ah, fair enough, though there is [overlap](http://meta.stackexchange.com/questions/108695/algorithms-on-programmers-se-or-stack-overflow), and given there was no source involved I thought it was a valid suggestion. – Kasaku Oct 12 '11 at 13:15

4 Answers

2

You have two options:

  1. Sort the array and pull the 20 elements on the small end (depends on which direction you sort the array, right?)
  2. Keep a sorted collection (it may not be a true set, since the array can contain duplicates) of 20 elements. Seed it with the first 20 elements of the array, then walk the rest: each time you find a value smaller than the largest element in this 'good set', replace that largest element with the new value.

The second one may seem slower, but it really depends on the size of the array: it needs only a single pass through the data, so it can be the better choice for an array of, say, eight billion elements.

Edit: the first algorithm is O(n lg n); the second is O(k n), where k here is 20 (you want the smallest 20). So the second algorithm is faster when lg n > 20, i.e. n > 2^20, roughly one million. With fewer than a million elements you're better off sorting; with more, you're better off keeping the external list of 20 and making a single pass.
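
As a minimal sketch of option 2, one could use a max-heap (std::priority_queue) in place of a literal sorted set; the heap's top is always the largest of the 20 candidates, so each replacement costs O(lg 20). The function name is illustrative:

  #include <cstddef>
  #include <queue>
  #include <vector>

  // Option 2 as a max-heap: the top of the heap is always the largest of
  // the (up to) 20 smallest values seen so far.
  std::vector<int> smallest20_heap(const std::vector<int>& a) {
     std::priority_queue<int> heap;                  // max-heap
     for (std::size_t i = 0; i < a.size(); i++) {
        if (heap.size() < 20) {
           heap.push(a[i]);                          // fill the first 20 slots
        } else if (a[i] < heap.top()) {
           heap.pop();                               // evict the current largest
           heap.push(a[i]);                          // keep the smaller value
        }
     }
     std::vector<int> result;
     while (!heap.empty()) {                         // comes out largest first
        result.push_back(heap.top());
        heap.pop();
     }
     return result;
  }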

corsiKa
  • Wow, I'm very curious as to why this is downvoted as (not to toot my own horn, but) it is the most detailed and correct answer on the list. – corsiKa Oct 12 '11 at 13:19
  • Option 2 will be faster when array contains over 1 million elements since sorting is O(n lg n) and lg 1000000 = 20. The critical limit is probably even smaller since n lg n sorting is more complex job than updating a list of 20 elements. – Avo Muromägi Oct 12 '11 at 13:20
  • can there be a process where we can first hash that array and then sort and finally find the top 20 values, so that we don't have to sort a very big array. – Sumeet Oct 12 '11 at 13:20
  • @avo Yes, exactly. Just curious, does my answer not indicate that? :-) – corsiKa Oct 12 '11 at 13:22
  • @Sumeet It would seem that way at first, but then you find yourself still doing a sort. You could try to do something that reduces the list (say takes the smallest half of the array or something) but it always ends up being slightly less efficient than the option 2, because it has to keep guessing at where the cutoff is. – corsiKa Oct 12 '11 at 13:24
  • Your assumption that two algorithms with the same O are equally fast is incorrect. The second approach will be faster for much smaller arrays than 1 million. – Klas Lindbäck Oct 12 '11 at 13:46
  • @Klas The worst case breakeven point is on the order of 1 million. Obviously an analysis of the algorithm would have to be done to find an exact breakeven point, which would probably be machine dependent. The main idea is to prove the existence of a break even point and then move it as appropriate. – corsiKa Oct 12 '11 at 13:55
1

If the array is really big, sorting it would take a long time and a lot of space.

What you need:

  • Copy the first 20 elements of the array A into a new array B.

  • Sort B

  • Walk over array A and, for each element, check whether it is smaller than B[19].

  • If it is, insert it into B in sorted position and delete the last (now 21st) element of B.
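
As a sketch of these steps (function and variable names are illustrative), B can be kept sorted by inserting each new value at its binary-search position instead of re-sorting the whole buffer; it assumes A has at least 20 elements:

  #include <algorithm>
  #include <cstddef>
  #include <vector>

  // B holds the 20 smallest values of A seen so far, in ascending order.
  std::vector<int> smallest20_sorted(const std::vector<int>& A) {
     std::vector<int> B(A.begin(), A.begin() + 20);
     std::sort(B.begin(), B.end());                           // sort the seed
     for (std::size_t i = 20; i < A.size(); i++) {
        if (A[i] < B.back()) {                                // smaller than B[19]?
           // insert in sorted position, then drop the now-21st (largest) element
           B.insert(std::upper_bound(B.begin(), B.end(), A[i]), A[i]);
           B.pop_back();
        }
     }
     return B;
  }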

PiTheNumber
  • Using a sorted array will unnecessarily increase the complexity. You would be much better off using a linked list which would have a `O(k)` insert cost, as opposed to the array which would be `O(k lg k)`. – corsiKa Oct 12 '11 at 13:25
0

Not sure if it will be optimal, but you can try running 20 iterations of selection sort, so that each pass moves the next-smallest element to the front.
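
A minimal sketch of that idea as a partial selection sort (the code is illustrative, not the answerer's): each of the 20 passes pulls the next-smallest remaining element to the front, which matches the 20·n cost noted in the comments below.

  #include <cstddef>
  #include <utility>
  #include <vector>

  // Run only the first 20 iterations of selection sort, so the 20 smallest
  // elements end up, in ascending order, at the front of the array.
  void partialSelectionSort20(std::vector<int>& a) {
     std::size_t k = std::min<std::size_t>(20, a.size());
     for (std::size_t i = 0; i < k; i++) {
        std::size_t minIdx = i;
        for (std::size_t j = i + 1; j < a.size(); j++)
           if (a[j] < a[minIdx]) minIdx = j;           // find the next minimum
        std::swap(a[i], a[minIdx]);                     // move it into place
     }
  }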

Marcin
  • Of course complexity of that algorithm is 20*n, where n is a array's length. – Marcin Oct 12 '11 at 13:16
  • That actually is optimal depending on the size. It's not so optimal on a "big" array of size 21. :-) – corsiKa Oct 12 '11 at 13:21
  • Better than sorting the whole thing, even if the 2^20 thing applies. Not sure I see any way out of it this time, though... – Patrick87 Oct 12 '11 at 13:27
0

For God's sake, don't sort the whole array. Keep an array of size 20, initialized to the first 20 elements of the big array. Now step through the rest of the big array, and whenever the element you are currently considering is smaller than the largest value in the small array, replace that largest value with it. This is O(n), better than any comparison-based sort will ever do, and possibly more efficient (with a good implementation) than linear-time sorts (which can't always be used, anyway).

EDIT:

So, out of curiosity, I implemented a naive version of the linear algorithm and compared it to the C++ STL sort() function. The results show that, as I expected, the linear algorithm is on average always faster than sorting, even though in its theoretical worst case you would need a larger array for it to win. Here are my performance figures:

        N        Sort      Linear      Common
       32,        378,        170,        116
       64,        831,        447,        237
      128,       1741,       1092,        424
      256,       5260,       2211,        865
      512,      10955,       5944,       1727
     1024,      20451,      10529,       3584
     2048,      38459,      21723,       7011
     4096,      77697,      41023,      14136
     8192,     150630,      82919,      28083
    16384,     311593,     166740,      55978
    32768,     648331,     334612,     111891
    65536,    1329827,     673030,     224665
   131072,    2802540,    1342430,     449553
   262144,    5867379,    2717356,     896673
   524288,   12082264,    5423038,    1798905
  1048576,   25155593,   10941005,    3658716
  2097152,   62429382,   24501189,    8940410
  4194304,  120370652,   44820562,   14843411

N is the problem size, Sort is the sort time in microseconds, Linear is the linear algorithm time in microseconds, and Common is the time spent randomizing the array before each of the tests. Note that to get just the time spent in the Sort and Linear algorithms, you would need to subtract the column-four value from the values in columns two and three; I would be happy to do this if anyone would like. Still, it's clear that the linear approach is faster than sorting. Each N was tested 100 times, and these are aggregate figures (summed times) over all 100 runs. Here is the code I used:

  #include <algorithm>   // sort
  #include <cstdio>      // printf
  #include <cstdlib>     // malloc, free, rand, srand
  #include <ctime>       // time
  #include <sys/time.h>  // gettimeofday, timeval
  using namespace std;

  // Fill the buffer with random bytes.
  void randomize(unsigned char *data, int n) {
     for(int i = 0; i < n; i++)
        data[i] = (unsigned char)(rand() % 256);
  }

  // Baseline: sort the whole array, then copy out the first 20 elements.
  void sorttest(unsigned char *data, int n) {
     unsigned char results[20];
     sort(data, data + n);
     for(int i = 0; i < 20; i++)
        results[i] = data[i];
  }

  // Linear scan: keep the 20 smallest values seen so far in results[],
  // evicting the current maximum whenever a smaller value shows up.
  void scantest(unsigned char *data, int n) {
     unsigned char results[20];
     for(int i = 0; i < 20; i++)
        results[i] = data[i];

     for(int i = 20; i < n; i++) {
        int maxIdx = 0;                           // index of the largest candidate
        for(int j = 1; j < 20; j++)
           if(results[j] > results[maxIdx])
              maxIdx = j;
        if(data[i] < results[maxIdx])
           results[maxIdx] = data[i];             // replace the largest candidate
     }
  }


  void dotest(int n)
  {
     unsigned char *data = (unsigned char*)malloc(n);
     timeval t1, t2, t3, t4, t5, t6;

     gettimeofday(&t1, 0);
     for(int i = 0; i < 100; i++) {
        randomize(data, n);
        sorttest(data, n);
     }
     gettimeofday(&t2, 0);


     gettimeofday(&t3, 0);
     for(int i = 0; i < 100; i++) {
        randomize(data, n);
        scantest(data, n);
     }
     gettimeofday(&t4, 0);

     gettimeofday(&t5, 0);
     for(int i = 0; i < 100; i++)
        randomize(data, n);
     gettimeofday(&t6, 0);

     int dt1 = 1000000*(t2.tv_sec - t1.tv_sec) + (t2.tv_usec - t1.tv_usec);
     int dt2 = 1000000*(t4.tv_sec - t3.tv_sec) + (t4.tv_usec - t3.tv_usec);
     int dt3 = 1000000*(t6.tv_sec - t5.tv_sec) + (t6.tv_usec - t5.tv_usec);
     printf("%10d, %10d, %10d, %10d\n", n, dt1, dt2, dt3);
     free(data);
  }

  int main() {
     srand(time(0));
     for(int i = 32; i < 5000000; i*=2) dotest(i);
     return 0;
  }

I invite anybody who claims that sorting is just as good to point out how I can modify this benchmark to be more fair or correct so that sorting comes out on top. No, really: feel free to experiment with it yourselves.

Patrick87
  • Well, if the "big" array was of size 40, it would be faster to sort it and get the smallest 20. For the smallest 20 elements, it's faster to "sort and cut" for arrays smaller than 2^20. – corsiKa Oct 12 '11 at 13:20
  • The 2^20 number is only accurate if you only count comparisons AND you use the most naive implementation method: keeping the small array unsorted and checking the whole thing each time. If you think about it a bit, more efficient methods should come to mind... If you can cut the 20 comparisons to 10 on average, the array's size would only need to be 1024. – Patrick87 Oct 12 '11 at 13:25
  • Big O notation takes into account worst case, not average case. :-) You're right that the theoretical cutoff and the practical cutoff would be different. In my case, I would rather have the "inefficient" method (which still runs in linear time, mind you) on the small arrays just so I write one method to maintain. The performance gain of using the sort on the smaller arrays and the secondary list on the bigger arrays is so small I would hardly consider it at all, honestly. I just was putting some considerations out there. :-) – corsiKa Oct 12 '11 at 13:28
  • For what it's worth, the bound (e.g. Big Oh) is orthogonal (different from, unrelated to) the case (best, average, worst). There was a good thread a few weeks ago about the distinction. Still, it's possible to do better than 20 searches each time *in the worst case*... anyway. – Patrick87 Oct 12 '11 at 13:34