8

I have to sort a number of integers with values between 30,000,000 and 350,000,000. There will be between 0 and 65,535 integers, with the average count being around 20,000. RAM usage is irrelevant; only speed matters.

Later on I will also have to split them into groups, with a divide being placed wherever the gap between two consecutive values is >65,535, which is what I need the algorithm for.

If it makes any difference, the algorithm will be used in a Perl script.

Edit: After thinking it over and reading the answers I've come to realize something: I don't actually care about the data itself. Since I only want to find the start and end values of groups with small gaps, the sorting only needs to create buckets and can discard the data.

Edit 2: After some testing and trying out the answers provided, the fastest way I found was this:

use feature 'say';    # enables say()

# $gap is the maximum allowed gap within a bucket (65,535 in the question) and
# is assumed to be defined elsewhere, as is @item_offsets (non-empty).
my @sorted = sort { $a <=> $b } @item_offsets;

my @buckets;                          # each bucket is a [start, end] pair
my $start = shift @sorted;            # smallest value seeds the first bucket
push @buckets, [ $start, $start ];

for my $item (@sorted) {
    if ( $item < $buckets[-1][1] + $gap ) {
        $buckets[-1][1] = $item;              # close enough: extend the bucket
    }
    else {
        push @buckets, [ $item, $item ];      # gap too large: start a new one
    }
}
say $#buckets;    # index of the last bucket, one less than the bucket count
Mithaldu

6 Answers

17

It's unlikely that you'll be able to write a sort algorithm in Perl that will perform better than Perl's builtin sort function:

@numbers = sort {$a <=> $b} @numbers;

You can experiment with the sort pragma to see if a particular algorithm is better:

use sort '_quicksort';
use sort '_mergesort';
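
One way to compare candidates on realistic data is the core Benchmark module; a minimal sketch (the data generation and labels below are illustrative, and recent Perls may ignore these algorithm subpragmas entirely):

use Benchmark qw(cmpthese);

# Illustrative data roughly matching the question: values between
# 30,000,000 and 350,000,000, average count around 20,000.
my @data = map { 30_000_000 + int rand 320_000_000 } 1 .. 20_000;

cmpthese( -3, {    # run each candidate for at least 3 CPU seconds
    default   => sub { my @s = sort { $a <=> $b } @data },
    quicksort => sub { use sort '_quicksort'; my @s = sort { $a <=> $b } @data },
    mergesort => sub { use sort '_mergesort'; my @s = sort { $a <=> $b } @data },
});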

Since your cutpoints will vary depending on the data distribution, I think you need to sort the entire list first, then loop over it to do the cutting.

my $prev  = shift @numbers;   # @numbers is already sorted
my @group = ([$prev]);        # one array ref per group; seed with the first value
my $i     = 0;                # index of the group currently being filled

foreach my $n (@numbers) {
    $i++ if ($n - $prev > 65535);    # gap too large: move on to a new group
    push @{$group[$i]}, $n;          # autovivifies the array ref for a new group
    $prev = $n;
}
Michael Carman
17

I'd just make an array of buckets before running the algorithm, one for each range of 65,536 consecutive values. Each bucket records only the min and max of the values that fall into it, not the values themselves. After that pass, do a single pass over the buckets: discard the empty ones, and whenever two consecutive non-empty buckets satisfy min(bucket2) - max(bucket1) < 65,536, combine them. This algorithm is linear time.

Take note of Bucket Sort.
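
A rough Perl sketch of this idea, using @item_offsets from the question (the other names are illustrative):

my $min_val = 30_000_000;    # lower bound of the values, from the question
my $width   = 65_536;        # each bucket covers 65,536 consecutive values

# First pass: drop each value into its bucket, keeping only min and max.
my @bucket;                  # bucket index => [min, max]
for my $v (@item_offsets) {
    my $idx = int( ($v - $min_val) / $width );
    if (defined $bucket[$idx]) {
        $bucket[$idx][0] = $v if $v < $bucket[$idx][0];
        $bucket[$idx][1] = $v if $v > $bucket[$idx][1];
    }
    else {
        $bucket[$idx] = [ $v, $v ];
    }
}

# Second pass: skip empty buckets, merge neighbours whose gap is small.
my @groups;
for my $bkt (grep { defined } @bucket) {
    if (@groups && $bkt->[0] - $groups[-1][1] < $width) {
        $groups[-1][1] = $bkt->[1];      # close enough: extend the last group
    }
    else {
        push @groups, [ @$bkt ];         # gap too large: start a new group
    }
}

Each entry of @groups then holds the [start, end] pair of one group, which is all the question's edit says it needs.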

Brian
  • You managed to summarize the problems really well. I've actually, while reading the replies here, pondered doing something like that, but wasn't too sure on it yet. Thanks. :) – Mithaldu Nov 12 '08 at 19:55
  • I just edited my answer and scrapped a bit of unrelated text, based on your edits. The resultant answer should be much faster, though both were linear time algorithms. – Brian Nov 12 '08 at 19:59
12

I'd use a radix sort, since you need to group the output.
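
A minimal LSD radix sort sketch in Perl, assuming non-negative 32-bit integers like the offsets in the question (the sub name is just illustrative):

sub radix_sort {
    my @values = @_;
    # Sort by one byte at a time, least significant byte first; each pass is
    # stable, so after the fourth pass the list is fully sorted.
    for my $shift (0, 8, 16, 24) {
        my @buckets = map { [] } 0 .. 255;
        push @{ $buckets[ ($_ >> $shift) & 0xFF ] }, $_ for @values;
        @values = map { @$_ } @buckets;
    }
    return @values;
}

my @sorted = radix_sort(@item_offsets);

That said, a pure-Perl radix sort usually loses to the built-in sort, which runs in C, so it is worth benchmarking against a plain sort { $a <=> $b } first.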

FlySwat
5

I was just going to say radix sort, http://en.wikipedia.org/wiki/Radix_sort, though that could be a bit more than you were looking to implement. Introsort, http://en.wikipedia.org/wiki/Introsort, is generally the accepted general-purpose sorting solution: it's a variation of quicksort that switches to heapsort when recursion gets too deep, and typically to insertion sort on very small partitions, where it's faster than quicksort.

Chris Marisic
1

I would try this:

my @sorted = map { unpack "N", $_ } sort map { pack "N", $_ } @unsorted;
Leon Timmermans
  • I'm afraid map is a bit of black magic to me. What does that piece of code do? oO – Mithaldu Nov 12 '08 at 19:56
  • 1
    I assume the map{} is to eliminate the need for a sortsub to get a numeric comparison. The {$a <=> $b} case has been optimized since 5.6.1 so the trickery shouldn't be necessary. – Michael Carman Nov 12 '08 at 20:10
  • You have to read this from right-to-left. The map { pack "N" } @unsorted applies pack "N" to each element--turning each element into a big-endian binary number--the output gets passed to sort which has O(n log n) behavior, and each element of the sorted result gets unpacked and assigned to @sorted. – Max Lybbert Nov 13 '08 at 18:16
  • @michael: didn't know that. Interesting. – Leon Timmermans Nov 14 '08 at 02:55
0

If you use the number as an index to an array, and then increment the count of that position, you've "grouped" them, and done it in one pass.

In pseudocode:

while (more numbers)
    sorted[unsorted[number]]++
    number++

If the range is known ahead of time, you can shift the values down before using them as indexes (for example, subtract the minimum, 30,000,000 here, to bring the index into a sensible range).
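
A Perl sketch of this counting idea, including the offset adjustment and a final scan for the gaps (apart from @item_offsets from the question, the names are illustrative; note the counting array spans the full 320-million value range, which is the memory cost raised in the comments):

my $min = 30_000_000;      # lowest possible value, from the question

# One pass over the input: count occurrences, indexed by value minus $min.
my @count;
$count[ $_ - $min ]++ for @item_offsets;

# One pass over the counting array: recover the sorted values and the groups.
my ($last, @groups);
for my $i (0 .. $#count) {
    next unless $count[$i];
    my $value = $i + $min;
    if (!defined $last || $value - $last > 65_535) {
        push @groups, [ $value, $value ];    # gap too large: start a new group
    }
    else {
        $groups[-1][1] = $value;             # extend the current group
    }
    $last = $value;
}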

warren
  • Bad idea, since the range is much greater than the number of integers (50 million vs 65 thousand), so this "one pass" will be very slow. – Rafał Dowgird Nov 12 '08 at 19:29
  • 1
    You can't get better than one pass, as you must hit each element in the unsorted list at least once in any sorting algorithm that exists. Perl code would look more like my @sorted_values; foreach my $element (@unsorted_values) { $sorted_values[$element]++; }; – Max Lybbert Nov 12 '08 at 19:34
  • Aargh! I put in line breaks to avoid getting Perl code looking that bad as a one-liner! – Max Lybbert Nov 12 '08 at 19:34
  • I tried doing something like this at first, but well, I can't really afford 300 MB of RAM use, as that would make it unacceptably small. However, I realized that I can make a number of buckets like that and thus keep the RAM use sensible. – Mithaldu Nov 12 '08 at 19:54
  • "You can't get better than one pass" - one pass over a 50 million element array is much worse than 4 passes over 65536 integers. And you have to pass over the counting array at least once. – Rafał Dowgird Nov 12 '08 at 20:26
  • @Rafal - regardless of how many items you have, you must make at least one pass. You never "pass" over the counting array. I would love to see a way to sort a group without hitting every item at least once - if you have one, please share it. Until then, you have to hit each item at least once – warren Nov 13 '08 at 02:40
  • @warren: I agree that you have to hit every item in the input at least once. But how can you do a countsort without passing over the counting array? How do you initialize the array? How do you output the sorted integers (or detect the gaps, in this particular case)? – Rafał Dowgird Nov 13 '08 at 08:38
  • 1
    > How do you initialize the array? The array is initialized automatically by Perl. The original question asked how to sort the array, not how to output it. The original question also said "RAM usage is irrelevant and speed only is important," but apparently there was a limit to RAM as well. – Max Lybbert Nov 13 '08 at 18:08
  • > The original question asked how to sort the array, not how to output it. Sort it for the particular purpose of finding gaps. You have to iterate over the whole range to find gaps. Also, zeroing 200 MB isn't free, even if done automatically by Perl. – Rafał Dowgird Nov 13 '08 at 22:23
  • @Rafal - this is still an O(n) sort - since the list of items will be longer than the counted list; technically it's O(n+m), but that still reduces to O(n) – warren Nov 14 '08 at 16:10
  • The only solution would be to test this. – Max Lybbert Nov 14 '08 at 23:58