Fastest way to partially sort array of integers with repeating values into buckets

Question

Let's say I have a large unsorted array of integers (C/C++) that mostly repeat a small range of values. For example, if I start with the following array:

{ 0, 3, 3, 3, 0, 1, 1, 1, 3, 2, 2, 3, 0, 1, 1, 1, 2, 2, 2, 2, 0, 0, 1}

I'd like to end up with this:

{ 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3}

In actuality, my array will have thousands of elements, but the range of values they can have will still be relatively small, like a dozen or so possible values.

My problem is that traditional sorting algorithms (qsort, mergesort, etc) seem a bit overkill, as they will try to ensure that every single element is in its proper position. But I'm looking for an algorithm that only cares to group elements into "buckets" and knows to terminate as soon as that has been achieved.

Also: *"traditional sorting algorithms (qsort, mergesort, etc) seem a bit overkill"* How so? Do they not run fast enough, or what is the problem? — Baum mit Augen, Feb 12 '18 at 01:16
@BaummitAugen I have a choice of which to use here. They do not run fast enough is the problem, I need something that can terminate early. I do not need someone to implement the whole algorithm for me, just a pointer in the right direction to something that could be useful for my situation. — Sunny724, Feb 12 '18 at 01:17
Then do make that choice, please. Different languages will yield different solutions. — Baum mit Augen, Feb 12 '18 at 01:18
*that traditional sorting algorithms seem a bit overkill* Why? You have a sorting problem, and they do exactly that. — Pablo, Feb 12 '18 at 01:18
@Pablo traditional sorting algorithms will try to make sure every element is in sorted order, I just want something that groups elements into buckets. — Sunny724, Feb 12 '18 at 01:19
@Sunny724 fair enough. Do you need to sort to be in place or are you happy with just a copy? — Pablo, Feb 12 '18 at 01:22
@Pablo in place would be preferable, but I can live with a copy if necessary — Sunny724, Feb 12 '18 at 01:23
I've made an update of my answer, as coderredoc pointed out in the comments, my version didn't deal with negative numbers. I fixed that. — Pablo, Feb 12 '18 at 03:53

John Zwinck · Answer 1 · 2018-02-12T01:31:57.863

4

Use a map:

map<int, unsigned> counts;
for (auto value: values)
    ++counts[value];

auto it = begin(values);
for (auto value_count : counts)
    while (value_count.second--)
        *it++ = value_count.first;

That is, create an ordered mapping of values to counts, then use it to overwrite (or create elsewhere) the entire array with the correct count of each value.

Of course, if the values are always integers within a small range, you can use an array instead of the map--for your example with values in [0,3]:

array<unsigned, 4> counts = {};
for (auto value: values)
    ++counts[value];
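
For readers working in C, the fixed-size array variant, including the write-back step, might look roughly like the sketch below. The names `NUM_VALUES` and `group_small_range` are made up for illustration, and the code assumes every element lies in [0, NUM_VALUES - 1].

```c
#include <assert.h>
#include <stddef.h>

#define NUM_VALUES 4  /* assumption: every element lies in [0, NUM_VALUES - 1] */

/* Hypothetical helper: groups equal values together, in ascending order. */
void group_small_range(int *values, size_t len)
{
    size_t counts[NUM_VALUES] = {0};

    /* Pass 1: tally how many times each value occurs. */
    for (size_t i = 0; i < len; ++i) {
        assert(values[i] >= 0 && values[i] < NUM_VALUES);
        ++counts[values[i]];
    }

    /* Pass 2: overwrite the array with each value repeated counts[v] times. */
    size_t out = 0;
    for (int v = 0; v < NUM_VALUES; ++v)
        for (size_t n = 0; n < counts[v]; ++n)
            values[out++] = v;
}
```

Both passes are linear in the array length, and the only extra memory is the small `counts` array, so the array is effectively grouped in place.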

edited Feb 12 '18 at 01:31

answered Feb 12 '18 at 01:20


I'm not looking to count the elements, but partially sort them into buckets. – Sunny724 Feb 12 '18 at 01:26
@Sunny724 The counting is used here as a sorting mechanism. Since you have many duplicates, counting how many of each you have can be thought of as putting each value into a bucket of its own. Since the value in each bucket is the same, you just track how many items are in the bucket, then write out the entire sorted array. – 1201ProgramAlarm Feb 12 '18 at 01:28
This is a good idea as it actually reduces the data footprint without data loss – Grantly Feb 12 '18 at 01:29
Nice answer..+1 But isn't it supposed to be a `C` question? Tag says so – user2736738 Feb 12 '18 at 03:14
@coderredoc: The question was originally tagged C++ as well, the body says "C/C++", and the comments from OP stated that the algorithm matters more than the language choice. Anyway, OP accepted a C version of the same thing. – John Zwinck Feb 12 '18 at 03:17
This `C/C++` thing they do is something I don't support personally. `C` and `C++` are different languages, and I see this a lot. We don't mix up tags like `Java` + `Python`. Anyway, if OP reads this, it would be better to add a `C++` tag. – user2736738 Feb 12 '18 at 03:19
@coderredoc The OP initially tagged C and C++. John had already answered when Neil Butterworth removed both tags at some point. The OP then said in the comments (of the question) that he/she wants a solution in C, so I re-added the C tag. – Pablo Feb 12 '18 at 05:42

Pablo · Accepted Answer · 2018-02-12T03:52:08.760

Well, based on this:

unsorted array of integers that mostly repeat a small range of values

Assuming that there is a maximal value in your list, you could do this:

#include <stdio.h>
#include <string.h>

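void group_vals(int *arr, size_t len, int max)
{
    /* count[v] holds how often v occurs; all values must lie in [0, max] */
    size_t count[max+1];
    memset(count, 0, sizeof count);

    for(size_t i = 0; i < len; ++i)
        count[arr[i]]++;

    /* overwrite arr with each value v repeated count[v] times */
    size_t index = 0;
    for(int v = 0; v <= max; ++v)
    {
        for(size_t j = 0; j < count[v]; ++j)
            arr[index++] = v;
    }
}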

int main(void)
{
    int arr[] = { 0, 3, 3, 3, 0, 1, 1, 1, 3, 2, 2, 3, 0, 1, 1, 1, 2, 2, 2, 2, 0, 0, 1};

    for(size_t i = 0; i < sizeof arr / sizeof *arr; ++i)
        printf("%d, ", arr[i]);
    puts("");

    group_vals(arr, sizeof arr / sizeof *arr, 3);

    for(size_t i = 0; i < sizeof arr / sizeof *arr; ++i)
        printf("%d, ", arr[i]);
    puts("");

    return 0;
}

Here I know that 3 is the maximal value in the list. This outputs:

0, 3, 3, 3, 0, 1, 1, 1, 3, 2, 2, 3, 0, 1, 1, 1, 2, 2, 2, 2, 0, 0, 1, 
0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3, 

edit

NOTE: As user coderredoc pointed out in the comments, the limitation of this approach is that it only works when the original array contains no negative numbers. Extending it to handle negative numbers is not a big problem:

void group_vals(int *arr, size_t len, int absmax)
{
    /* all values must lie in [-absmax, absmax];
       value v is counted at index v + absmax */
    size_t count[2*absmax+1];
    memset(count, 0, sizeof count);

    for(size_t i = 0; i < len; ++i)
        count[arr[i] + absmax]++;

    /* overwrite arr with each value v repeated as often as it occurred */
    size_t index = 0;
    for(int v = -absmax; v <= absmax; ++v)
    {
        for(size_t j = 0; j < count[v + absmax]; ++j)
            arr[index++] = v;
    }
}

Now the function expects the maximum of the absolute values of the array.

This version prints:

-2, 0, 1, 3, 2, 3, -2, -1, -1, 3, 3, 
-2, -2, -1, -1, 0, 1, 2, 3, 3, 3, 3,
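
If the bound is not known in advance, one option is to scan the array once for its minimum and maximum and size the count table from those. The following is only a sketch under that assumption: `group_vals_auto` is a hypothetical name, it allocates the table with `calloc` instead of using a variable-length array, and it assumes the spread `max - min` is small.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical helper: returns 0 on success, -1 if allocation fails. */
int group_vals_auto(int *arr, size_t len)
{
    if (len == 0)
        return 0;
    assert(arr != NULL);

    /* Pass 1: find the smallest and largest value. */
    int min = arr[0], max = arr[0];
    for (size_t i = 1; i < len; ++i) {
        if (arr[i] < min) min = arr[i];
        if (arr[i] > max) max = arr[i];
    }

    /* Assumption: max - min is small, so it does not overflow
       and the table stays cheap to allocate. */
    size_t range = (size_t)(max - min) + 1;
    size_t *count = calloc(range, sizeof *count);
    if (count == NULL)
        return -1;

    /* Pass 2: value v is counted at index v - min. */
    for (size_t i = 0; i < len; ++i)
        ++count[arr[i] - min];

    /* Pass 3: write each value back as often as it occurred. */
    size_t index = 0;
    for (size_t b = 0; b < range; ++b)
        for (size_t n = 0; n < count[b]; ++n)
            arr[index++] = min + (int)b;

    free(count);
    return 0;
}
```

This costs one extra pass over the data, but the caller no longer has to supply a bound, and the table covers only the values between the minimum and maximum.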

PS: I didn't read John Zwinck's answer before writing this, but we both had the same idea; this is the C version of it.

@JohnZwinck that I had the same idea as you but I wrote a C solution. — Pablo, Feb 12 '18 at 02:09
I would UV but make sure you mention that you also assumed that all the array elements are non-negative. This is a difference between the two solution posted. — user2736738, Feb 12 '18 at 03:21
@coderredoc thanks for the feedback, I didn't think about negative numbers, I made an update of my answer. — Pablo, Feb 12 '18 at 03:52
