22

Let's say I have a collection of some type, e.g.

IEnumerable<double> values;

Now I need to extract the k highest values from that collection, for some parameter k. A very simple way to do this is:

values.OrderByDescending(x => x).Take(k)

However, this (if I understand it correctly) first sorts the entire list and then picks the first k elements. But if the list is very large and k is comparatively small (smaller than log n), this is not very efficient: the list is sorted in O(n log n), but I figure selecting the k highest values from a list should be more like O(nk).

So, does anyone have any suggestion for a better, more efficient way to do this?

Theodor Zoulias
  • 34,835
  • 7
  • 69
  • 104
Henrik Berg
  • 519
  • 5
  • 21
  • 6
    This is known as a selection algorithm. See http://en.wikipedia.org/wiki/Selection_algorithm (it says "K smallest" but you can find the "K largest" by reversing the ordering comparison, of course). "Partial sorting" is a special case, which is more what you want: http://en.wikipedia.org/wiki/Partial_sorting – Matthew Watson Feb 26 '13 at 12:43
  • 1
    Related: [Fast Algorithm for computing percentiles to remove outliers](http://stackoverflow.com/questions/3779763/fast-algorithm-for-computing-percentiles-to-remove-outliers) – sloth Feb 26 '13 at 12:49
  • I guess another solution would be to sort it **when items are added** (instead of when accessing). That way, you avoid needing to sort it. – default Feb 26 '13 at 12:58
  • I nearly forgot, but +1 for realizing that `OrderBy(...).Take(...)` is inefficient. The number of times I've seen `OrderBy(...).First()` here and elsewhere is depressing. It would have been interesting if Microsoft had baked this into LINQ by making special overloads of `Take`, `First` etc. that worked on an `IOrderedEnumerable`. – Rawling Feb 26 '13 at 13:05
  • 1
    It's amazing that there isn't an easy to find C# implementation of a partial sort, especially considering it's built into C++'s STL library! – Matthew Watson Feb 26 '13 at 13:12

5 Answers

8

This gives a bit of a performance increase. Note that as written it collects the n smallest values in ascending order rather than the n largest in descending order, but you should be able to repurpose it (see the comments in the code):

static IEnumerable<double> TopNSorted(this IEnumerable<double> source, int n)
{
    List<double> top = new List<double>(n + 1);
    using (var e = source.GetEnumerator())
    {
        for (int i = 0; i < n; i++)
        {
            if (e.MoveNext())
                top.Add(e.Current);
            else
                throw new InvalidOperationException("Not enough elements");
        }
        top.Sort();
        while (e.MoveNext())
        {
            double c = e.Current;
            int index = top.BinarySearch(c);  // index of c if it is already present
            if (index < 0) index = ~index;    // otherwise ~index is the insertion point
            if (index < n)                    // if (index != 0)
            {
                top.Insert(index, c);
                top.RemoveAt(n);              // top.RemoveAt(0)
            }
        }
    }
    return top;  // return ((IEnumerable<double>)top).Reverse();
}
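
For convenience, here is the same method with the commented changes applied, returning the n largest values in descending order (a sketch assembled from the comments above; the hypothetical name TopNSortedDescending is mine):

static IEnumerable<double> TopNSortedDescending(this IEnumerable<double> source, int n)
{
    List<double> top = new List<double>(n + 1);
    using (var e = source.GetEnumerator())
    {
        // Seed the candidate list with the first n elements.
        for (int i = 0; i < n; i++)
        {
            if (e.MoveNext())
                top.Add(e.Current);
            else
                throw new InvalidOperationException("Not enough elements");
        }
        top.Sort();
        while (e.MoveNext())
        {
            double c = e.Current;
            int index = top.BinarySearch(c);
            if (index < 0) index = ~index;
            if (index != 0)           // c is larger than the current smallest candidate
            {
                top.Insert(index, c);
                top.RemoveAt(0);      // drop the smallest to keep n candidates
            }
        }
    }
    return ((IEnumerable<double>)top).Reverse(); // largest value first
}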
Rawling
  • 49,248
  • 7
  • 89
  • 127
  • Could also be an extension method to "work with LINQ" so to say. – default Feb 26 '13 at 12:53
  • And then it's not `O(n*k)` it's `O(n*k*k*logk)` something – default locale Feb 26 '13 at 12:54
  • @Default Whoops yes, I never bother when knocking these things together and I forgot to put it in :) – Rawling Feb 26 '13 at 12:58
  • @defaultlocale ... Is that a good thing? It seems faster at selecting 50 of 10k elements but I've not really thought about the n/k behaviour. (I'd have thought `n log k` as you're making `n` binary inserts into a group of size `k`.) – Rawling Feb 26 '13 at 12:59
  • @Rawling it looks like `O(n*k*logk)` so, it's close to OP's request. – default locale Feb 26 '13 at 13:02
  • @defaultlocale I tested it and it does look like you're right, now I need to figure out why :) – Rawling Feb 26 '13 at 13:21
  • Yes, looks like O(n*k*logk) because of the binary search through k elements repeated at each iteration, but anyway I think you can't avoid this. – Henrik Berg Feb 26 '13 at 13:49
  • Thanks for the answer, btw (and to everyone else for theirs)! – Henrik Berg Feb 26 '13 at 13:50
  • @Henrik But a binary search through `k` elements should only be `log k`, and there are `n` iterations... I still don't get where the extra factor of `k` comes from. (But from varying `k` it does look like it is there!) – Rawling Feb 26 '13 at 13:50
  • @Dzienny Does the `k+` come from the updating the internal list? I was just counting comparisons, but that could be the limiting factor instead. – Rawling Feb 26 '13 at 13:59
  • @Rawling Both `Insert` and `RemoveAt` are O(n) operations. – Dzienny Feb 26 '13 at 14:17
  • @Dzienny Yeah... I was under the impression that when measuring sorts you treated comparisons as the expensive part and discounted trivial things like shifting items about, but I guess not. Still, this doesn't explain why I'm seeing `nk logk` rather than just `nk`(The `+ logk` disappears asymptotically, right?) – Rawling Feb 26 '13 at 14:22
  • Bit late but assuming that `k = wanted sorted elements` and `n = number elements in source` (usual conventions) this is `O(k + k log k + (n - k) * (log k + 2k))`. `O(k)` is the first loop. The sort is `k log k`. The later loop is run `n-k` times and does a binary search (`log k`), and add/remove a value (`2k`). Usually you get `O(n + k log k)` for this which is asymptotically better. – Voo Jun 10 '14 at 02:22
2

Consider the below method:

static IEnumerable<double> GetTopValues(this IEnumerable<double> values, int count)
{
    var maxSet = new List<double>(Enumerable.Repeat(double.MinValue, count));
    var currentMin = double.MinValue;

    foreach (var t in values)
    {
        if (t <= currentMin) continue;   // not among the current top values
        maxSet.Remove(currentMin);       // drop the smallest candidate
        maxSet.Add(t);
        currentMin = maxSet.Min();       // recompute the smallest candidate (O(count))
    }

    return maxSet.OrderByDescending(i => i);
}

And the test program:

static void Main()
{
    const int SIZE = 1000000;
    const int K = 10;
    var random = new Random();

    var values = new double[SIZE];
    for (var i = 0; i < SIZE; i++)
        values[i] = random.NextDouble();

    // Test values
    values[SIZE/2] = 2.0;
    values[SIZE/4] = 3.0;
    values[SIZE/8] = 4.0;

    IEnumerable<double> result;

    var stopwatch = new Stopwatch();

    stopwatch.Start();
    result = values.OrderByDescending(x => x).Take(K).ToArray();
    stopwatch.Stop();
    Console.WriteLine(stopwatch.ElapsedMilliseconds);

    stopwatch.Restart();
    result = values.GetTopValues(K).ToArray();
    stopwatch.Stop();
    Console.WriteLine(stopwatch.ElapsedMilliseconds);
}

On my machine the results are 1002 ms and 14 ms.

Ryszard Dżegan
  • 24,366
  • 6
  • 38
  • 56
0

Another way of doing this (haven't been around C# for years, so pseudo-code it is, sorry) would be:

highestList = []                      // kept sorted, highest value first
lowestValueOfHigh = -infinity
for every item in the list
    if highestList.length < k
        insert item into highestList with binarysearch
        lowestValueOfHigh = highestList[highestList.length - 1]
    else if item > lowestValueOfHigh
        delete highestList[highestList.length - 1] from highestList
        insert item into highestList with binarysearch
        lowestValueOfHigh = highestList[highestList.length - 1]
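
In C# this idea could look roughly like the following sketch (the working list is kept in ascending order internally; the helper name TopKSorted is mine):

static List<double> TopKSorted(IEnumerable<double> values, int k)
{
    // Kept sorted ascending; element 0 is the lowest of the current highest values.
    var highest = new List<double>(k + 1);
    foreach (var item in values)
    {
        if (highest.Count < k || item > highest[0])
        {
            if (highest.Count == k)
                highest.RemoveAt(0);                // drop the lowest of the highs
            int index = highest.BinarySearch(item); // insert with binary search
            if (index < 0) index = ~index;
            highest.Insert(index, item);
        }
    }
    highest.Reverse(); // highest value first
    return highest;
}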
Sebastian van Wickern
  • 1,699
  • 3
  • 15
  • 31
0

I wouldn't state anything about performance without profiling. In this answer I'll just try to implement the O(n*k) take-one-enumeration-per-max-value approach. Personally I think the ordering approach is superior. Anyway:

public static IEnumerable<double> GetMaxElements(this IEnumerable<double> source)
{
    var usedIndices = new HashSet<int>();
    while (true)
    {
        int index = 0;
        int maxIndex = 0;
        double? maxValue = null;
        // One full enumeration of the source per extracted maximum.
        foreach (var current in source)
        {
            if ((!maxValue.HasValue || current > maxValue) && !usedIndices.Contains(index))
            {
                maxValue = current;
                maxIndex = index;
            }
            index++;
        }
        if (!maxValue.HasValue) break; // every element has already been returned
        usedIndices.Add(maxIndex);
        yield return maxValue.Value;
    }
}

Usage:

var biggestElements = values.GetMaxElements().Take(3);

Downsides:

  1. The method assumes that the source IEnumerable has a stable enumeration order (it is enumerated multiple times).
  2. The method uses additional memory/operations to keep track of the used indices.

Advantage:

  • You can be sure that it takes exactly one enumeration of the source to get each next max value.

See it running

default locale
  • 13,035
  • 13
  • 56
  • 62
0

Here is a Linqy TopN operator for enumerable sequences, based on the PriorityQueue<TElement, TPriority> collection:

/// <summary>
/// Selects the top N elements from the source sequence. The selected elements
/// are returned in descending order.
/// </summary>
public static IEnumerable<T> TopN<T>(this IEnumerable<T> source, int n,
    IComparer<T> comparer = default)
{
    ArgumentNullException.ThrowIfNull(source);
    if (n < 1) throw new ArgumentOutOfRangeException(nameof(n));
    PriorityQueue<bool, T> top = new(comparer); // the values themselves act as priorities; the bool element is a dummy
    foreach (var item in source)
    {
        if (top.Count < n)
            top.Enqueue(default, item);
        else
            top.EnqueueDequeue(default, item); // add the item and remove the smallest, keeping the best n
    }
    List<T> topList = new(top.Count);
    while (top.TryDequeue(out _, out var item)) topList.Add(item);
    for (int i = topList.Count - 1; i >= 0; i--) yield return topList[i];
}

Usage example:

IEnumerable<double> topValues = values.TopN(k);

The topValues sequence contains the k maximum values of values, in descending order. In case there are duplicates among the topValues, the order of the equal values is undefined (non-stable sort).

For a SortedSet<T>-based implementation that compiles on .NET versions earlier than .NET 6, you could look at the 5th revision of this answer.
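
For illustration only, here is a rough sketch of what a SortedSet<T>-based TopN could look like (not necessarily identical to that revision; it keeps duplicate values by breaking ties on the element's position, and assumes value tuples are available):

public static IEnumerable<T> TopNSortedSet<T>(this IEnumerable<T> source, int n,
    IComparer<T> comparer = null)
{
    if (source == null) throw new ArgumentNullException(nameof(source));
    if (n < 1) throw new ArgumentOutOfRangeException(nameof(n));
    comparer = comparer ?? Comparer<T>.Default;
    // Ties are broken on the index, so equal values are not rejected by the set.
    var top = new SortedSet<(T Item, long Index)>(Comparer<(T Item, long Index)>.Create(
        (x, y) =>
        {
            int c = comparer.Compare(x.Item, y.Item);
            return c != 0 ? c : x.Index.CompareTo(y.Index);
        }));
    long index = 0;
    foreach (var item in source)
    {
        top.Add((item, index++));
        if (top.Count > n) top.Remove(top.Min); // drop the smallest candidate
    }
    foreach (var entry in top.Reverse()) yield return entry.Item; // largest first
}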

An operator PartialSort with similar functionality exists in the MoreLinq package. It's not implemented optimally though (source code). It invariably performs a binary search for each item, instead of first comparing it with the smallest item in the top list, resulting in many more comparisons than necessary.

Surprisingly, LINQ itself is well optimized for the OrderByDescending+Take combination, resulting in excellent performance. It's only slightly slower than the TopN operator above. This applies to all versions of .NET Core and later (.NET 5 and .NET 6). It doesn't apply to the .NET Framework platform, where the complexity is O(n*log n) as expected.

A demo that compares 4 different approaches can be found here. It compares:

  1. values.OrderByDescending(x => x).Take(k).
  2. values.OrderByDescending(x => x).HideIdentity().Take(k), where HideIdentity is a trivial LINQ propagator that hides the identity of the underlying enumerable, and so it effectively disables the LINQ optimizations (a minimal sketch of such a propagator is shown after this list).
  3. values.PartialSort(k, MoreLinq.OrderByDirection.Descending) (MoreLinq).
  4. values.TopN(k)

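For reference, a HideIdentity propagator can be as simple as the sketch below (the demo's actual helper may differ slightly):

public static IEnumerable<T> HideIdentity<T>(this IEnumerable<T> source)
{
    // Re-yields the source items, so LINQ can no longer detect the underlying
    // IOrderedEnumerable<T> and apply its specialized Take optimization.
    foreach (var item in source) yield return item;
}
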
Below is a typical output of the demo, running in Release mode on .NET 6:

.NET 6.0.0-rtm.21522.10
Extract the 100 maximum elements from 2,000,000 random values, and calculate the sum.

OrderByDescending+Take              Duration:   156 msec, Comparisons:  3,129,640, Sum: 99.997344
OrderByDescending+HideIdentity+Take Duration: 1,415 msec, Comparisons: 48,602,298, Sum: 99.997344
MoreLinq.PartialSort                Duration:   277 msec, Comparisons: 13,999,582, Sum: 99.997344
TopN                                Duration:    62 msec, Comparisons:  2,013,207, Sum: 99.997344
Theodor Zoulias
  • 34,835
  • 7
  • 69
  • 104
  • I've opened an issue on the MoreLinq GitHub repository [here](https://github.com/morelinq/MoreLINQ/issues/840 "The PartialSort operator can be improved"). – Theodor Zoulias Jul 05 '22 at 13:16
  • There is also the [SuperLinq](https://github.com/viceroypenguin/SuperLinq) library, which is a fork of MoreLinq, and has an optimized `PartialSort` implementation (particularly for string keys). – Theodor Zoulias Oct 21 '22 at 01:27