22

Let's say I have a collection of some type, e.g.

IEnumerable<double> values;

Now I need to extract the k highest values from that collection, for some parameter k. A very simple way to do this is:

values.OrderByDescending(x => x).Take(k)

However, this (if I understand it correctly) first sorts the entire list and then picks the first k elements. But if the list is very large and k is comparatively small (smaller than log n), this is not very efficient: the list is sorted in O(n log n), but I figure selecting the k highest values from a list should be more like O(nk).

So, does anyone have any suggestion for a better, more efficient way to do this?

Theodor Zoulias
  • 34,835
  • 7
  • 69
  • 104
Henrik Berg
  • 519
  • 5
  • 21
  • 6
    This is known as a selection algorithm. See http://en.wikipedia.org/wiki/Selection_algorithm (it says "K smallest" but you can find the "K largest" by reversing the ordering comparison, of course). "Partial sorting" is a special case, which is more what you want: http://en.wikipedia.org/wiki/Partial_sorting – Matthew Watson Feb 26 '13 at 12:43
  • 1
    Related: [Fast Algorithm for computing percentiles to remove outliers](http://stackoverflow.com/questions/3779763/fast-algorithm-for-computing-percentiles-to-remove-outliers) – sloth Feb 26 '13 at 12:49
  • I guess another solution would be to sort it **when items are added** (instead of when accessing). That way, you avoid needing to sort it. – default Feb 26 '13 at 12:58
  • I nearly forgot, but +1 for realizing that `OrderBy(...).Take(...)` is inefficient. The number of times I've seen `OrderBy(...).First()` here and elsewhere is depressing. It would have been interesting if Microsoft had baked this into LINQ by making special overloads of `Take`, `First` etc. that worked on an `IOrderedEnumerable`. – Rawling Feb 26 '13 at 13:05
  • 1
    It's amazing that there isn't an easy to find C# implementation of a partial sort, especially considering it's built into C++'s STL library! – Matthew Watson Feb 26 '13 at 13:12

5 Answers

8

This gives a bit of a performance increase. Note that as written it collects the n smallest values in ascending order rather than the n largest in descending order, but you should be able to repurpose it (see the comments in the code):

static IEnumerable<double> TopNSorted(this IEnumerable<double> source, int n)
{
    List<double> top = new List<double>(n + 1);
    using (var e = source.GetEnumerator())
    {
        for (int i = 0; i < n; i++)
        {
            if (e.MoveNext())
                top.Add(e.Current);
            else
                throw new InvalidOperationException("Not enough elements");
        }
        top.Sort();
        while (e.MoveNext())
        {
            double c = e.Current;
            int index = top.BinarySearch(c);  // index of c if it is already present
            if (index < 0) index = ~index;    // otherwise ~index is the insertion point
            if (index < n)                    // if (index != 0)
            {
                top.Insert(index, c);
                top.RemoveAt(n);              // top.RemoveAt(0)
            }
        }
    }
    return top;  // return ((IEnumerable<double>)top).Reverse();
}
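
For convenience, here is the same method with the commented changes applied, returning the n largest values in descending order (a sketch assembled from the comments above; the hypothetical name TopNSortedDescending is mine):

static IEnumerable<double> TopNSortedDescending(this IEnumerable<double> source, int n)
{
    List<double> top = new List<double>(n + 1);
    using (var e = source.GetEnumerator())
    {
        // Seed the candidate list with the first n elements.
        for (int i = 0; i < n; i++)
        {
            if (e.MoveNext())
                top.Add(e.Current);
            else
                throw new InvalidOperationException("Not enough elements");
        }
        top.Sort();
        while (e.MoveNext())
        {
            double c = e.Current;
            int index = top.BinarySearch(c);
            if (index < 0) index = ~index;
            if (index != 0)           // c is larger than the current smallest candidate
            {
                top.Insert(index, c);
                top.RemoveAt(0);      // drop the smallest to keep n candidates
            }
        }
    }
    return ((IEnumerable<double>)top).Reverse(); // largest value first
}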
Rawling
  • 49,248
  • 7
  • 89
  • 127
  • Could also be an extension method to "work with LINQ" so to say. – default Feb 26 '13 at 12:53
  • And then it's not `O(n*k)` it's `O(n*k*k*logk)` something – default locale Feb 26 '13 at 12:54
  • @Default Whoops yes, I never bother when knocking these things together and I forgot to put it in :) – Rawling Feb 26 '13 at 12:58
  • @defaultlocale ... Is that a good thing? It seems faster at selecting 50 of 10k elements but I've not really thought about the n/k behaviour. (I'd have thought `n log k` as you're making `n` binary inserts into a group of size `k`.) – Rawling Feb 26 '13 at 12:59
  • @Rawling it looks like `O(n*k*logk)` so, it's close to OP's request. – default locale Feb 26 '13 at 13:02
  • @defaultlocale I tested it and it does look like you're right, now I need to figure out why :) – Rawling Feb 26 '13 at 13:21
  • Yes, looks like O(n*k*logk) because of the binary search through k elements repeated at each iteration, but anyway I think you can't avoid this. – Henrik Berg Feb 26 '13 at 13:49
  • Thanks for the answer, btw (and to everyone else for theirs)! – Henrik Berg Feb 26 '13 at 13:50
  • @Henrik But a binary search through `k` elements should only be `log k`, and there are `n` iterations... I still don't get where the extra factor of `k` comes from. (But from varying `k` it does look like it is there!) – Rawling Feb 26 '13 at 13:50
  • @Dzienny Does the `k+` come from the updating the internal list? I was just counting comparisons, but that could be the limiting factor instead. – Rawling Feb 26 '13 at 13:59
  • @Rawling Both `Insert` and `RemoveAt` are O(n) operations. – Dzienny Feb 26 '13 at 14:17
  • @Dzienny Yeah... I was under the impression that when measuring sorts you treated comparisons as the expensive part and discounted trivial things like shifting items about, but I guess not. Still, this doesn't explain why I'm seeing `nk logk` rather than just `nk`(The `+ logk` disappears asymptotically, right?) – Rawling Feb 26 '13 at 14:22
  • Bit late but assuming that `k = wanted sorted elements` and `n = number elements in source` (usual conventions) this is `O(k + k log k + (n - k) * (log k + 2k))`. `O(k)` is the first loop. The sort is `k log k`. The later loop is run `n-k` times and does a binary search (`log k`), and add/remove a value (`2k`). Usually you get `O(n + k log k)` for this which is asymptotically better. – Voo Jun 10 '14 at 02:22
2

Consider the below method:

static IEnumerable<double> GetTopValues(this IEnumerable<double> values, int count)
{
    var maxSet = new List<double>(Enumerable.Repeat(double.MinValue, count));
    var currentMin = double.MinValue;

    foreach (var t in values)
    {
        if (t <= currentMin) continue;   // not among the current top values
        maxSet.Remove(currentMin);       // drop the smallest candidate
        maxSet.Add(t);
        currentMin = maxSet.Min();       // recompute the smallest candidate (O(count))
    }

    return maxSet.OrderByDescending(i => i);
}

And the test program:

static void Main()
{
    const int SIZE = 1000000;
    const int K = 10;
    var random = new Random();

    var values = new double[SIZE];
    for (var i = 0; i < SIZE; i++)
        values[i] = random.NextDouble();

    // Test values
    values[SIZE/2] = 2.0;
    values[SIZE/4] = 3.0;
    values[SIZE/8] = 4.0;

    IEnumerable<double> result;

    var stopwatch = new Stopwatch();

    stopwatch.Start();
    result = values.OrderByDescending(x => x).Take(K).ToArray();
    stopwatch.Stop();
    Console.WriteLine(stopwatch.ElapsedMilliseconds);

    stopwatch.Restart();
    result = values.GetTopValues(K).ToArray();
    stopwatch.Stop();
    Console.WriteLine(stopwatch.ElapsedMilliseconds);
}

On my machine the results are 1002 ms and 14 ms.

Ryszard Dżegan
  • 24,366
  • 6
  • 38
  • 56
0

Another way of doing this (haven't been around C# for years, so pseudo-code it is, sorry) would be:

highestList = []                      // kept sorted, highest value first
lowestValueOfHigh = -infinity
for every item in the list
    if highestList.length < k
        insert item into highestList with binarysearch
        lowestValueOfHigh = highestList[highestList.length - 1]
    else if item > lowestValueOfHigh
        delete highestList[highestList.length - 1] from highestList
        insert item into highestList with binarysearch
        lowestValueOfHigh = highestList[highestList.length - 1]
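
In C# this idea could look roughly like the following sketch (the working list is kept in ascending order internally; the helper name TopKSorted is mine):

static List<double> TopKSorted(IEnumerable<double> values, int k)
{
    // Kept sorted ascending; element 0 is the lowest of the current highest values.
    var highest = new List<double>(k + 1);
    foreach (var item in values)
    {
        if (highest.Count < k || item > highest[0])
        {
            if (highest.Count == k)
                highest.RemoveAt(0);                // drop the lowest of the highs
            int index = highest.BinarySearch(item); // insert with binary search
            if (index < 0) index = ~index;
            highest.Insert(index, item);
        }
    }
    highest.Reverse(); // highest value first
    return highest;
}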
Sebastian van Wickern
  • 1,699
  • 3
  • 15
  • 31
0

I wouldn't state anything about performance without profiling. In this answer I'll just try to implement the O(n*k) take-one-enumeration-per-max-value approach. Personally I think the ordering approach is superior. Anyway:

public static IEnumerable<double> GetMaxElements(this IEnumerable<double> source)
{
    var usedIndices = new HashSet<int>();
    while (true)
    {
        int index = 0;
        int maxIndex = 0;
        double? maxValue = null;
        // One full enumeration of the source per extracted maximum.
        foreach (var current in source)
        {
            if ((!maxValue.HasValue || current > maxValue) && !usedIndices.Contains(index))
            {
                maxValue = current;
                maxIndex = index;
            }
            index++;
        }
        if (!maxValue.HasValue) break; // every element has already been returned
        usedIndices.Add(maxIndex);
        yield return maxValue.Value;
    }
}

Usage:

var biggestElements = values.GetMaxElements().Take(3);

Downsides:

  1. The method assumes that the source IEnumerable has a stable enumeration order (it is enumerated multiple times).
  2. The method uses additional memory/operations to keep track of the used indices.

Advantage:

  • You can be sure that it takes exactly one enumeration of the source to get each next max value.

See it running

default locale
  • 13,035
  • 13
  • 56
  • 62
0

Here is a Linqy TopN operator for enumerable sequences, based on the PriorityQueue<TElement, TPriority> collection:

/// <summary>
/// Selects the top N elements from the source sequence. The selected elements
/// are returned in descending order.
/// </summary>
public static IEnumerable<T> TopN<T>(this IEnumerable<T> source, int n,
    IComparer<T> comparer = default)
{
    ArgumentNullException.ThrowIfNull(source);
    if (n < 1) throw new ArgumentOutOfRangeException(nameof(n));
    PriorityQueue<bool, T> top = new(comparer); // the values themselves act as priorities; the bool element is a dummy
    foreach (var item in source)
    {
        if (top.Count < n)
            top.Enqueue(default, item);
        else
            top.EnqueueDequeue(default, item); // add the item and remove the smallest, keeping the best n
    }
    List<T> topList = new(top.Count);
    while (top.TryDequeue(out _, out var item)) topList.Add(item);
    for (int i = topList.Count - 1; i >= 0; i--) yield return topList[i];
}

Usage example:

IEnumerable<double> topValues = values.TopN(k);

The topValues sequence contains the k maximum values of values, in descending order. In case there are duplicates among the topValues, the order of the equal values is undefined (non-stable sort).

For a SortedSet<T>-based implementation that compiles on .NET versions earlier than .NET 6, you could look at the 5th revision of this answer.
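
For illustration only, here is a rough sketch of what a SortedSet<T>-based TopN could look like (not necessarily identical to that revision; it keeps duplicate values by breaking ties on the element's position, and assumes value tuples are available):

public static IEnumerable<T> TopNSortedSet<T>(this IEnumerable<T> source, int n,
    IComparer<T> comparer = null)
{
    if (source == null) throw new ArgumentNullException(nameof(source));
    if (n < 1) throw new ArgumentOutOfRangeException(nameof(n));
    comparer = comparer ?? Comparer<T>.Default;
    // Ties are broken on the index, so equal values are not rejected by the set.
    var top = new SortedSet<(T Item, long Index)>(Comparer<(T Item, long Index)>.Create(
        (x, y) =>
        {
            int c = comparer.Compare(x.Item, y.Item);
            return c != 0 ? c : x.Index.CompareTo(y.Index);
        }));
    long index = 0;
    foreach (var item in source)
    {
        top.Add((item, index++));
        if (top.Count > n) top.Remove(top.Min); // drop the smallest candidate
    }
    foreach (var entry in top.Reverse()) yield return entry.Item; // largest first
}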

An operator PartialSort with similar functionality exists in the MoreLinq package. It's not implemented optimally though (source code). It invariably performs a binary search for each item, instead of first comparing it with the smallest item in the top list, resulting in many more comparisons than necessary.

Surprisingly, LINQ itself is well optimized for the OrderByDescending+Take combination, resulting in excellent performance. It's only slightly slower than the TopN operator above. This applies to all versions of .NET Core and later (.NET 5 and .NET 6). It doesn't apply to the .NET Framework platform, where the complexity is O(n*log n) as expected.

A demo that compares 4 different approaches can be found here. It compares:

  1. values.OrderByDescending(x => x).Take(k).
  2. values.OrderByDescending(x => x).HideIdentity().Take(k), where HideIdentity is a trivial LINQ propagator that hides the identity of the underlying enumerable, and so it effectively disables the LINQ optimizations (a minimal sketch of such a propagator is shown after this list).
  3. values.PartialSort(k, MoreLinq.OrderByDirection.Descending) (MoreLinq).
  4. values.TopN(k)

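For reference, a HideIdentity propagator can be as simple as the sketch below (the demo's actual helper may differ slightly):

public static IEnumerable<T> HideIdentity<T>(this IEnumerable<T> source)
{
    // Re-yields the source items, so LINQ can no longer detect the underlying
    // IOrderedEnumerable<T> and apply its specialized Take optimization.
    foreach (var item in source) yield return item;
}
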
Below is a typical output of the demo, running in Release mode on .NET 6:

.NET 6.0.0-rtm.21522.10
Extract the 100 maximum elements from 2,000,000 random values, and calculate the sum.

OrderByDescending+Take              Duration:   156 msec, Comparisons:  3,129,640, Sum: 99.997344
OrderByDescending+HideIdentity+Take Duration: 1,415 msec, Comparisons: 48,602,298, Sum: 99.997344
MoreLinq.PartialSort                Duration:   277 msec, Comparisons: 13,999,582, Sum: 99.997344
TopN                                Duration:    62 msec, Comparisons:  2,013,207, Sum: 99.997344
Theodor Zoulias
  • 34,835
  • 7
  • 69
  • 104
  • I've opened an issue on the MoreLinq GitHub repository [here](https://github.com/morelinq/MoreLINQ/issues/840 "The PartialSort operator can be improved"). – Theodor Zoulias Jul 05 '22 at 13:16
  • There is also the [SuperLinq](https://github.com/viceroypenguin/SuperLinq) library, which is a fork of MoreLinq, and has an optimized `PartialSort` implementation (particularly for string keys). – Theodor Zoulias Oct 21 '22 at 01:27