Here is a Linqy TopN operator for enumerable sequences, based on the PriorityQueue<TElement, TPriority> collection:
/// <summary>
/// Selects the top N elements from the source sequence. The selected elements
/// are returned in descending order.
/// </summary>
public static IEnumerable<T> TopN<T>(this IEnumerable<T> source, int n,
    IComparer<T> comparer = default)
{
    // Note: this is an iterator method, so these checks run on first enumeration.
    ArgumentNullException.ThrowIfNull(source);
    if (n < 1) throw new ArgumentOutOfRangeException(nameof(n));

    // Min-heap ordered by the items themselves (used as priorities).
    // The bool element is an unused placeholder.
    PriorityQueue<bool, T> top = new(comparer);
    foreach (var item in source)
    {
        if (top.Count < n)
            top.Enqueue(default, item);
        else
            top.EnqueueDequeue(default, item); // Adds the item, then evicts the smallest
    }

    // The heap dequeues in ascending order, so emit the items in reverse.
    List<T> topList = new(top.Count);
    while (top.TryDequeue(out _, out var item)) topList.Add(item);
    for (int i = topList.Count - 1; i >= 0; i--) yield return topList[i];
}
Usage example:
IEnumerable<double> topValues = values.TopN(k);
The topValues sequence contains the k maximum values in values, in descending order. If topValues contains duplicate values, the relative order of the equal values is undefined (the sort is not stable).
For a SortedSet<T>-based implementation that compiles on .NET versions earlier than .NET 6, you could look at the 5th revision of this answer.
A PartialSort operator with similar functionality exists in the MoreLinq package. It's not implemented optimally though (source code). It invariably performs a binary search for each item, instead of first comparing it with the smallest item in the top list, resulting in many more comparisons than necessary.
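To illustrate the difference, here is a minimal sketch of the "compare with the smallest item first" idea on top of a sorted list. It is not MoreLinq's code; the PartialSortSketch class and the TopNSortedList method are hypothetical names made up for this example.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class PartialSortSketch
{
    // Hypothetical helper (not part of MoreLinq): returns the n largest
    // items in descending order, using a small ascending sorted list.
    public static List<T> TopNSortedList<T>(IEnumerable<T> source, int n,
        IComparer<T> comparer = null)
    {
        ArgumentNullException.ThrowIfNull(source);
        if (n < 1) throw new ArgumentOutOfRangeException(nameof(n));
        comparer ??= Comparer<T>.Default;

        List<T> top = new(n); // Ascending order; top[0] is the smallest retained item
        foreach (var item in source)
        {
            if (top.Count == n)
            {
                // A single comparison is enough to reject an item that
                // cannot enter the top list; no binary search is needed.
                if (comparer.Compare(item, top[0]) <= 0) continue;
                top.RemoveAt(0); // Evict the current smallest item
            }
            // Binary search only for items that are actually inserted.
            int index = top.BinarySearch(item, comparer);
            if (index < 0) index = ~index;
            top.Insert(index, item);
        }
        top.Reverse(); // Ascending to descending
        return top;
    }
}
```

When n is much smaller than the source, most items fail the first check, so the cost per rejected item is one comparison instead of roughly log2(n).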
Surprisingly, LINQ itself is well optimized for the OrderByDescending+Take combination, resulting in excellent performance. It's only slightly slower than the TopN operator above. This applies to all versions of .NET Core and later (.NET 5 and .NET 6). It doesn't apply to the .NET Framework platform, where the complexity is O(n*log n) as expected.
A demo that compares 4 different approaches can be found here. It compares:
1. values.OrderByDescending(x => x).Take(k).
2. values.OrderByDescending(x => x).HideIdentity().Take(k), where HideIdentity is a trivial LINQ propagator that hides the identity of the underlying enumerable, and so it effectively disables the LINQ optimizations.
3. values.PartialSort(k, MoreLinq.OrderByDirection.Descending) (MoreLinq).
4. values.TopN(k).
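The demo's HideIdentity code is not reproduced here. A propagator of that kind could plausibly look like the sketch below; the class name HideIdentityExtensions and the method body are assumptions, not the demo's actual implementation.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class HideIdentityExtensions
{
    // Assumed implementation: re-yields every item unchanged. The caller
    // receives a compiler-generated iterator instead of the original
    // object, so downstream operators cannot detect the source's type.
    public static IEnumerable<T> HideIdentity<T>(this IEnumerable<T> source)
    {
        foreach (var item in source) yield return item;
    }
}
```

With this in place, Take sees only a plain IEnumerable<T> rather than the ordered sequence returned by OrderByDescending, so it has to consume a fully sorted sequence.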
Below is a typical output of the demo, running in Release mode on .NET 6:
.NET 6.0.0-rtm.21522.10
Extract the 100 maximum elements from 2,000,000 random values, and calculate the sum.
OrderByDescending+Take              Duration:   156 msec, Comparisons:  3,129,640, Sum: 99.997344
OrderByDescending+HideIdentity+Take Duration: 1,415 msec, Comparisons: 48,602,298, Sum: 99.997344
MoreLinq.PartialSort                Duration:   277 msec, Comparisons: 13,999,582, Sum: 99.997344
TopN                                Duration:    62 msec, Comparisons:  2,013,207, Sum: 99.997344
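A comparison count like the one in this output can be collected by passing a counting comparer to each approach. The sketch below shows one way to do it; CountingComparer is a hypothetical name, and the demo may well count comparisons differently.

```csharp
using System.Collections.Generic;

// Hypothetical helper: wraps another comparer and counts how many times
// Compare is invoked. Not thread-safe; intended for single-threaded runs.
public sealed class CountingComparer<T> : IComparer<T>
{
    private readonly IComparer<T> _inner;

    public long Count { get; private set; }

    public CountingComparer(IComparer<T> inner = null)
        => _inner = inner ?? Comparer<T>.Default;

    public int Compare(T x, T y)
    {
        Count++;
        return _inner.Compare(x, y);
    }
}
```

An instance can then be passed as the comparer argument, for example values.TopN(k, counter) or values.OrderByDescending(x => x, counter).Take(k), and counter.Count read after the result has been fully enumerated.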