It's worth noting from the beginning that the fact that a given method is only documented as operating on `IEnumerable<T>` does not mean that it isn't optimised for particular implementations or derived interfaces. In fact a great many of the methods in `Enumerable` take different paths for different derived interfaces and/or concrete implementations. The classic example here is that `Count()` takes a different path if the `IEnumerable<T>` it is called on implements `ICollection<T>` or `ICollection`. There are several further examples of this in the full framework, and even more in .NET Core, including some which take optimised paths for the implementation of `IOrderedEnumerable<T>` you get from calling `OrderBy()`.
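To illustrate the kind of dispatch involved, here is a simplified sketch of the pattern `Count()` uses (my own simplified version, not the actual implementation, which has further special cases):

```csharp
using System.Collections;
using System.Collections.Generic;

static class CountSketch
{
    // Simplified sketch of the type-check dispatch Enumerable.Count() performs.
    public static int Count<T>(IEnumerable<T> source)
    {
        if (source is ICollection<T> genericCollection)
            return genericCollection.Count;    // O(1): List<T>, T[], HashSet<T>, etc.
        if (source is ICollection nonGenericCollection)
            return nonGenericCollection.Count; // O(1): non-generic fallback
        int count = 0;
        using (IEnumerator<T> e = source.GetEnumerator())
            while (e.MoveNext())
                count++;                       // O(n): walk the whole sequence
        return count;
    }
}
```

The method is declared against `IEnumerable<T>` alone, yet a `List<T>` passed to it never gets enumerated at all.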
Some of those optimisations are my doing, because my hobby these days is contributing to .NET Core, particularly to Linq, and particularly performance improvements (though obviously if I'm hacking on something I need to increase test coverage on the bits I'm touching, and when doing so turns up minor bugs they get prioritised over performance improvements).
When it comes to `IOrderedEnumerable<T>`, I've done things like change `.OrderBy(someLambda).Skip(j).Take(k)` (a common paging idiom) from O(n log n) time to compute and O(j + k) time to enumerate, to O(n + k log k) time to compute and O(k) time to enumerate; and `.OrderBy(someLambda).First()` from O(n) space and O(n log n) time to O(1) space and O(n) time; and so on.
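The `First()` improvement works because the first element of a stable sort by key is simply the first element carrying the minimal key, which a single pass can find without buffering or sorting. A rough sketch of the idea (the helper name is mine, not what's in Linq):

```csharp
using System;
using System.Collections.Generic;

static class OrderBySketch
{
    // Sketch: computes what source.OrderBy(keySelector).First() returns,
    // in O(1) space and O(n) time, via a single minimum-by-key scan.
    public static TSource FirstByOrder<TSource, TKey>(
        IEnumerable<TSource> source, Func<TSource, TKey> keySelector)
    {
        Comparer<TKey> comparer = Comparer<TKey>.Default;
        using IEnumerator<TSource> e = source.GetEnumerator();
        if (!e.MoveNext())
            throw new InvalidOperationException("Sequence contains no elements");
        TSource best = e.Current;
        TKey bestKey = keySelector(best);
        while (e.MoveNext())
        {
            TKey key = keySelector(e.Current);
            // Strict "less than" keeps the first of any key-tied elements,
            // matching OrderBy's stable sort followed by First().
            if (comparer.Compare(key, bestKey) < 0)
            {
                best = e.Current;
                bestKey = key;
            }
        }
        return best;
    }
}
```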
I might look at improving other methods, and of course if I don't it's quite possible someone else would.
If I do, I would not do it as you suggest.
Firstly, a separate overload for `IOrderedEnumerable<T>` would require adding a method to the public API while only covering some cases (maybe what we're given as an `IEnumerable<T>` is in fact an `IOrderedEnumerable<T>`). Much better to just have the one overload for `IEnumerable<T>` and detect the `IOrderedEnumerable<T>` case.
Secondly, to use binary search we would have to know the means by which the `IOrderedEnumerable<T>` was sorted. This is possible with the `OrderedEnumerable<TElement, TKey>` created by calls to `OrderBy()`, but not more generally.
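To make that concrete, here's a minimal sketch (my own names, not anything in Linq) of what a binary-search-based membership test would need: the search must use exactly the key comparison the sort used, and because distinct elements can share a key, every element in a run of equal keys may still need checking.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

static class BinarySearchSketch
{
    // Sketch: membership test over a buffer sorted by key, using binary search.
    // Only possible because we hold the keySelector the sort used.
    public static bool SortedContains<TElement, TKey>(
        IEnumerable<TElement> source, Func<TElement, TKey> keySelector, TElement item)
    {
        TElement[] buffer = source.ToArray();
        TKey[] keys = buffer.Select(keySelector).ToArray();
        // Sort buffer by key, as OrderBy would. (Array.Sort is unstable,
        // unlike OrderBy, but stability doesn't affect membership.)
        Array.Sort(keys, buffer);
        TKey itemKey = keySelector(item);
        int index = Array.BinarySearch(keys, itemKey);
        if (index < 0)
            return false; // no element even has a matching key
        // Distinct elements can share a key, and BinarySearch may land anywhere
        // in a run of equal keys, so scan the whole run in both directions:
        // this is the O(n) worst case when every element shares the same key.
        Comparer<TKey> cmp = Comparer<TKey>.Default;
        EqualityComparer<TElement> eq = EqualityComparer<TElement>.Default;
        for (int i = index; i >= 0 && cmp.Compare(keys[i], itemKey) == 0; i--)
            if (eq.Equals(buffer[i], item))
                return true;
        for (int i = index + 1; i < keys.Length && cmp.Compare(keys[i], itemKey) == 0; i++)
            if (eq.Equals(buffer[i], item))
                return true;
        return false;
    }
}
```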
Thirdly, it would not be the biggest possible gain.
The current costs of `source.OrderBy(someLambda).Contains(someItem)` are as follows:

- Buffer `source`: O(n) space, O(n) time.
- Sort the buffer: O(n log n) time (average; O(n²) worst case).
- Find an item that matches `someItem`, or confirm none exists: O(n) time.
If `Contains()` were optimised to use binary search, that would become:

- Buffer `source`: O(n) space, O(n) time.
- Sort the buffer: O(n log n) time (average; O(n²) worst case).
- Find an item that matches `someItem`, or confirm none exists: O(log n) time (average; O(n) worst case, because a matching item may sort at the same level as all the other elements and have to be compared with each of them).
However, that's a complete waste. If we want to optimise `Contains()` (and a great many other aggregate methods, for that matter) the optimal strategy would be:
- Call `source.Contains(someItem)` and return the result. This will at worst be O(n) time and O(1) space, though it may be O(log n) or O(1) time if `source` is, for example, a `HashSet<T>` (a case that `Contains()` is already optimised for). In both theory and practice it will end up being faster than the buffering step above.
Implementing that change would be considerably less work, and a much bigger gain.
I've considered this, and might indeed submit such a PR, but I'm not yet sure if on balance it's worth it (and hence what my opinion would be if someone else submits such a PR), since it's almost always easier for the caller to turn `….OrderBy(foo).Contains(bar)` into `.Contains(bar)` themselves, and the check needed to optimise for such a case would be cheap, but not entirely free.
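That caller-side rewrite is safe precisely because ordering never changes which elements a sequence contains. A trivial illustration (the method names are mine):

```csharp
using System.Collections.Generic;
using System.Linq;

static class ContainsRewrite
{
    // The version with the redundant OrderBy: buffers and sorts first.
    public static bool SlowContains(IEnumerable<int> source, int item) =>
        source.OrderBy(x => x).Contains(item); // O(n) space, O(n log n) time

    // The rewrite: the OrderBy cannot affect the answer, so drop it.
    public static bool FastContains(IEnumerable<int> source, int item) =>
        source.Contains(item);                 // O(1) space, at worst O(n) time
}
```

The two always agree on the result; only the cost differs.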